Open maxwbuckley opened 1 month ago
Hey @maxwbuckley !
Thanks for raising this question. Indeed, Chameleon differs from other models we have in the library in that the images also get discretized. Yes, doing the VQ-VAE step during processing seems more intuitive, but the catch is that processing is supposed to stay lightweight and not load any model weights.
If one needs to tokenize the image and not use autoregressive generation, the VQ-VAE part is indeed more involved. One would need to load the VAE model and set up a vocab mapping to map VAE tokens to BPE tokens. Imo we can try to make the VAE a separate, easily loadable module with a one-line from_pretrained. But moving the whole VAE model into processing would go against the general library philosophy. LMK what you think about it.
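For concreteness, what I have in mind is roughly the sketch below. Nothing in it exists today; the ChameleonVQVAE.from_pretrained / encode_to_bpe names are purely illustrative of the proposed one-liner:

```python
# Hypothetical sketch only -- a standalone, easily loadable VQ-VAE module.
from PIL import Image
from transformers import ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
image = Image.open("cat.png")
inputs = processor(text="Hi <image>", images=image, return_tensors="pt")

# Imagined one-liner: load just the image tokenizer (VQ-VAE) weights plus the
# image-token -> BPE-token vocabulary mapping. This entry point does not exist today.
vqvae = ChameleonVQVAE.from_pretrained("facebook/chameleon-7b")     # hypothetical
image_bpe_tokens = vqvae.encode_to_bpe(inputs["pixel_values"])      # hypothetical method
```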
Understood!
Thanks for the explanation. Can I add some more documentation to chameleon.md to make it a bit clearer how this works?
@maxwbuckley sure thing, better documentation would be nice :)
Feature request
Moving the VQ-VAE and BPE mapping into ChameleonProcessor
I am relatively new to Hugging Face and the Transformers library, but I have many years of ML and software engineering experience, so I expect there are probably legacy reasons for this unintuitive behavior. Still, I thought I would share my thoughts and offer to contribute the changes if you are interested.
Background
Chameleon operates on a sequence of tokens, so as a user you need to supply the sequence of input tokens to make a prediction. But getting that sequence of tokens is a bit involved. For text we use a text tokenizer; images undergo multiple transformations: center cropping, resizing, normalizing, etc. The center-cropped, resized, normalized image is then fed to a Vector Quantized Variational AutoEncoder (VQ-VAE) to generate image tokens, which are then mapped to Byte Pair Encoding (BPE) tokens.
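For reference, the end-to-end flow today looks roughly like this (the model id, dtype, and image are only for illustration; the key point is that the VQ-VAE runs inside the model call, not in the processor):

```python
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("cat.png")  # any local image
prompt = "What do you see in this image?<image>"

# The processor tokenizes the text and center-crops/resizes/normalizes the image,
# but it does NOT run the VQ-VAE: the image comes back as pixel_values.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)

# The VQ-VAE encoding and BPE mapping only happen inside the model call.
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```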
Luckily, the ChameleonProcessor provides us with an easy way to make the tokens, or does it?
Intuitively the ChameleonProcessor should give me the sequence of tokens I need to feed to Chameleon....
What does it give me?
The ChameleonProcessor, when called on a string and a list of images, creates a dict with "input_ids", "attention_mask", and "pixel_values".
What are these 1026 tokens for every "<image>"?
They are duplicated placeholder image tokens (1024 replications of the "8711" token) plus two sentinel tokens (beginning and end of image)
I.e. the VQ-VAE encoding, the BPE mapping for the images, and the token substitution are all left to later, at the call to the Chameleon model.
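This is easy to verify by counting the placeholder id in the processor output. A small sketch, continuing from the snippet above (the id is read from the tokenizer rather than hard-coding 8711, since it could differ between checkpoints):

```python
inputs = processor(text="Describe this image.<image>", images=image, return_tensors="pt")

print(list(inputs.keys()))  # expect: input_ids, attention_mask, pixel_values

image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_placeholders = (inputs["input_ids"] == image_token_id).sum().item()
print(image_token_id, num_placeholders)  # 8711 and 1024 placeholders for one <image>
```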
What would I expect?
I would expect the ChameleonProcessor to finish its task, i.e. to also apply the VQ-VAE and the BPE mapping, and give me back the actual input_ids that will be passed to the model.
I would expect that the input_ids sequence contains the actual image tokens from the VQ-VAE, and not a sequence of 1024 placeholder tokens to be swapped in later based on the pixel_values during the call to the ChameleonModel.
Alternatively, there could be some kind of middle step that combines the "input_ids" and "pixel_values" and gives back the actual "input_ids", i.e. those that will be passed to the Chameleon model.
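To make the proposal concrete, here is a rough sketch of such a middle step, continuing from the snippets above. It is purely illustrative: merge_image_tokens is a hypothetical helper, and it leans on the get_image_tokens logic that currently lives inside ChameleonModel, whose exact name and location may differ between library versions:

```python
def merge_image_tokens(input_ids, pixel_values, model, image_token_id):
    """Hypothetical middle step: swap the placeholder tokens for the real
    VQ-VAE/BPE image tokens so the caller sees the final input_ids."""
    # get_image_tokens runs the VQ-VAE and maps image codes to BPE token ids.
    # In current transformers it lives on ChameleonModel (hence model.model);
    # the exact name/signature may differ between versions.
    image_tokens = model.model.get_image_tokens(pixel_values)
    placeholder_mask = input_ids == image_token_id
    return input_ids.masked_scatter(placeholder_mask, image_tokens.to(input_ids.dtype))

# Usage sketch: build the real token sequence once, then reuse it without
# re-running the VQ-VAE on every model call.
real_input_ids = merge_image_tokens(
    inputs["input_ids"].to(model.device),
    inputs["pixel_values"].to(model.device, dtype=model.dtype),
    model,
    image_token_id=processor.tokenizer.convert_tokens_to_ids("<image>"),
)
output = model.generate(input_ids=real_input_ids, max_new_tokens=30)
```

This is also where the latency benefit mentioned below would come from: the expensive VQ-VAE pass runs once per image instead of on every model call.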
Benefits
Mostly, this would be more intuitive for users: they would get the actual input_ids and not have any mysterious swapping of tokens inside their model call.
It would also bring latency reductions in some applications, as the model call would not need to reprocess all the images every time; i.e. you would not need to run the VQ-VAE, the BPE mapping, and the token replacement on every call, which may be worthwhile when the same images are reused.
Motivation
I am doing a project on crafting adversarial attacks against Chameleon, found this behaviour unintuitive, and thought it would be useful to fix it upstream for everyone.
Your contribution
I can do the coding if you are interested. If you are open to it, I can have a chat with some of your staff and send in the changes.