Open maxwbuckley opened 1 month ago
Hey @maxwbuckley !
Thanks for raising this question. Indeed, Chameleon differs from other models we have in the library in that the images also get discretized. Yes, doing the VQ-VAE step during processing seems more intuitive, but the catch is that processing is supposed to stay lightweight and not load any model weights.
If one needs to tokenize the image and not use autoregressive generation, the VQ-VAE part is indeed more involved. One would need to load the VAE model and set up a vocab mapping to map VAE tokens to BPE tokens. Imo we can try to make the VAE a separate, easily loadable module with a one-line from_pretrained. But moving the whole VAE model into processing would go against the general library philosophy. LMK what you think about it.
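For concreteness, what I have in mind is roughly the sketch below. Nothing in it exists today; the ChameleonVQVAE.from_pretrained / encode_to_bpe names are purely illustrative of the proposed one-liner:

```python
# Hypothetical sketch only -- a standalone, easily loadable VQ-VAE module.
from PIL import Image
from transformers import ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
image = Image.open("cat.png")
inputs = processor(text="Hi <image>", images=image, return_tensors="pt")

# Imagined one-liner: load just the image tokenizer (VQ-VAE) weights plus the
# image-token -> BPE-token vocabulary mapping. This entry point does not exist today.
vqvae = ChameleonVQVAE.from_pretrained("facebook/chameleon-7b")     # hypothetical
image_bpe_tokens = vqvae.encode_to_bpe(inputs["pixel_values"])      # hypothetical method
```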
Understood!
Thanks for the explanation. Can I add some more documentation to chameleon.md to make it a bit clearer how this works?
@maxwbuckley sure thing, better documentation would be nice :)
Feature request
Moving the VQ-VAE and BPE mapping into ChameleonProcessor
I am relatively new to Hugging Face and the Transformers library, but I have many years of ML and software engineering experience, so I expect there are probably legacy reasons for this unintuitive behavior. Still, I thought I would share my thoughts and offer to contribute the changes if you are interested.
Background
Chameleon operates on a sequence of tokens, so as a user you need to supply the sequence of input tokens to make a prediction. But getting that sequence of tokens is a bit involved. For text we use a text tokenizer; images undergo multiple transformations: center cropping, resizing, normalizing, etc. The center-cropped, resized, normalized image is then fed to a Vector Quantized Variational AutoEncoder (VQ-VAE) to generate image tokens, which are then mapped to Byte Pair Encoding (BPE) tokens.
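For reference, the end-to-end flow today looks roughly like this (the model id, dtype, and image are only for illustration; the key point is that the VQ-VAE runs inside the model call, not in the processor):

```python
import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
model = ChameleonForConditionalGeneration.from_pretrained(
    "facebook/chameleon-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("cat.png")  # any local image
prompt = "What do you see in this image?<image>"

# The processor tokenizes the text and center-crops/resizes/normalizes the image,
# but it does NOT run the VQ-VAE: the image comes back as pixel_values.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, dtype=torch.bfloat16
)

# The VQ-VAE encoding and BPE mapping only happen inside the model call.
output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0], skip_special_tokens=True))
```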
Luckily, the ChameleonProcessor provides us with an easy way to make the tokens, or does it?
Intuitively the ChameleonProcessor should give me the sequence of tokens I need to feed to Chameleon....
What does it give me?
The ChameleonProcessor, when called on a string and a list of images, creates a dict with "input_ids", "attention_mask", and "pixel_values".
What are these 1026 tokens for every "<image>"?
They are duplicated placeholder image tokens (1024 replications of the "8711" token) plus two sentinel tokens (beginning and end of image)
I.e. the VQ-VAE encoding, the BPE mapping for the images, and the token substitution are all left to later, at the call to the Chameleon model.
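This is easy to verify by counting the placeholder id in the processor output. A small sketch, continuing from the snippet above (the id is read from the tokenizer rather than hard-coding 8711, since it could differ between checkpoints):

```python
inputs = processor(text="Describe this image.<image>", images=image, return_tensors="pt")

print(list(inputs.keys()))  # expect: input_ids, attention_mask, pixel_values

image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_placeholders = (inputs["input_ids"] == image_token_id).sum().item()
print(image_token_id, num_placeholders)  # 8711 and 1024 placeholders for one <image>
```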
What would I expect?
I would expect the ChameleonProcessor to finish its task, i.e. to also apply the VQ-VAE and the BPE mapping, and give me back the actual input_ids that will be passed to the model.
I would expect that the input_ids sequence contains the actual image tokens from the VQ-VAE, and not a sequence of 1024 placeholder tokens to be swapped in later based on the pixel_values during the call to the ChameleonModel.
Alternatively, there could be some kind of middle step that combines the "input_ids" and "pixel_values" and gives back the actual "input_ids", i.e. those that will be passed to the Chameleon model.
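To make the proposal concrete, here is a rough sketch of such a middle step, continuing from the snippets above. It is purely illustrative: merge_image_tokens is a hypothetical helper, and it leans on the get_image_tokens logic that currently lives inside ChameleonModel, whose exact name and location may differ between library versions:

```python
def merge_image_tokens(input_ids, pixel_values, model, image_token_id):
    """Hypothetical middle step: swap the placeholder tokens for the real
    VQ-VAE/BPE image tokens so the caller sees the final input_ids."""
    # get_image_tokens runs the VQ-VAE and maps image codes to BPE token ids.
    # In current transformers it lives on ChameleonModel (hence model.model);
    # the exact name/signature may differ between versions.
    image_tokens = model.model.get_image_tokens(pixel_values)
    placeholder_mask = input_ids == image_token_id
    return input_ids.masked_scatter(placeholder_mask, image_tokens.to(input_ids.dtype))

# Usage sketch: build the real token sequence once, then reuse it without
# re-running the VQ-VAE on every model call.
real_input_ids = merge_image_tokens(
    inputs["input_ids"].to(model.device),
    inputs["pixel_values"].to(model.device, dtype=model.dtype),
    model,
    image_token_id=processor.tokenizer.convert_tokens_to_ids("<image>"),
)
output = model.generate(input_ids=real_input_ids, max_new_tokens=30)
```

This is also where the latency benefit mentioned below would come from: the expensive VQ-VAE pass runs once per image instead of on every model call.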
Benefits
Mostly, this would be more intuitive for users: they would get the actual input_ids and not have any mysterious swapping of tokens inside their model call.
It would also bring latency reductions in some applications, as the model call would not need to reprocess all the images every time; i.e. you would not need to run the VQ-VAE, the BPE mapping, and the token replacement on every call, which may be worthwhile when the same images are reused.
Motivation
I am doing a project on crafting adversarial attacks against Chameleon, found this behaviour unintuitive, and thought it would be useful to fix it upstream for everyone.
Your contribution
I can do the coding if you are interested. If you are open to it, I can have a chat with some of your staff and send in the changes.