vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[New Model]: Chameleon support #5721

Open · nopperl opened this issue 3 weeks ago

nopperl commented 3 weeks ago

The model to consider.

https://huggingface.co/facebook/chameleon (as of now, the models can be downloaded after submitting the access request form on the model page)

Chameleon is an interesting multimodal model architecture based on Llama 2. It adds image inputs and outputs to Llama 2 by tokenizing images with a VQ-VAE and adding the codebook to Llama's tokenizer vocabulary. In principle, it supports arbitrary combinations of text and images as both input and output. However, the released models were finetuned to prevent image generation.
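To illustrate the idea (this is a minimal sketch, not Chameleon's actual code), the discrete-code path looks roughly like this. The `vqvae` object, its `encode` method, and the vocabulary/codebook sizes below are placeholder assumptions:

```python
import torch

# Placeholder sizes; the real Chameleon tokenizer and VQ-VAE define their own.
TEXT_VOCAB_SIZE = 65_536   # assumed size of the text vocabulary
CODEBOOK_SIZE = 8_192      # assumed number of VQ-VAE codebook entries


def image_to_token_ids(image: torch.Tensor, vqvae) -> list[int]:
    """Encode an image into discrete codebook indices and shift them into
    the vocabulary range reserved for image tokens."""
    with torch.no_grad():
        codes = vqvae.encode(image)  # hypothetical: returns codebook indices
    return [TEXT_VOCAB_SIZE + int(c) for c in codes.flatten()]


def interleave(text_ids: list[int], image_ids: list[int]) -> list[int]:
    # The combined sequence can be fed to a Llama-style decoder unchanged,
    # since image tokens are just additional vocabulary entries.
    return text_ids + image_ids
```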

The closest model vllm already supports.

LlamaForCausalLM

What's your difficulty of supporting the model you want?

For text-to-text support, the implementation should be fairly straightforward. The model is based on Llama 2 with the following differences:

To enable image inputs, image tokenization using the provided VQ-VAE needs to be added.
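If such an image tokenizer existed, one way to exercise image inputs without any vLLM-side image preprocessing would be to pass the pre-tokenized, interleaved sequence directly. A rough sketch under those assumptions (the checkpoint name is hypothetical and the exact `generate()` call shape may vary between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint name; the weights were not yet public at the time of writing.
llm = LLM(model="facebook/chameleon-7b")

# Placeholder IDs: in practice these would come from the text tokenizer and a
# VQ-VAE image tokenizer like the one sketched above.
text_ids = [1, 15043, 3186]
image_ids = [65_536 + i for i in range(16)]

outputs = llm.generate(
    {"prompt_token_ids": text_ids + image_ids},
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```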

Further info:

mgoin commented 3 weeks ago

Apparently there is ongoing work by HF to get this architecture into transformers: https://github.com/huggingface/transformers/issues/31505. Once that implementation is merged, we can look at porting it over.

ywang96 commented 3 weeks ago

Chameleon PR on transformers to track: https://github.com/huggingface/transformers/pull/31534

ywang96 commented 3 weeks ago

I made a PR (https://github.com/vllm-project/vllm/pull/5770) based on the transformers PR that adds initial text-only support for this model, and will wait for the weights to be released on HuggingFace to verify the implementation.

For the VQ-VAE, I plan to add it in a follow-up PR if that makes sense.
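For reference, text-only usage through that PR would presumably look like any other decoder-only model in vLLM. A minimal sketch, assuming the weights land under a name like `facebook/chameleon-7b` (not confirmed at the time of writing):

```python
from vllm import LLM, SamplingParams

# Assumed checkpoint name; substitute the actual repo once the weights are released.
llm = LLM(model="facebook/chameleon-7b")

outputs = llm.generate(
    ["Summarize the Chameleon architecture in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```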