meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama for WhatsApp & Messenger.

batch inference for the multi-modal llama #701

Closed · deven367 closed this issue 1 week ago

deven367 commented 1 week ago

🚀 The feature, motivation and pitch

The current recipe for multi-modal inference can only be used with a single image at a time:

python multi_modal_infer.py --image_path "./resources/image.jpg" --prompt_text "Describe this image" --temperature 0.5 --top_p 0.8 --model_name "meta-llama/Llama-3.2-11B-Vision-Instruct"

from here → multimodal-inference

I wish to use this model for running inference on over 1M images.

I was playing around with the MllamaProcessor object and was able to process multiple images at once; however, it is not clear to me how the conversation (built with apply_chat_template) should be constructed in this case.

This was my attempt at batch inference (I've taken most of the code from the multimodal inference script):


# assumed imports: load_model_and_processor and process_image come from the
# repo's multi_modal_infer.py script; globtastic is from fastcore
from fastcore.xtras import globtastic
from multi_modal_infer import load_model_and_processor, process_image

DEFAULT_MODEL = "meta-llama/Llama-3.2-11B-Vision-Instruct"
vid_path = "./resources"  # placeholder: directory containing the .jpg images

# define model and processor
model, processor = load_model_and_processor(DEFAULT_MODEL)

# collect and preprocess the first 10 images
files = globtastic(vid_path, file_glob='*.jpg').sorted()
processed = [process_image(f) for f in files[:10]]

# repeat the single-image conversation 10 times and flatten it into one prompt
prompt_text = 'Describe the image'
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text}]}
]
prompt = processor.apply_chat_template(conversation * 10, add_generation_prompt=True, tokenize=False)

# pass all 10 processed images together with the concatenated prompt
inputs = processor(processed, prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, temperature=0.5, top_p=0.8, max_new_tokens=2048)
print(processor.decode(outputs[0])[len(prompt):])

The code snippet does run, but it does not describe all 10 images, maybe because of the max_new_tokens limit. Any thoughts, @init27?
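
For what it's worth, here is the variant I was considering next: a rough, untested sketch that gives each image its own copy of the prompt and batches them through the processor. It assumes MllamaProcessor accepts a list of texts and a nested list of images (one inner list per sample) and pads them with padding=True; all other names are reused from the snippet above.

# Rough, untested sketch: one prompt per image, batched through the processor.
batch = processed[:10]

single_conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt_text}]}
]
single_prompt = processor.apply_chat_template(single_conversation, add_generation_prompt=True, tokenize=False)

texts = [single_prompt] * len(batch)   # same prompt repeated for every image
images = [[img] for img in batch]      # one image per sample
inputs = processor(images=images, text=texts, padding=True, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, temperature=0.5, top_p=0.8, max_new_tokens=256)

# drop the prompt tokens before decoding, so only the generated text remains
generated = outputs[:, inputs["input_ids"].shape[1]:]
for caption in processor.batch_decode(generated, skip_special_tokens=True):
    print(caption)

Since every sample uses the same prompt here, padding should be a no-op; for prompts of different lengths, the tokenizer would generally need left padding for batched generation.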

Alternatives

Would it be better to simply iterate over the images one at a time for inference?
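
If it helps, the alternative I had in mind is just a plain loop over the images, reusing the single-image path from the recipe. Untested sketch, same helper names as above and the single-image prompt from the previous sketch:

# Untested sketch of the one-image-at-a-time alternative.
results = {}
for f in files:
    image = process_image(f)
    inputs = processor(image, single_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, temperature=0.5, top_p=0.8, max_new_tokens=512)
    # the decoded string still starts with the prompt text, so slice it off
    results[f] = processor.decode(output[0])[len(single_prompt):]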

Additional context

No response

init27 commented 1 week ago

@deven367 Great to hear from you again, and excited that you are building with the 3.2 models.

Our recommendation is to run inference with 1 image at a time; you might see a degradation in response quality with more images.

At this time, HF only supports "chatting with" one image at a time.

Here is a WIP example that you can use for multi-GPU labelling.
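
To give a rough idea of the shape of that approach, here is a minimal sketch (not the WIP example itself): it assumes one model replica per GPU process, accelerate's PartialState to shard the file list, and the helper names from the repo's multi_modal_infer.py. label_images.py and IMAGE_DIR are placeholders.

# Minimal sketch of data-parallel labelling: launch one process per GPU, e.g.
#   accelerate launch --num_processes 4 label_images.py
# Each process loads its own model replica and handles a shard of the files.
import glob
from accelerate import PartialState
from multi_modal_infer import load_model_and_processor, process_image

IMAGE_DIR = "./images"  # placeholder

state = PartialState()
model, processor = load_model_and_processor("meta-llama/Llama-3.2-11B-Vision-Instruct")
model.to(state.device)  # assumes the helper returns a single-device model

prompt = processor.apply_chat_template(
    [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image"}]}],
    add_generation_prompt=True,
    tokenize=False,
)

files = sorted(glob.glob(f"{IMAGE_DIR}/*.jpg"))
with state.split_between_processes(files) as shard:
    for f in shard:
        inputs = processor(process_image(f), prompt, return_tensors="pt").to(state.device)
        out = model.generate(**inputs, temperature=0.5, top_p=0.8, max_new_tokens=256)
        caption = processor.decode(out[0])[len(prompt):]
        # persist the caption keyed by filename, e.g. append a JSON line per image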

Let me know if you have Qs!

deven367 commented 1 week ago

Thanks a lot @init27, this really helps! I might bug you again in case I run into something 😅

deven367 commented 1 week ago

@init27 I've run into an interesting problem again. I ran the multi-GPU labelling script and hit a speed issue: on a node with 4x A100 40GB, I was only able to annotate ~18k images in 24 hours. At that rate, I won't be able to get through over 1M images. I was reading some notes in the transformers docs about torch.compile, which could speed up inference. Do you think it's a path worth exploring? TIA! :)
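
Concretely, the notes suggest pairing a static KV cache with a compiled forward pass. A rough, untested sketch of what I would try, reusing the names from my earlier snippet:

# Untested sketch following the transformers notes on static cache + torch.compile.
# The vision tower may introduce graph breaks, so this might need fullgraph=False.
import torch

model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# the first few calls pay the compilation cost; later calls reuse the compiled graph
for f in files[:3]:
    inputs = processor(process_image(f), prompt, return_tensors="pt").to(model.device)
    _ = model.generate(**inputs, temperature=0.5, top_p=0.8, max_new_tokens=256)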

init27 commented 1 week ago

That's an interesting idea! Yes, I believe it's an experiment worth running; let us know how it goes!