@deven367 Great to hear from you again, and excited that you are building with the 3.2 models!
Our recommendation is to run inference with one image at a time; you may see degraded response quality with more images.
At this time, HF supports "chatting" with only one image at once.
Here is a WIP example that you can use for multi-GPU labelling.
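In spirit, a one-process-per-GPU labelling loop might look something like the sketch below (this is not the actual WIP script; the checkpoint id, prompt, and image location are assumptions):

```python
import glob

import torch
import torch.multiprocessing as mp
from PIL import Image
from transformers import MllamaForConditionalGeneration, MllamaProcessor

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint

def worker(rank, world_size, image_paths):
    device = f"cuda:{rank}"
    model = MllamaForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16
    ).to(device)
    processor = MllamaProcessor.from_pretrained(MODEL_ID)

    # Single-image conversation, per the recommendation above.
    conversation = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

    # Each rank labels a strided shard of the dataset.
    for path in image_paths[rank::world_size]:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=128)
        print(rank, path, processor.decode(out[0], skip_special_tokens=True))

if __name__ == "__main__":
    paths = sorted(glob.glob("images/*.jpg"))  # hypothetical location
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus, paths), nprocs=n_gpus)
```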
Let me know if you have Qs!
Thanks a lot @init27, this really helps! I might bug you again if I run into something 😅
@init27 I've run into another interesting problem: speed. I ran the multi-GPU labelling script, and on a node with 4x A100 40GB I was only able to annotate ~18k images in 24 hours. At this rate, I won't be able to get through over 1M images. I was reading the transformers docs and saw some notes on torch.compile, which could speed up inference. Do you think it's a path worth exploring? TIA! :)
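For reference, a minimal sketch of what wiring in torch.compile might look like here (the checkpoint id is assumed, and how much it actually speeds up Mllama generation is exactly the open question):

```python
import torch
from transformers import MllamaForConditionalGeneration, MllamaProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = MllamaProcessor.from_pretrained(model_id)

# Compile the forward pass; "reduce-overhead" targets the per-token kernel
# launch overhead that dominates autoregressive decoding.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

# The transformers docs pair compile with a static KV cache to avoid
# recompilations; whether Mllama's cross-attention cache supports this
# is worth verifying on your version.
model.generation_config.cache_implementation = "static"

# Then call model.generate(...) as usual; expect the first few calls to be
# slow while compilation warms up.
```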
Yes, I believe that would be an interesting experiment. Let us know how it goes!
🚀 The feature, motivation and pitch
The current recipe for multimodal inference, from here → multimodal-inference, can only be used with a single image at a time.
I wish to use this model for running inference on over 1M images.
I was playing around with the `MllamaProcessor` object and was able to process multiple images at once; however, it is not clear to me how the conversation (built with `apply_chat_template`) should be used in this case. This was my attempt at batch inference (I've taken most of the code from the multimodal inference script).
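A minimal sketch of this kind of batched call, for concreteness (the image paths, prompt text, and token budget below are placeholders, not the exact snippet from the issue):

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, MllamaProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # assumed checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = MllamaProcessor.from_pretrained(model_id)

images = [Image.open(f"img_{i}.jpg").convert("RGB") for i in range(10)]

# One conversation carrying all ten images ahead of a single text turn.
conversation = [{"role": "user", "content": (
    [{"type": "image"} for _ in images]
    + [{"type": "text", "text": "Describe each of these images."}]
)}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device)

# Ten descriptions can easily exhaust a modest token budget, so the
# output may stop before covering every image.
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```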
The code snippet does run, but it is not able to describe all 10 images, possibly due to the `max_new_tokens` limit. Any thoughts, @init27?

Alternatives
Would it be better to iterate over the images one at a time for inference?
Additional context
No response