meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Demo apps to showcase Meta Llama for WhatsApp & Messenger.

I couldn't find the code about Video Encoder in llama 3.2 vision #795


blurmemo commented 5 days ago

Llama 3.2 Vision is great work! I am doing some interesting work based on Llama 3.2 Vision. I have read the paper about it, but I have a very important question to ask.

Below is an image of the model architecture for image-text input: [architecture diagram]

Question 1: Can I input only an image and an answer, with no text?

Question 2: For video input, after the Image Encoder, the encoding results are sent to a video branch. I couldn't find the code that handles the Image Encoder output for the video branch (the red box in the image above) in the HuggingFace implementation (the implementation is in the HuggingFace transformers repository, and the Llama 3.2 Vision model path is "transformers/src/transformers/models/mllama"). Can you help me locate that code?

I really look forward to getting your help eagerly, thank you!

HamidShojanazeri commented 5 days ago

@blurmemo thanks for your interest. The paper describes the overall vision for the Llama 3 family of models; Llama 3.2 is image reasoning only.

re 1: you need to send the image and prompt as suggested here.

re 2: Llama 3.2 only works with images, and only one image at a time.
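
For reference, a minimal single-image inference sketch with the HuggingFace transformers Mllama integration looks roughly like this (the model id, image path, and prompt below are just placeholders, and it assumes transformers >= 4.45):

```python
# Minimal sketch: single image + text prompt through the HuggingFace Mllama classes.
# The image file and prompt text are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```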

Hope that helps to clarify a bit.

blurmemo commented 5 days ago


@HamidShojanazeri Thank you for your help! I think I may not have expressed my two questions very clearly.

For question 1: when I fine-tune based on Llama-3.2-Vision, I want to construct my dataset from image-text pairs. Each image is a natural scene and the text is only a description/the content/something else about the image, so I construct the raw, unprocessed data as follows:

```
[
  {
    "images": image,
    "texts": [
      { "assistant": "this is image description_1" },
      { "assistant": "this is image content" },
      { "assistant": "this is other" }
    ]
  },
  ...
]
```

I do not set a "user": "this is question or other" or "system": "criterion or other" key-value pair for any entry under the "texts" key. So I want to know whether a "system": "criterion or other" pair is supported, and whether a "user": "this is question or other" pair must be added to every text entry when I fine-tune, as shown below:

```
[
  {
    "images": image,
    "texts": [
      { "system": "criterion or other" },
      { "user": "" or "this is question or other", "assistant": "this is image description_1" },
      { "user": "" or "this is question or other", "assistant": "this is image content" },
      { "user": "" or "this is question or other", "assistant": "this is other" }
    ]
  },
  ...
]
```
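
For context, here is roughly how I imagine mapping these raw samples into the system/user/assistant message format that the processor's chat template expects. This is only my own sketch, not anything from llama-recipes; the `to_messages` helper and the default user prompt are hypothetical:

```python
# My own sketch: convert one raw sample (assistant-only "texts") into a
# system/user/assistant message list. The helper name and default prompt are hypothetical.
def to_messages(sample, default_user_prompt="Describe the image."):
    messages = []
    for turn in sample["texts"]:
        if "system" in turn:
            messages.append({"role": "system",
                             "content": [{"type": "text", "text": turn["system"]}]})
            continue
        # Fall back to a generic user prompt when the raw data has no "user" text.
        user_text = turn.get("user") or default_user_prompt
        content = [{"type": "text", "text": user_text}]
        # Attach the image token to the first user turn only.
        if not any(m["role"] == "user" for m in messages):
            content.insert(0, {"type": "image"})
        messages.append({"role": "user", "content": content})
        messages.append({"role": "assistant",
                         "content": [{"type": "text", "text": turn["assistant"]}]})
    return messages

# Example with one of my raw samples (image path is a placeholder):
sample = {"images": "scene.jpg", "texts": [{"assistant": "this is image description_1"}]}
print(to_messages(sample))
```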

For question 2: I want the fine-tuning input to be multiple-images-text pairs or video-text pairs, as shown below.

Multiple-images-text pairs:

```
[
  { "images": [image_1, image_2, ..., image_n], "texts": [ ... ] },
  ...
]
```

Video-text pairs:

```
[
  { "video": [video frames], "texts": [ ... ] },
  ...
]
```
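
For the video case, my preprocessing would just be uniform frame sampling, something like the sketch below. This only shows how I would build the "video frames" list with OpenCV (the video path and frame count are placeholders); it says nothing about whether the model can consume it:

```python
# My preprocessing sketch: uniformly sample N frames from a video file and
# convert them to PIL images. The file name and num_frames are placeholders.
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR arrays; convert to RGB before wrapping as PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

video_frames = sample_frames("example.mp4", num_frames=8)
```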

For input with multiple-images-text pairs: can I modify the code to extract image patches in the IMAGE ENCODER, add an `if images` branch (different from the existing `if image` branch) to handle the IMAGE ENCODER output, send the processed output to the cross-attention in the LANGUAGE MODEL, and then fine-tune on my dataset, so that multiple-images-text input is realized? (My fallback idea is sketched right after this paragraph.)
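
If multi-image input turns out not to be feasible, my fallback would be to flatten each multi-image sample into several single-image samples and fine-tune on those, since you mentioned the model takes one image at a time. This is only my own sketch, using the raw sample format from above:

```python
# My fallback sketch: split one {"images": [...], "texts": [...]} sample into
# several single-image samples, matching my raw data format above.
def flatten_to_single_image_samples(sample):
    return [{"images": img, "texts": sample["texts"]} for img in sample["images"]]

# Example with placeholder image paths:
multi = {"images": ["frame_1.jpg", "frame_2.jpg"], "texts": [{"assistant": "this is other"}]}
print(flatten_to_single_image_samples(multi))
```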

For input with video-text pairs: I want to know whether implementation code exists for the red box in the image from the Meta paper (the same diagram as above), either in the official implementation or in the HuggingFace implementation. If that implementation code is not provided, that is fine, and I would just like to get your confirmation.

Those are some additional notes from me. I am doing some interesting work based on Llama-3.2-Vision and hope for your help. Thank you!