vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

How does the model work? #84

Open yukiarimo opened 5 months ago

yukiarimo commented 5 months ago

Can somebody please explain how the Moondream model works? For example, if my context window is 1024 tokens and the model uses 724 of them for the image, how does that work? The image is 378x378 (that's a lot of pixels). How do you fit an image into the text using such a small number of tokens?

KPCOFGS commented 5 months ago

I don't think it's just the Moondream model, but LLaVA models in general. Here is a website you can look into: https://llava-vl.github.io/. The repo owner apparently is not yet ready to release the training code, as mentioned in issue #11. You can also open a new issue and paste your question in https://github.com/haotian-liu/LLaVA/issues.

yukiarimo commented 5 months ago

is not yet ready to release the training code

Then, what is this about https://github.com/vikhyat/moondream/blob/main/notebooks/Finetuning.ipynb?

I don't think it's just the Moondream model, but LLaVA models in general

I see. There’s a lot of math going on. If you know, can you please just explain the main idea of how it works in one sentence?

KPCOFGS commented 5 months ago

Then, what is this about https://github.com/vikhyat/moondream/blob/main/notebooks/Finetuning.ipynb?

It is fine-tuning. Fine-tuning is basically further training the model on your own specific dataset. For example, if there is something you want that the base model does not offer, this is where fine-tuning comes in, and it is better than training a model from scratch.

Training a model from scratch requires a large dataset to be useful, and even with multiple GPUs the process usually takes hours, or even days, to finish.

Fine-tuning also requires a reasonably sized dataset. For example, for a computer-vision fine-tuning task your dataset might range from 2,000 to 10,000 images. That is quite large, but still tiny compared to training a model from scratch. Fine-tuning can also be done with one good GPU and some time, so it is far less time-consuming and less expensive.
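For a rough picture of what fine-tuning looks like in code, here is a minimal PyTorch sketch. The model interface, batch keys, and hyperparameters are all placeholder assumptions on my part, not the actual contents of the Finetuning.ipynb notebook.

```python
# Minimal fine-tuning sketch (hypothetical interface; not the notebook's actual code).
# Assumes `model` is a pretrained vision-language model whose forward pass returns
# an object with a .loss attribute, and `train_dataset` yields dicts of tensors.
import torch
from torch.utils.data import DataLoader

def finetune(model, train_dataset, epochs=1, lr=3e-5, batch_size=8, device="cuda"):
    model.train().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

    for epoch in range(epochs):
        for batch in loader:
            optimizer.zero_grad()
            # The model is assumed to compute a language-modeling loss internally
            # from the image, the tokenized question, and the target answer tokens.
            loss = model(
                images=batch["image"].to(device),
                input_ids=batch["input_ids"].to(device),
                labels=batch["labels"].to(device),
            ).loss
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
    return model
```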

Here are some sources on fine-tuning vs. training a model: https://chrisalbon.com/Large+Language+Models/Fine-Tuning+Vs.+Training and https://www.ibm.com/topics/fine-tuning

I see. There’s a lot of math going on. If you know, can you please just explain the main idea of how it works in one sentence?

Unfortunately I do not know the math either. But I do suggest opening another issue and asking that question in https://github.com/haotian-liu/LLaVA/issues. Those contributors are much more experienced in this field and probably know much more than I do.

CoderCowMoo commented 5 months ago

Hi, a basic and maybe slightly incorrect explanation is that the image is embedded, using SigLIP in this case, and that embedding is inserted into the prompt between some pieces of text (approx "Image:\nPrompt:").

Now this can't be done immediately after attaching an image embedding model and an LLM together, so the combined model is trained on examples where it has to use the image embeddings to answer questions. Please feel free to ask any questions and/or correct me if I'm wrong.
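To make that concrete, here is a rough sketch of how a LLaVA-style model splices image embeddings into the prompt. The module names (`vision_encoder`, `projection`, `text_embed`) and shapes are illustrative assumptions, not Moondream's exact implementation.

```python
# Rough sketch of how a LLaVA-style model feeds an image to the language model.
# All module names and shapes here are illustrative assumptions.
import torch

def build_prompt_embeddings(image, prompt_ids, vision_encoder, projection, text_embed):
    # 1. The vision encoder turns the image into a sequence of patch embeddings,
    #    e.g. shape (1, num_patches, vision_dim).
    patch_features = vision_encoder(image)

    # 2. A small projection maps each patch embedding into the language model's
    #    embedding space, shape (1, num_patches, hidden_size). Each projected
    #    patch now behaves like one "token" of context.
    image_tokens = projection(patch_features)

    # 3. The text prompt is embedded as usual, shape (1, num_text_tokens, hidden_size).
    text_tokens = text_embed(prompt_ids)

    # 4. Concatenate along the sequence dimension and run the language model on
    #    the combined sequence. The image costs num_patches tokens of context,
    #    regardless of how many pixels it contains.
    return torch.cat([image_tokens, text_tokens], dim=1)
```

So the "token cost" of an image is fixed by how many patch embeddings the vision encoder produces, not by the pixel count.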

yukiarimo commented 4 months ago

approx "Image:\nPrompt:"

I got it. But if your image is 512x512, it's 262144 pixels in total, which is A LOT more than the context window. How do you embed the image? Is it using something similar to RAG for iteration, or is it not?

CoderCowMoo commented 4 months ago

It processes the images in batches, but I'm not sure exactly how batches are combined.

Looking into the original ViT paper, it seems that each patch is treated similarly to how language models treat tokens, and a positional encoding is applied to each patch as well. So maybe the token count increases for higher-resolution images? Still learning, so sorry if this is unsatisfactory.
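To put numbers on that: assuming a SigLIP-style encoder with 14x14-pixel patches at 378x378 input (the patch size is an assumption on my part), the arithmetic works out like this:

```python
# Patch/token arithmetic for a ViT-style encoder. The patch sizes below are
# assumptions for illustration, not values read from Moondream's code.
def image_token_count(image_size: int, patch_size: int) -> int:
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

print(image_token_count(378, 14))  # 27 * 27 = 729, close to the token count
                                   # mentioned in the original question
print(image_token_count(448, 14))  # 32 * 32 = 1024 at a larger input resolution
```

Each patch covers patch_size² pixels, so the raw pixel count never enters the context window; only the patch count does.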

RonanKMcGovern commented 4 months ago

I got it. But if your image is 512x512, it's 262144 pixels in total, which is A LOT more than the context window. How do you embed the image? Is it using something similar to RAG for iteration, or is it not?

I can't see the code, but most likely it breaks the image into patches and each patch goes through an encoder to generate an embedding vector for that patch. Each patch might be 64x64 or 32x32 pixels, so there aren't all that many patches required for a 512x512 image, and those patches are then transformed in the vision encoder into a similar number of output vectors that go into the language model. There's one embedding vector per patch.
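As a quick sanity check on that (the patch sizes here are hypothetical, as in the comment above):

```python
# Number of patch embeddings for a 512x512 image at a few hypothetical patch sizes.
for patch in (64, 32, 16):
    n = (512 // patch) ** 2
    print(f"{patch}x{patch}-pixel patches -> {n} patch embeddings")
# 64x64 ->   64 embeddings
# 32x32 ->  256 embeddings
# 16x16 -> 1024 embeddings
```

So even though the image has 262,144 pixels, the language model only ever sees a few hundred to roughly a thousand embedding vectors, not one token per pixel, which is why no RAG-style iteration over the image is needed.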