vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

Could you compare with MoE-LLaVA-1.6B×4-Top2? It seems better? #42

Closed · llziss4ai closed this 7 months ago

llziss4ai commented 7 months ago
| Model | Activated Params | Resolution | VQAv2 | GQA | VizWiz | T-VQA |
|---|---|---|---|---|---|---|
| MoE-LLaVA-1.6B×4-Top2 | 2.0B | 336 | 76.7 | 60.3 | 36.2 | 50.1 |
| moondream | 1.6B | 384 | 74.3 | 56.3 | 30.3 | 39.8 |

I just found these results at https://github.com/PKU-YuanGroup/MoE-LLaVA/tree/main?tab=readme-ov-file#-model-zoo

sujitvasanth commented 7 months ago

There are also https://huggingface.co/YouLiXiya/tinyllava-v1.0-1.1b-hf and https://huggingface.co/bczhou/tiny-llava-v1-hf.

Both run natively from Hugging Face transformers and can be quantized to 4-bit with bitsandbytes. They occupy 2-3 GB of VRAM and can presumably be fine-tuned using the LLaVA GitHub examples (see the sketch below).

Currently, MoE-LLaVA-1.6B×4-Top2 requires DeepSpeed for inference and can't be quantized, although the author is asking for help to do so.
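For reference, a minimal sketch of loading one of those checkpoints in 4-bit with transformers and bitsandbytes (untested; the image path is a placeholder and the prompt template is an assumption based on the usual LLaVA format, so check the model card):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

model_id = "bczhou/tiny-llava-v1-hf"

# 4-bit quantization via bitsandbytes keeps the model in the 2-3 GB VRAM range
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

image = Image.open("extreme_ironing.jpg")  # placeholder image path
# Assumed LLaVA-style prompt template; the exact format may differ per model card
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```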

vikhyat commented 7 months ago

I couldn't get the code to run, so I can't reproduce these benchmarks.

sujitvasanth commented 7 months ago

@vikhyat There may be something to learn from MoE-LLaVA, as it utilises different LLM backbones, including Phi-2 and OpenChat; its mixture-of-experts architecture also seems to have reduced hallucinations.

I was able to get it running pretty easily: just clone the repo, cd into it, and run `deepspeed predict.py`. I had to redirect the image and model name in predict.py as below:

```python
# Edits to predict.py: point it at a local example image, a prompt, and a checkpoint
image = '/home/sujit/Downloads/MoE-LLaVA-main/moellava/serve/examples/extreme_ironing.jpg'
inp = 'What is unusual about this image?'
model_path = 'LanguageBind/MoE-LLaVA-StableLM-1.6B-4e'  # choose a model from the MoE-LLaVA model zoo
```