unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and πŸ”œ video, up to 5x faster than OpenAI CLIP and LLaVA πŸ–ΌοΈ & πŸ–‹οΈ
https://unum-cloud.github.io/uform/
Apache License 2.0

Generative models #53

ashvardanian closed this issue 9 months ago

ashvardanian commented 9 months ago

UForm is going Generative!

The UForm family of tiny multimodal transformer models just got bigger! In addition to the existing CLIP-like embedding models, we now have a generative model useful for image captioning, visual question answering, and multimodal chats. All that in just 1.5 billion parameters, small enough to fit even on mobile devices πŸŽ‰

Repository: https://github.com/unum-cloud/uform
Generative model: https://huggingface.co/unum-cloud/uform-gen
Chat model: https://huggingface.co/unum-cloud/uform-gen-chat

Evaluation Metrics


Being the smallest model of its kind, unum-cloud/uform-gen is hard to compare to others. Next in size are the 5x larger LLaVAs and InstructBLIP, with 7 billion parameters. LLaVA performs noticeably better on VQAv2: 78.5 vs 66.5. On captioning, CLIPScore and RefCLIPScore are relatively close across all models.

| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
| :--- | :---: | :---: | :---: | :---: |
| `llava-hf/llava-1.5-7b-hf` | 7B | Long | 0.878 | 0.529 |
| `llava-hf/llava-1.5-7b-hf` | 7B | Short | 0.886 | 0.531 |
| `Salesforce/instructblip-vicuna-7b` | 7B | Long | 0.902 | 0.534 |
| `Salesforce/instructblip-vicuna-7b` | 7B | Short | 0.848 | 0.523 |
| `unum-cloud/uform-gen` | 1.5B | Long | 0.847 | 0.523 |
| `unum-cloud/uform-gen` | 1.5B | Short | 0.842 | 0.522 |
| `unum-cloud/uform-gen-chat` | 1.5B | Long | 0.860 | 0.525 |
| `unum-cloud/uform-gen-chat` | 1.5B | Short | 0.858 | 0.525 |
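For reference, CLIPScore is a reference-free captioning metric: the cosine similarity between CLIP's image embedding and the candidate caption's text embedding, clipped at zero and scaled by a constant (2.5 in the original paper). A minimal pure-Python sketch of that formula (illustrative only, not the exact evaluation code used for the numbers above):

```python
import math


def clipscore(image_emb, caption_emb, w=2.5):
    """CLIPScore sketch: w * max(cos(image_emb, caption_emb), 0)."""
    dot = sum(a * b for a, b in zip(image_emb, caption_emb))
    norm_img = math.sqrt(sum(a * a for a in image_emb))
    norm_txt = math.sqrt(sum(b * b for b in caption_emb))
    return w * max(dot / (norm_img * norm_txt), 0.0)
```

A caption whose embedding points in the same direction as the image scores the maximum 2.5; an unrelated or opposite caption is clipped to 0.
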

Throughput

On an RTX 3090, using vanilla PyTorch for inference with bfloat16 arithmetic and greedy decoding, one can expect the following throughput.

| Model | Size | Speed | Speedup |
| :--- | :---: | :---: | :---: |
| `llava-hf/llava-1.5-7b-hf` | 7B | ~ 40 tokens/second | |
| `Salesforce/instructblip-vicuna-7b` | 7B | ~ 40 tokens/second | |
| `unum-cloud/uform-gen` | 1.5B | ~ 140 tokens/second | Γ— 3.5 |
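Greedy decoding, used for the numbers above, simply picks the highest-scoring token at every generation step (no sampling, no beams). A toy pure-Python sketch of the idea (the function and logits are illustrative, not the UForm API):

```python
def greedy_decode(step_logits):
    """Pick the argmax token id at each decoding step.

    step_logits: a list of per-step logit lists, one entry per
    generated token position. Returns the chosen token ids.
    """
    return [max(range(len(logits)), key=logits.__getitem__)
            for logits in step_logits]
```

Greedy decoding is deterministic and cheap, which makes throughput comparisons like the table above straightforward: at ~140 vs ~40 tokens/second, the reported speedup is the 3.5Γ— shown.
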
lin72h commented 9 months ago

Very impressive for 1.5B Model, what's the license for it?

ashvardanian commented 9 months ago

Thank you, @lin72h! It’s Apache 2.0, like the rest.

ashvardanian commented 9 months ago

:tada: This PR is included in version 1.0.0 :tada:

The release is available on GitHub.

Your semantic-release bot :package::rocket: