unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

Passing labels to text_decoder to compute loss. #65

Closed kapulkin closed 5 months ago

kapulkin commented 5 months ago

I noticed that the labels variable is not passed to text_decoder in VLMForCausalLM.forward(), so text_decoder returns only logits and does not compute the loss. This makes it impossible to use the VLMForCausalLM model with transformers.Trainer without writing a custom training loop or wrapping VLMForCausalLM.

This PR is a fix that removes the incompatibility.
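To make the problem concrete, here is a minimal sketch of the control flow. ToyTextDecoder and ToyVLM are hypothetical stand-ins, not the actual uform classes, and the "loss" is a placeholder squared error. The only point is that the wrapper must forward labels for the decoder's output to contain a loss, which is what transformers.Trainer reads when labels are supplied.

```python
# Toy stand-ins (hypothetical, not the uform implementation): they only
# illustrate why dropping `labels` in the wrapper's forward() hides the loss.


class ToyTextDecoder:
    """Always returns logits; returns a loss only when labels are given."""

    def __call__(self, inputs_embeds, labels=None):
        # Placeholder "logits": one number per position.
        logits = [sum(e) for e in inputs_embeds]
        out = {"logits": logits}
        if labels is not None:
            # Placeholder loss: mean squared error between logits and labels.
            out["loss"] = sum((l - y) ** 2 for l, y in zip(logits, labels)) / len(labels)
        return out


class ToyVLM:
    """Wrapper whose forward() either forwards labels or silently drops them."""

    def __init__(self, forward_labels):
        self.text_decoder = ToyTextDecoder()
        self.forward_labels = forward_labels

    def forward(self, inputs_embeds, labels=None):
        if self.forward_labels:
            return self.text_decoder(inputs_embeds, labels=labels)
        # Behaviour described in the report: labels never reach the decoder.
        return self.text_decoder(inputs_embeds)


embeds = [[1.0, 2.0], [0.5, 0.5]]
labels = [3.0, 1.0]
before_fix = ToyVLM(forward_labels=False).forward(embeds, labels)
after_fix = ToyVLM(forward_labels=True).forward(embeds, labels)
```

In this toy setup before_fix has no "loss" key while after_fix does, which mirrors the difference between a model that needs a custom training loop and one a stock trainer can drive.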

kimihailv commented 5 months ago

Passing labels to the text decoder is not enough. The input embeddings contain not only the embeddings of the text tokens but also the image features, so the logits cover the image positions as well as the text positions, and the labels no longer line up with the logits one-to-one.
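One common way to handle this mismatch is to pad the labels with an ignore index at the image positions so that only text positions contribute to the loss. The sketch below is pure Python and rests on assumptions: that image features are prepended to the text embeddings (the thread does not state the order), and that -100 is used as the ignore value, which is the default ignore_index of torch.nn.CrossEntropyLoss. The helper names are hypothetical, and the one-position shift that causal language-model losses usually apply is left out for brevity.

```python
import math

# Assumption: -100 matches the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def pad_labels_for_image(text_labels, num_image_tokens):
    """Prefix text labels with IGNORE_INDEX, one per image feature position.

    Assumes image features come before the text embeddings in the sequence.
    """
    return [IGNORE_INDEX] * num_image_tokens + list(text_labels)


def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # image position: contributes nothing to the loss
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Two image positions followed by two text positions, vocabulary of size 2.
labels = pad_labels_for_image([1, 0], num_image_tokens=2)
logits = [[5.0, -5.0], [-3.0, 3.0], [0.0, 0.0], [0.0, 0.0]]
loss = masked_cross_entropy(logits, labels)
```

Here the logits at the two image positions are arbitrary and do not affect the result; only the two text rows are averaged.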

ashvardanian commented 5 months ago

:tada: This PR is included in version 1.1.1 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: