sshh12 / multi_token

Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Apache License 2.0
175 stars · 12 forks

Fine tuning LLAVA for object detection #24

Open dipikakhullar opened 3 months ago

dipikakhullar commented 3 months ago

How do we fine-tune LLaVA for object detection tasks, or for predicting a trajectory of actions? How would that work? I'd then need a regression-based loss like MSE, right? And instead of outputting text, we'd want to output a set of coordinates. From other repos, it seems fine-tuning for regression tasks doesn't work well.

sshh12 commented 2 months ago

Hey! I would say multi_token/LLaVA are best suited for taking the features from an image model and embedding them in a chat-able large language model.

If you want to do object detection (rather than chat-with-this-image type tasks), you are best off using an existing object-detection-specialized architecture: https://huggingface.co/docs/transformers/en/tasks/object_detection. If you then want to do something chat/text-based with the detected objects, you can pass the outputs of that object detection model to the LLM.
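The hand-off described above can be sketched as: run a dedicated detector, then serialize its detections into plain text that goes into the LLM prompt. A minimal illustration, where `format_detections_for_llm` is a hypothetical helper (not part of multi_token) and the `detections` list mimics the output schema of the Hugging Face `transformers` object-detection pipeline:

```python
def format_detections_for_llm(detections):
    """Serialize object-detection outputs into a text prompt for an LLM.

    `detections` follows the schema returned by the `transformers`
    object-detection pipeline: a list of dicts with "label", "score",
    and a "box" of pixel coordinates.
    """
    lines = []
    for d in detections:
        box = d["box"]
        lines.append(
            f'- {d["label"]} (confidence {d["score"]:.2f}) at '
            f'[{box["xmin"]}, {box["ymin"]}, {box["xmax"]}, {box["ymax"]}]'
        )
    return "Detected objects:\n" + "\n".join(lines)

# In practice the detections could come from a real detector, e.g.:
#   from transformers import pipeline
#   detector = pipeline("object-detection", model="facebook/detr-resnet-50")
#   detections = detector("image.jpg")
# Here we use hard-coded example detections in that output format.
detections = [
    {"label": "cat", "score": 0.98,
     "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"label": "dog", "score": 0.91,
     "box": {"xmin": 220, "ymin": 30, "xmax": 400, "ymax": 190}},
]
prompt = format_detections_for_llm(detections)
```

The resulting `prompt` string can then be prepended to the user's question, so the LLM reasons over the detector's structured output rather than being trained to regress coordinates itself.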