Open dipikakhullar opened 3 months ago
Hey! I'd say multi_token/llava are best suited for taking the features from an image model and embedding them in a chat-able large language model.
If you want to do object detection (rather than chat-with-this-image type tasks), you are best off just using an existing specialized object detection architecture: https://huggingface.co/docs/transformers/en/tasks/object_detection. If you then wanted to do something chat/text-based on the detected objects, you could pass the outputs of that object detection model to the LLM.
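For concreteness, here's a minimal sketch of that second step: taking detections in the output schema used by the Hugging Face object-detection pipeline (dicts with `label`, `score`, `box`) and formatting them into a text prompt for the LLM. The detections are mocked here rather than produced by a real model, and `detections_to_prompt` is a hypothetical helper name:

```python
# Sketch: turn object-detection outputs into a text prompt for an LLM.
# In practice the detections would come from something like
# pipeline("object-detection", model="facebook/detr-resnet-50")(image);
# here they are hard-coded mocks with the same schema.

def detections_to_prompt(detections, question):
    """Format a list of detection dicts plus a user question into a prompt."""
    lines = []
    for d in detections:
        box = d["box"]
        lines.append(
            f'- {d["label"]} (confidence {d["score"]:.2f}) at '
            f'[{box["xmin"]}, {box["ymin"]}, {box["xmax"]}, {box["ymax"]}]'
        )
    return "Detected objects:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"

# Mock detections instead of running a real model here.
detections = [
    {"label": "dog", "score": 0.97,
     "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"label": "ball", "score": 0.88,
     "box": {"xmin": 220, "ymin": 150, "xmax": 260, "ymax": 190}},
]
prompt = detections_to_prompt(detections, "What is the dog doing?")
```

The resulting `prompt` string can then be fed to any chat LLM, with no multimodal fine-tuning required.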
How do we fine-tune llava for object detection tasks, or for predicting a trajectory of actions? How would that work? I'd need some regression-based loss like MSE, right? And instead of outputting text we would want to output a set of coordinates. From other repos it seems like fine-tuning for regression tasks doesn't work well.
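To make the setup in the question concrete, one common approach is to replace the language-model head with a small regression head that maps a pooled hidden state to normalized box coordinates and is trained with MSE. This is a hedged sketch, not LLaVA's actual API: `hidden_size`, the head architecture, and the dummy features are all assumptions standing in for the real VLM backbone:

```python
import torch
import torch.nn as nn

# Assumed hidden size of the backbone's final layer; use your model's config.
hidden_size = 768

# Hypothetical regression head: hidden state -> 4 normalized coords (x1, y1, x2, y2).
regression_head = nn.Sequential(
    nn.Linear(hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 4),
    nn.Sigmoid(),  # keep predictions in [0, 1]
)

loss_fn = nn.MSELoss()

# Dummy batch standing in for pooled hidden states from the VLM backbone.
features = torch.randn(8, hidden_size)
target_boxes = torch.rand(8, 4)  # ground-truth normalized boxes

pred_boxes = regression_head(features)
loss = loss_fn(pred_boxes, target_boxes)
loss.backward()  # gradients flow into the head (and the backbone, if unfrozen)
```

Note that detection-specialized models like DETR add more than plain MSE (e.g. an IoU-based box loss and set matching), which is part of why naive regression fine-tuning of a chat VLM tends to underperform a purpose-built detector.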