I find Microsoft's Phi 3.5 vision instruct performs much better than Florence 2. Since it's an instruct model, it also has the benefit of taking text instruction as input to help describing the images with the desired syntax.
Since you already have a dataset, maybe it could be interesting to finetune this model too 😀
I find Microsoft's Phi 3.5 vision instruct performs much better than Florence 2. Since it's an instruct model, it also has the benefit of taking text instruction as input to help describing the images with the desired syntax.
Since you already have a dataset, maybe it could be interesting to finetune this model too 😀
https://huggingface.co/microsoft/Phi-3.5-vision-instruct
Just sharing the idea! Thank you for sharing your work <3