haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Finetuning with Custom Image dataset for VQA #170

Open · RajdeepBorgohain opened this issue 1 year ago

RajdeepBorgohain commented 1 year ago

Question

We want to fine-tune this model on our own custom image dataset, which consists mostly of design images, so that users can ask questions about an image. At the moment, LLaVA 13B does not produce the expected results, so we are planning to create a set of 3,000 images with question-answer pairs to improve the model.
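For reference, here is a minimal sketch of how we plan to lay out the question-answer data, following the conversation-style JSON format that the repo's training scripts expect (the `id`/`image`/`conversations` fields and the `<image>` token follow the LLaVA data format; the file names and text here are placeholders):

```python
import json

# A minimal sketch of a LLaVA-style fine-tuning record: one entry per image,
# with alternating human/gpt turns. Paths and text are made-up placeholders;
# image paths are resolved relative to the image folder given to the
# training script.
samples = [
    {
        "id": "design-0001",
        "image": "design-0001.png",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is the main element of this design?"},
            {"from": "gpt", "value": "A centered hero banner with a call-to-action button."},
        ],
    },
]

with open("custom_vqa_train.json", "w") as f:
    json.dump(samples, f, indent=2)
```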

Please share your thoughts: what kind of infrastructure is required if we want to fine-tune the 7B model?

haotian-liu commented 1 year ago

Hi @RajdeepBorgohain, thank you for your interest in our work.

You can try running this on 8x A100s; for the 7B model, 8x A100 (40GB) should be enough.

We are also working on supporting more hardware with DeepSpeed, targeted for the end of this month or early next month. Stay tuned if you are interested.
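In the meantime, if you want to experiment with DeepSpeed yourself, here is a minimal sketch of a ZeRO-3 config with CPU offloading. These are standard DeepSpeed options rather than anything specific to this repo, and the `"auto"` values assume you launch through the HuggingFace Trainer, which resolves them at startup:

```python
import json

# A minimal DeepSpeed ZeRO-3 config with CPU offloading: it trades training
# speed for a much smaller per-GPU memory footprint. "auto" values are
# filled in by the HuggingFace Trainer at launch time.
zero3_offload = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero3_offload.json", "w") as f:
    json.dump(zero3_offload, f, indent=2)
```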

Please let me know if there are other questions, thanks!

RajdeepBorgohain commented 1 year ago

Hi, thanks a lot for your reply :) I have another question: I am using your pretrained checkpoint "LLaVA-7b-delta-v0" for fine-tuning, and I am trying to do it on 1x A100 (40GB), since I have a very small dataset of 80 images and instructions. I am getting errors like `torch.distributed.elastic.multiprocessing.errors.ChildFailedError`.
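In case it helps anyone hitting the same thing: `ChildFailedError` is only the wrapper that torchrun/elastic puts around whatever exception killed a worker process, so the real traceback is printed above it in the log. On a single 40GB card, CUDA out-of-memory is a likely culprit, so a quick device sanity check before launching (nothing LLaVA-specific):

```python
import torch

# ChildFailedError just reports that a worker died; the underlying exception
# appears earlier in the log. Check the visible device and its free memory:
assert torch.cuda.is_available(), "no CUDA device visible"
print(torch.cuda.get_device_name(0))
free, total = torch.cuda.mem_get_info(0)
print(f"free {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```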

Also, we are experimenting with web UI images. Can you share any tips to improve the model's performance on this type of image?

Also, we found that the 13B model fails to answer questions about images that contain a lot of text. Can you share how we can improve on this?