In the rapidly evolving landscape of multimodal language models, efficient fine-tuning methodologies are increasingly vital. This work focuses on making better use of pre-trained multimodal vision-language models, specifically optimizing their performance under hardware resource constraints. Recognizing that harnessing powerful models often demands substantial computational resources and extensive datasets, our objective is to democratize the fine-tuning process, making it accessible and impactful across diverse applications. Through targeted experiments on Visual Question Answering (VQAv2) and on medical-domain tasks using the A-OKVQA and PubMedQA datasets, we navigate the balance between performance and resource efficiency. Leveraging the OpenFlamingo framework, our work explores the potential of large pre-trained vision-language models (VLMs) through component substitution and domain adaptation experiments.
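For orientation, here is a minimal sketch of how an OpenFlamingo model is typically instantiated with the open_flamingo package; the vision encoder, language model, and checkpoint named below are illustrative choices, not necessarily the exact configuration used in our experiments. Swapping these paths is the mechanism behind the component-substitution experiments.

```python
# Minimal loading sketch (illustrative component choices, not our exact setup).
import torch
from huggingface_hub import hf_hub_download
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",          # vision backbone; substitutable
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",   # language backbone; substitutable
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Load pre-trained cross-attention weights matching the chosen component pairing.
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-3B-vitl-mpt1b", "checkpoint.pt"
)
model.load_state_dict(torch.load(checkpoint_path), strict=False)
```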
Install the dependencies:

pip install -r requirements.txt

Then run the following Jupyter notebooks:
- `eval_script(VQAv2).ipynb` evaluates the model on the VQAv2 validation set using the open-flamingo framework.
- `PubMedQA_dataset_script.ipynb` fine-tunes the model on the PubMedQA dataset.
- `run_open-flamingo.ipynb` creates an interface to query the open-flamingo model and generate results (a minimal query sketch follows this list).
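To illustrate what the query interface does, the sketch below pushes one image and a VQA-style prompt through the model loaded in the sketch above; the image URL, prompt, and generation settings are placeholder assumptions, not the notebook's exact code.

```python
# Minimal query sketch; reuses `model`, `image_processor`, and `tokenizer`
# from the loading sketch above. Image and prompt are placeholders.
import requests
import torch
from PIL import Image

image = Image.open(
    requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
    ).raw
)

# OpenFlamingo expects vision input shaped (batch, num_media, num_frames, C, H, W).
vision_x = image_processor(image).unsqueeze(0).unsqueeze(0).unsqueeze(0)

tokenizer.padding_side = "left"  # left-pad so generation continues the prompt
lang_x = tokenizer(
    ["<image>Question: What animals are in the picture? Short answer:"],
    return_tensors="pt",
)

generated = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=10,
    num_beams=3,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```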