
MorphVLM: Towards More Efficient and Robust Multimodal Vision Language Models

CSCI-544 NLP Project - Fall 2023

Abstract

In the rapidly evolving field of multimodal language models, efficient fine-tuning methodologies are increasingly vital. This work focuses on making better use of pre-trained multimodal vision language models, specifically optimizing their performance under hardware resource constraints. Because harnessing powerful models typically demands substantial computational resources and extensive datasets, our objective is to democratize the fine-tuning process, keeping it accessible and impactful across diverse applications. Through targeted experiments on Visual Question Answering (VQAv2, A-OKVQA) and the medical domain (PubMedQA), we examine the trade-off between performance and resource efficiency. Building on the OpenFlamingo framework, we explore the potential of large pre-trained Vision Language Models (VLMs) through component-substitution and domain-adaptation experiments.
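
As a rough sketch of what component substitution can look like in practice, the snippet below instantiates an OpenFlamingo model with a chosen CLIP vision encoder and language backbone, then freezes everything except the bridging modules. The checkpoint names and the parameter-name filter are illustrative assumptions, not the exact configuration used in this project.

```python
# Illustrative sketch of component substitution with the open_flamingo package.
# Checkpoint names below are assumptions, not this project's exact configuration.
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",        # vision encoder to plug in
    clip_vision_encoder_pretrained="openai",    # pretrained weights for that encoder
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",  # language backbone
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,                # how densely cross-attention blocks are inserted
)

# Train only the bridging modules (perceiver resampler and gated cross-attention)
# so fine-tuning fits on modest hardware. The substring filter assumes
# OpenFlamingo's module naming and may need adjusting.
for name, param in model.named_parameters():
    param.requires_grad = ("gated_cross_attn" in name) or ("perceiver" in name)
```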

Code details

Install the dependencies with pip install -r requirements.txt, then run the Jupyter notebooks in this repository.
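
For a quick sanity check outside the notebooks, a minimal VQA-style query could look like the sketch below. It continues from the loading sketch above (reusing model, image_processor, and tokenizer), and the image path, prompt, and <image> token usage are assumptions based on the public OpenFlamingo interface.

```python
# Hypothetical smoke test, continuing from the loading sketch above
# (model, image_processor, and tokenizer come from create_model_and_transforms).
import torch
from PIL import Image

image = Image.open("sample.jpg")  # placeholder image path

# Shape expected by OpenFlamingo: (batch, num_media, num_frames, channels, height, width)
vision_x = image_processor(image).unsqueeze(0).unsqueeze(1).unsqueeze(1)

tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>Question: What is shown in the picture? Answer:"],
    return_tensors="pt",
)

with torch.no_grad():
    generated = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=20,
        num_beams=3,
    )
print(tokenizer.decode(generated[0]))
```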

Sample Output

Model Output

Equal Contributors: