HEMM: Holistic Evaluation of Multimodal Foundation Models
Paul Pu Liang, Akshay Goindani, Talha Chafekar, Leena Mathur, Haofei Yu, Ruslan Salakhutdinov, Louis-Philippe Morency
Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world domains. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this work, we introduce Holistic Evaluation of Multimodal Models (HEMM) as a framework to systematically evaluate the capabilities of multimodal foundation models across a set of 3 comprehensive dimensions: basic skills, information flow, and real-world use cases.
Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge.
Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion.
Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications.
Overall, HEMM's collection of 30 datasets enables a systematic evaluation of today's multimodal foundation models. Through comprehensive experiments with many models across HEMM tasks, we (1) identify key dataset dimensions (basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction-tuning objectives) influence downstream task performance. These findings yield important conclusions about challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, as well as the benefits of data and model scale and of instruction tuning.
If you find this repository useful, please cite the corresponding paper:
@article{liang2024hemm,
  title={HEMM: Holistic Evaluation of Multimodal Foundation Models},
  author={Liang, Paul Pu and Goindani, Akshay and Chafekar, Talha and Mathur, Leena and Yu, Haofei and Salakhutdinov, Ruslan and Morency, Louis-Philippe},
  journal={arXiv preprint arXiv:2407.03418},
  year={2024}
}
We categorize the datasets along individual dimensions such as Use Case, Multimodal Interaction, Granularity of Multimodal Alignment, Level of Reasoning, Cross-Modal Information Flow, and External Knowledge. The categories for each dimension are listed below.
Modeling dimensions and categories are as follows:
HEMM currently supports the following datasets:
Follow these steps to add a new dataset:
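As a rough, assumption-laden sketch (not the official instructions), a new dataset is typically wrapped in an evaluator class that exposes the same interface used in the sample code later in this README; the base-class name, module path, and method signatures below are assumptions:

# Illustrative sketch only: the base class and its module path are assumptions.
from hemm.data.dataset import HEMMDatasetEvaluator  # assumed location of the base class

class MyDatasetEvaluator(HEMMDatasetEvaluator):  # hypothetical new dataset
    def __init__(self, download_dir="./"):
        super().__init__()
        self.download_dir = download_dir

    def load(self):
        # Download and cache the raw data (e.g., from Hugging Face or Kaggle).
        ...

    def evaluate_dataset(self, model):
        # Run the model on every example and return (predictions, ground_truth),
        # matching the interface used in the sample code below.
        predictions, ground_truth = [], []
        ...
        return predictions, ground_truth

After defining the evaluator, register its key with load_dataset_evaluator (in hemm/utils/base_utils.py) so the dataset can be loaded by name; the exact registration mechanism may differ.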
HEMM currently supports the following open-source multimodal foundation models:
For our analysis, we also evaluate two closed-source models: GPT-4V and Gemini 1.0 Pro Vision.
Follow these steps to add new models to HEMM:
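Similarly, as an assumption-based sketch rather than the official instructions, a new model needs a wrapper that implements weight loading and generation, mirroring how models are used in the sample code below; the base-class name, module path, and method names are assumptions:

# Illustrative sketch only: the base class, module path, and method names are assumptions.
from hemm.models.model import HEMMModel  # assumed location of the base class

class MyModel(HEMMModel):  # hypothetical new model wrapper
    def __init__(self, download_dir="./"):
        super().__init__()
        self.download_dir = download_dir

    def load_weights(self):
        # Load the checkpoint and any processors/tokenizers.
        ...

    def generate(self, text, image):
        # Return the model's text output for one (prompt, image) pair.
        ...

Register the new model key with load_model (in hemm/utils/base_utils.py) so it can be loaded by name, as in the sample code.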
HEMM currently supports the following metrics for text generation tasks. Since BARTScore shows the highest correlation with human judgments among these metrics, we use BARTScore for our analysis.
For image generation tasks, HEMM supports Mean Squared Error (MSE) and the CLIP-I score.
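As a generic illustration of these image metrics (not HEMM's exact implementation), CLIP-I is commonly computed as the cosine similarity between CLIP image embeddings of the generated and reference images, and MSE compares raw pixels; the CLIP checkpoint name below is an assumption:

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP-I / MSE sketch; HEMM's internal implementation may differ.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i_score(generated_img, reference_img):
    # Cosine similarity between CLIP image embeddings of the two images.
    inputs = processor(images=[generated_img, reference_img], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())

def mse(generated_img, reference_img):
    # Pixel-wise mean squared error (images as arrays of the same shape).
    a = np.asarray(generated_img, dtype=np.float32)
    b = np.asarray(reference_img, dtype=np.float32)
    return float(np.mean((a - b) ** 2))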
We perform our analysis on text-generation tasks and compute BARTScore(generation, ground truth) for every model on every task. For each task, we then normalize the scores using min-max scaling, where min is the score of the worst-performing model and max is the identity score, BARTScore(ground truth, ground truth).
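A minimal sketch of this normalization (illustrative names and made-up numbers, not HEMM's exact code):

# Min-max normalization of per-task BARTScores, as described above.
# `scores` maps model name -> mean BARTScore(generation, ground truth) on one task;
# `identity_score` is BARTScore(ground truth, ground truth) on that task.
def normalize_task_scores(scores, identity_score):
    worst = min(scores.values())  # score of the worst-performing model
    return {name: (s - worst) / (identity_score - worst)
            for name, s in scores.items()}

# Example: normalized values land in [0, 1], where 1 means identity-level score.
normalized = normalize_task_scores({"model_a": -3.4, "model_b": -2.1}, identity_score=-0.6)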
Create a virtual environment and install dependencies.
python -m venv env
source env/bin/activate
cd HEMM
pip install -r requirements.txt
Note: Some datasets are hosted on Hugging Face and Kaggle, so you will need API keys for both. Pass your Hugging Face authorization token (hf_auth_token) and the path to the directory containing kaggle.json (kaggle_api_path) to load_dataset_evaluator.
Sample code:
from hemm.utils.base_utils import load_model, load_dataset_evaluator
from hemm.metrics.bartscore_metric import BartScoreMetric
model_key = 'blip2'
model = load_model(model_key, download_dir="./")
model.load_weights()

# Replace these placeholders with your own credentials (see the note above).
hf_auth_token = "<your_huggingface_token>"
kaggle_api_path = "<path_to_directory_containing_kaggle.json>"

dataset_name = 'hateful_memes'
dataset_evaluator = load_dataset_evaluator(dataset_name,
                                           download_dir="./",
                                           kaggle_api_path=kaggle_api_path,
                                           hf_auth_token=hf_auth_token)
## Evaluate examples one at a time (unbatched)
predictions, ground_truth = dataset_evaluator.evaluate_dataset(model=model)
metric = BartScoreMetric()
bart_score = metric.compute(predictions, ground_truth)
## Batched evaluation (if the model supports batched inference)
results = dataset_evaluator.evaluate_dataset_batched(model=model, batch_size=32)
print(results)
coming soon!