jam-cc / MMAD

Code and data for the first-ever comprehensive benchmark for Multimodal Large Language Models in industrial anomaly detection

MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection


Our benchmark responds to the following questions:

📜 News

👀 Overview

In industrial inspection, Multimodal Large Language Models (MLLMs) have strong potential to reshape practical workflows thanks to their robust language capabilities and generalization ability. However, despite their impressive problem-solving skills in many domains, their ability in industrial anomaly detection has not been systematically studied. To bridge this gap, we present MMAD, the first-ever full-spectrum MLLM benchmark for industrial anomaly detection. We define seven key subtasks for MLLMs in industrial inspection and design a novel pipeline that generates the MMAD dataset, with 39,672 questions for 8,366 industrial images. Using MMAD, we conduct a comprehensive, quantitative evaluation of various state-of-the-art MLLMs.

📐 Dataset Examples

We collected 8,366 samples from 38 classes of industrial products across 4 public datasets, generating a total of 39,672 multiple-choice questions in 7 key subtasks.
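Because every question is multiple-choice and belongs to one of the subtasks, a natural way to summarize results is accuracy per subtask. The following is a minimal sketch of that computation, not the repository's scoring code: the record keys (`subtask`, `answer`, `prediction`) and the subtask names are placeholders chosen for illustration.

```python
from collections import defaultdict


def accuracy_by_subtask(records):
    """Return {subtask: fraction of correct answers}.

    Each record is a dict with hypothetical keys:
    'subtask', 'answer' (gold option letter), 'prediction' (model's letter).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["subtask"]] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[record["subtask"]] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy records with placeholder subtask names (not MMAD's actual labels).
records = [
    {"subtask": "subtask_a", "answer": "A", "prediction": "A"},
    {"subtask": "subtask_a", "answer": "B", "prediction": "C"},
    {"subtask": "subtask_b", "answer": "D", "prediction": "d"},
]
scores = accuracy_by_subtask(records)
print(scores)
```

Reporting one number per subtask, rather than a single pooled accuracy, keeps a subtask with many questions from hiding weak performance on a smaller one.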

🔮 Evaluation Pipeline

1. Data Preparation

Prepare the evaluation dataset by following the instructions provided in the README.md file located in the dataset folder.

Alternatively, download the dataset directly from Hugging Face:

cd dataset
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/jiang-cc/MMAD

Or download the ZIP file:

cd dataset/MMAD
# Quote the URL so the shell does not treat "?" as a glob character
wget -O ALL_DATA.zip "https://huggingface.co/datasets/jiang-cc/MMAD/resolve/refs%2Fpr%2F1/ALL_DATA.zip?download=true"
unzip ALL_DATA.zip
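After cloning or unzipping, it can help to confirm that the images actually landed on disk. The sketch below counts image files per top-level folder. The `dataset/MMAD` location and the list of image suffixes are assumptions, and the total may differ from 8,366 if the archive also ships auxiliary images.

```python
from collections import Counter
from pathlib import Path

# Assumed image formats; extend this set if the dataset uses others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images(root):
    """Count image files under root, grouped by the first path component."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.relative_to(root).parts[0]] += 1
    return counts


if __name__ == "__main__":
    data_root = Path("dataset/MMAD")  # assumed location; adjust as needed
    if data_root.is_dir():
        counts = count_images(data_root)
        for name, n in sorted(counts.items()):
            print(f"{name}: {n}")
        print("total:", sum(counts.values()))
    else:
        print(f"{data_root} not found; adjust the path")
```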

2. Model Configuration

Because MLLMs handle input and output differently, we provide a separate example file for each tested model in the evaluation folder.
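One place where output handling typically differs is how the chosen option is recovered from a free-form reply. The helper below is a hypothetical sketch of such normalization, not the parsing logic used in the repository. It assumes options are labeled with uppercase letters and accepts either an explicit "answer is X" statement or a reply that begins with the letter.

```python
import re


def extract_choice(reply, options="ABCD"):
    """Return the option letter found in a model reply, or None."""
    letters = re.escape(options)
    text = reply.strip()
    # 1) Explicit statements such as "Answer: B" or "the answer is (C)".
    match = re.search(
        rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{letters}])\b", text
    )
    if match:
        return match.group(1)
    # 2) A reply that starts with the letter, e.g. "B", "B." or "(B) ...".
    match = re.match(rf"\(?([{letters}])(?:[).:,]|\s*$)", text)
    if match:
        return match.group(1)
    return None


for reply in ["B", "(C) The part is broken.", "The answer is D.", "A scratch is visible."]:
    print(repr(reply), "->", extract_choice(reply))
```

Requiring a delimiter after a leading letter avoids reading the article "A" in "A scratch is visible." as option A. Replies that match neither pattern return `None` so they can be counted separately instead of being scored as a guess.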

For Gemini and GPT-4, an API key is required and should be provided in the respective file.
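If you prefer not to write the key into a file that might be committed, one option is to read it from an environment variable and assign it where the example file expects the key. This is a minimal sketch of that alternative. The helper and the variable name in the usage comment are assumptions, not part of the repository.

```python
import os


def load_api_key(env_var):
    """Read an API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the evaluation."
        )
    return key


# Usage (the variable name is hypothetical; pick whatever suits your setup):
#   export MMAD_API_KEY="..."
#   api_key = load_api_key("MMAD_API_KEY")
```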

For Cambrian, LLaVA, and SPHINX, set up the environment as described in each model's original repository (see the Cambrian, LLaVA-NeXT, and SPHINX repositories).

For Qwen, MiniCPM, InternVL, and similar models, simply install the transformers library (pip install transformers).

3. Run Evaluation

Each evaluation script takes a --model-path argument to specify the model and a --few_shot_model argument to set the number of normal samples included in the prompt.

Examples:

cd ./evaluation/examples/Transformers
python internvl_query.py --model-path ../../InternVL/pretrained/InternVL2-1B

cd ./evaluation/examples/LLaVA_Query
python llava_query.py --model-path ../../LLaVA/llava-v1.6-34b/ --dtype 4bit
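For readers writing a query script for a new model, the two arguments described above could be declared with `argparse` roughly as follows. This is a sketch only: the defaults and help strings are assumptions, and the actual scripts may define further options (such as the `--dtype` flag in the LLaVA example).

```python
import argparse


def build_parser():
    """Build a CLI parser mirroring the two documented arguments."""
    parser = argparse.ArgumentParser(
        description="Sketch of an MMAD-style query script CLI"
    )
    parser.add_argument(
        "--model-path", type=str, required=True,
        help="Path to the model checkpoint",
    )
    parser.add_argument(
        "--few_shot_model", type=int, default=1,  # default is an assumption
        help="Number of normal samples included in the prompt",
    )
    return parser


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["--model-path", "path/to/model", "--few_shot_model", "0"]
)
# argparse exposes "--model-path" as args.model_path (hyphen becomes underscore).
print(args.model_path, args.few_shot_model)
```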

👨‍💻 Todo

BibTeX Citation

If you find this paper and repository useful, please cite our paper ☺️.

@article{Jiang2024MMADTF,
  title={MMAD: The First-Ever Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection},
  author={Xi Jiang and Jian Li and Hanqiu Deng and Yong Liu and Bin-Bin Gao and Yifeng Zhou and Jialin Li and Chengjie Wang and Feng Zheng},
  year={2024},
  journal={arXiv preprint arXiv:2410.09453},
}