AMBER is an LLM-free, multi-dimensional benchmark for evaluating hallucination in multimodal large language models (MLLMs). It supports both the generative task and the discriminative task, and covers existence, attribute, and relation hallucination. AMBER provides fine-grained annotations and an automated evaluation pipeline.
Figure: Data statistics and object distribution of AMBER.
Figure: Results of mainstream MLLMs evaluated by AMBER.
1. spaCy is used for near-synonym judgment:
pip install -U spacy
python -m spacy download en_core_web_lg
2. NLTK is used for object extraction:
pip install nltk
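Before running the evaluation, it can help to confirm that the dependencies above are importable. The following is a minimal sketch, not part of AMBER itself: the helper name `missing_packages` is hypothetical, and it assumes the spaCy model `en_core_web_lg` is visible as an ordinary Python package once downloaded.

```python
from importlib.util import find_spec

# Importable names for the dependencies listed above. The spaCy model is
# assumed to be installed as a regular package named en_core_web_lg.
REQUIRED = ["spacy", "en_core_web_lg", "nltk"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

The check only looks up the packages without importing them, so it stays fast even though the large spaCy model is involved.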
Download the images from this LINK.
json file | Task or Dimension | Evaluation args |
---|---|---|
query_all.json | All the tasks and dimensions | a |
query_generative.json | Generative task | g |
query_discriminative.json | Discriminative task | d |
query_discriminative-existence.json | Existence dimension | de |
query_discriminative-attribute.json | Attribute dimension | da |
query_discriminative-relation.json | Relation dimension | dr |
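The table above can be expressed as a small lookup so that a script picks the query file matching the evaluation argument it will later pass to the evaluator. This is only an illustrative sketch: the names `QUERY_FILES` and `query_file_for` are hypothetical and do not come from the AMBER code.

```python
# Evaluation argument -> query file, copied from the table above.
QUERY_FILES = {
    "a": "query_all.json",
    "g": "query_generative.json",
    "d": "query_discriminative.json",
    "de": "query_discriminative-existence.json",
    "da": "query_discriminative-attribute.json",
    "dr": "query_discriminative-relation.json",
}


def query_file_for(evaluation_arg):
    """Return the query file name for one of the evaluation arguments."""
    try:
        return QUERY_FILES[evaluation_arg]
    except KeyError:
        valid = ", ".join(sorted(QUERY_FILES))
        raise ValueError(
            f"unknown evaluation argument {evaluation_arg!r}; expected one of: {valid}"
        ) from None
```

For example, `query_file_for("de")` returns `query_discriminative-existence.json`, and an unrecognized argument raises `ValueError` instead of silently falling back to another file.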
For the generative task (1 <= id <= 1004), the responses should be formatted as follows:
[
{
"id": 1,
"response": "The description of AMBER_1.jpg from MLLM."
},
......
{
"id": 1004,
"response": "The description of AMBER_1004.jpg from MLLM."
}
]
For the discriminative task (id >= 1005), the responses should be formatted as follows:
[
{
"id": 1005,
"response": "Yes" or "No"
},
......
{
"id": 15220,
"response": "Yes" or "No"
}
]
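The two response formats above can be checked before evaluation with a short script. The sketch below is an assumption-laden illustration rather than AMBER's own code: `validate_responses` and `save_responses` are hypothetical helpers, and the exact match against "Yes" or "No" is a conservative reading of the format, since the evaluator may be more lenient.

```python
import json

GENERATIVE_MAX_ID = 1004  # ids 1..1004 are generative; ids >= 1005 are discriminative


def validate_responses(responses):
    """Return a list of problems found; an empty list means the format looks right."""
    problems = []
    for index, item in enumerate(responses):
        if not isinstance(item, dict):
            problems.append(f"entry {index}: expected an object")
            continue
        item_id = item.get("id")
        response = item.get("response")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item_id, bool) or not isinstance(item_id, int) or item_id < 1:
            problems.append(f"entry {index}: 'id' must be a positive integer")
            continue
        if not isinstance(response, str):
            problems.append(f"id {item_id}: 'response' must be a string")
        elif item_id > GENERATIVE_MAX_ID and response not in ("Yes", "No"):
            # Assumption: discriminative answers must be exactly "Yes" or "No".
            problems.append(f"id {item_id}: expected 'Yes' or 'No', got {response!r}")
    return problems


def save_responses(responses, path):
    """Validate the responses, then write them as a JSON array to path."""
    problems = validate_responses(responses)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(responses, handle, ensure_ascii=False, indent=2)
```

Collecting every problem instead of stopping at the first one makes it easier to fix a large response file in a single pass.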
python inference.py --inference_data path/to/your/inference/file --evaluation_type {Evaluation args}
If you find this work useful, consider giving this repository a star and citing our paper as follows:
@article{wang2023llm,
title={An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation},
author={Wang, Junyang and Wang, Yuhang and Xu, Guohai and Zhang, Jing and Gu, Yukai and Jia, Haitao and Yan, Ming and Zhang, Ji and Sang, Jitao},
journal={arXiv preprint arXiv:2311.07397},
year={2023}
}