open-compass / VLMEvalKit

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 40+ benchmarks
https://huggingface.co/spaces/opencompass/open_vlm_leaderboard
Apache License 2.0

Error about vlmeval/dataset/utils/multiple_choice.py #290

Closed: mary-0830 closed this issue 1 month ago

mary-0830 commented 1 month ago

Hi, I ran into a problem evaluating MMT-Bench_VAL_MI, as shown below:

```
abilities: ['navigation', 'image_matting', nan, 'meme_vedio_understanding', 'single_object_tracking', 'counting_by_visual_prompting', 'chart_to_table', 'multiple_view_image_understanding', 'table_structure_recognition', 'vehicle_keypoint_detection', 'furniture_keypoint_detection', 'artwork_emotion_recognition', 'referring_detection', 'spot_the_diff', 'polygon_localization', 'jigsaw_puzzle_solving', 'interactive_segmentation', 'clothes_keypoint_detection', 'pixel_localization', 'chart_vqa', 'point_tracking', 'counting_by_reasoning', 'ravens_progressive_matrices', 'pixel_recognition', 'clock_reading', 'counting_by_category', 'chart_to_text', 'crowd_counting', 'micro_expression_recognition', 'image_colorization', 'body_emotion_recognition', 'reason_seg', 'doc_vqa', 'facail_expression_change_recognition', 'human_keypoint_detection', 'spot_the_similarity', 'visual_document_information_extraction', 'facial_expression_recognition', 'depth_estimation', 'one_shot_detection', 'whoops', 'meme_image_understanding', 'animal_keypoint_detection', 'scene_emotion_recognition']
Traceback (most recent call last):
  File "/ssd/ljj/vlmevalkit_run.py", line 196, in <module>
    main()
  File "/ssd/ljj/vlmevalkit_run.py", line 181, in main
    eval_results = dataset.evaluate(result_file, **judge_kwargs)
  File "/ssd/ljj/vlmeval/dataset/image_mcq.py", line 251, in evaluate
    acc = report_acc_MMT(data_main)
  File "/ssd/ljj/vlmeval/dataset/utils/multiple_choice.py", line 121, in report_acc_MMT
    abilities.sort()
TypeError: '<' not supported between instances of 'float' and 'str'
```
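The `TypeError` comes from the `nan` entry visible in the abilities list: pandas turns empty TSV cells into `float('nan')`, and Python 3 cannot order a float against a string. A minimal reproduction (not the toolkit's code), with one possible workaround of filtering non-string entries before sorting:

```python
abilities = ["navigation", "image_matting", float("nan"), "chart_vqa"]

# list.sort()/sorted() on a mixed float/str list raises TypeError in Python 3.
try:
    sorted(abilities)
    msg = ""
except TypeError as e:
    msg = str(e)
print(msg)

# Workaround: drop the NaN placeholders (empty cells) before sorting.
clean = sorted(a for a in abilities if isinstance(a, str))
print(clean)
```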

junming-yang commented 1 month ago

Hi, @mary-0830. I tried to reproduce this problem on the latest main branch with the command `python3 run.py --model InternVL2-1B --data MMT-Bench_VAL_MI`, and no error was reported.

It seems that the MMT-Bench_VAL_MI.tsv file is incomplete. Could you check or re-download the benchmark file?
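If you want to check the file before re-running the full evaluation, one quick sanity check is to load the TSV with pandas and look for empty cells in the ability column, since those become the `nan` entries that later break `abilities.sort()`. A sketch using an in-memory stand-in for the file (the column name `category` is an assumption, not necessarily the real header):

```python
import io

import pandas as pd

# Tiny stand-in for MMT-Bench_VAL_MI.tsv; the real file is read the same way:
# pd.read_csv("MMT-Bench_VAL_MI.tsv", sep="\t")
tsv = "index\tcategory\n0\tnavigation\n1\t\n2\tchart_vqa\n"
df = pd.read_csv(io.StringIO(tsv), sep="\t")

# Empty cells come back as NaN (a float), which is exactly what ends up
# mixed into the string abilities list.
bad = df[df["category"].isna()]
print(len(bad))
```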

mary-0830 commented 1 month ago

> Hi, @mary-0830. I tried to reproduce this problem on the latest main branch with the command `python3 run.py --model InternVL2-1B --data MMT-Bench_VAL_MI`, and no error was reported.
>
> It seems that the MMT-Bench_VAL_MI.tsv file is incomplete. Could you check or re-download the benchmark file?

I downloaded 'MMT-Bench_VAL_MI' from https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL_MI.tsv, and I found an error in the data file. Why do I see this problem when I open it with VS Code? (screenshot attached)

mary-0830 commented 1 month ago

In addition, could you print your abilities list? @junming-yang

junming-yang commented 1 month ago

Some questions in the MMT benchmark file have decoding problems. This should have no effect on running the program.

This is our abilities list:

```
['color_contrast', 'screenshot2code', 'jigsaw_puzzle_solving', 'image2image_retrieval', 'sketch2image_retrieval', 'table_structure_recognition', 'visual_prompt_understanding', 'facail_expression_change_recognition', 'traffic_anomaly_detection', 'point_tracking', 'writing_poetry_from_image', 'vehicle_keypoint_detection', 'painting_recognition', 'age_gender_race_recognition', 'multiple_instance_captioning', 'face_retrieval', 'waste_recognition', 'whoops', 'one_shot_detection', 'vehicle_recognition', 'video_captioning', 'handwritten_mathematical_expression_recognition', 'rotated_object_detection', 'animal_keypoint_detection', 'lesion_grading', 'navigation', 'next_img_prediction', 'micro_expression_recognition', 'color_constancy', 'pixel_localization', 'science', 'gaze_estimation', 'counting_by_visual_prompting', 'remote_sensing_object_detection', 'som_recognition', 'artwork_emotion_recognition', 'animated_character_recognition', 'body_emotion_recognition', 'scene_emotion_recognition', 'celebrity_recognition', 'image_dense_captioning', 'profession_recognition', 'image_based_action_recognition', 'plant_recognition', 'astronomical_recognition', 'film_and_television_recognition', 'religious_recognition', 'ravens_progressive_matrices', 'traffic_sign_understanding', 'sketch2code', 'geometrical_perspective', 'clock_reading', 'disease_diagnose', 'image_quality_assessment', 'gui_general', 'rock_recognition', 'small_object_detection', 'weather_recognition', 'logo_and_brand_recognition', 'social_relation_recognition', 'chart_to_table', 'medical_modality_recognition', 'temporal_anticipation', 'polygon_localization', 'sculpture_recognition', 'weapon_recognition', 'abstract_visual_recognition', 'single_object_tracking', 'doc_vqa', 'behavior_anomaly_detection', 'art_design', 'eqn2latex', 'relation_hallucination', 'fashion_recognition', 'shape_recognition', 'animals_recognition', 'object_detection', 'visual_document_information_extraction', 'deepfake_detection', 'temporal_ordering',
'human_object_interaction_recognition', 'exist_hallucination', 'mevis', 'reason_seg', 'humanitites_social_science', 'handwritten_retrieval', 'landmark_recognition', 'food_recognition', 'image_captioning', 'interactive_segmentation', 'scene_recognition', 'scene_text_recognition', 'salient_object_detection_rgb', 'image_matting', 'spot_the_similarity', 'referring_detection', 'crowd_counting', 'clothes_keypoint_detection', 'disaster_recognition', 'meme_image_understanding', 'camouflage_object_detection', 'scene_graph_recognition', 'counting_by_reasoning', 'color_assimilation', 'gesture_recognition', 'industrial_produce_anomaly_detection', 'traffic_participants_understanding', 'texture_material_recognition', 'business', 'health_medicine', 'traffic_light_understanding', 'face_mask_anomaly_dectection', 'other_biological_attributes', 'text2image_retrieval', 'color_recognition', 'spot_the_diff', 'human_interaction_understanding', 'gui_install', 'image_season_recognition', 'building_recognition', 'facial_expression_recognition', 'web_shopping', 'counting_by_category', 'instance_captioning', 'tech_engineering', 'depth_estimation', 'handwritten_text_recognition', 'helmet_anomaly_detection', 'furniture_keypoint_detection', 'image_captioning_paragraph', 'face_detection', 'vehicle_retrieval', 'electronic_object_recognition', 'multiple_image_captioning', 'pixel_recognition', 'threed_cad_recognition', 'geometrical_relativity', 'sign_language_recognition', 'meme_vedio_understanding', 'salient_object_detection_rgbd', 'muscial_instrument_recognition', 'sports_recognition', 'temporal_localization', 'action_quality_assessment', 'lvlm_response_judgement', 'national_flag_recognition', 'chemical_apparatusn_recognition', 'temporal_sequence_understanding', 'transparent_object_detection', 'image_colorization', 'multiple_view_image_understanding', 'order_hallucination', 'font_recognition', 'human_keypoint_detection', 'attribute_hallucination', 'general_action_recognition', 'google_apps',
'anatomy_identification', 'person_reid', 'threed_indoor_recognition', 'chart_to_text', 'chart_vqa']
```

mary-0830 commented 1 month ago

Oh, I re-downloaded the file and encountered this error:

```
Traceback (most recent call last):
  File "/ssd/ljj/vlmevalkit_run.py", line 196, in <module>
    main()
  File "/ssd/ljj/vlmevalkit_run.py", line 181, in main
    eval_results = dataset.evaluate(result_file, **judge_kwargs)
  File "/ssd/ljj/vlmeval/dataset/image_mcq.py", line 184, in evaluate
    assert k in meta_q_map and data_map[k] == meta_q_map[k], (
AssertionError: eval_file should be the same as or a subset of dataset MMT-Bench_VAL_MI.
data_map[k]: Given the image, please generate detailed steps to complete the following task: put the _x0008_ bottle on the fridge.,
meta_q_map[k]: Given the image, please generate detailed steps to complete the following task: put the bottle on the fridge.,
k: 8475
```

The information above is from my own print statements.
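The `_x0008_` in the failing question is the Excel/OpenXML escape for the control character U+0008 (backspace); the mismatch suggests one side of the comparison strips the escape and the other does not. If you prefer to normalize the text rather than delete the assert, here is an illustrative sketch (my own, not the project's fix) of stripping such escapes:

```python
import re

s = "put the _x0008_ bottle on the fridge."

# Strip Excel-style _xHHHH_ control-character escapes, then collapse the
# double space they leave behind.
cleaned = re.sub(r"_x[0-9A-Fa-f]{4}_", "", s)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
print(cleaned)
```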

junming-yang commented 1 month ago

You can just remove this assert. We have fixed this problem in the latest main branch.

mary-0830 commented 1 month ago

> You can just remove this assert. We have fixed this problem in the latest main branch.

ok!

mary-0830 commented 1 month ago

BTW, I would like to ask what the difference is between the four MMT-Bench variants. Which score is displayed on the official leaderboard?

junming-yang commented 1 month ago

We report the MMT-Bench_VAL score on our leaderboard. The ALL dataset includes both test and validation data, and MI denotes multi-image input. For example, MMT-Bench_VAL concatenates the multiple images into a single image, while MMT-Bench_VAL_MI keeps the original multi-image input.
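For illustration, pasting several images side by side onto one canvas, which is conceptually what the non-MI variant does, can be sketched with Pillow. This is an assumption about the preprocessing, not the toolkit's actual implementation:

```python
from PIL import Image

def concat_images(images):
    # Paste images left to right on a single white canvas.
    width = sum(im.width for im in images)
    height = max(im.height for im in images)
    canvas = Image.new("RGB", (width, height), "white")
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas

imgs = [Image.new("RGB", (64, 48), "red"), Image.new("RGB", (32, 48), "blue")]
merged = concat_images(imgs)
print(merged.size)  # (96, 48)
```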

mary-0830 commented 1 month ago

> We report the MMT-Bench_VAL score on our leaderboard. The ALL dataset includes both test and validation data, and MI denotes multi-image input. For example, MMT-Bench_VAL concatenates the multiple images into a single image, while MMT-Bench_VAL_MI keeps the original multi-image input.

OK, thanks for your reply!