Hi @mary-0830, I tried to reproduce this problem with the latest main branch using the command `python3 run.py --model InternVL2-1B --data MMT-Bench_VAL_MI`. No error is reported. It seems that your MMT-Bench_VAL_MI.tsv file is incomplete. Maybe you can check or re-download the benchmark file?
I downloaded this file: 'MMT-Bench_VAL_MI': 'https://opencompass.openxlab.space/utils/VLMEval/MMT-Bench_VAL_MI.tsv', and I found an error in the data file. Why do I get this problem when I open it with VSCode?
In addition, can you print your abilities list? @junming-yang
Some questions in the MMT benchmark file have decoding problems. This should have no effect on running the program.
Here is our abilities list:
['color_contrast', 'screenshot2code', 'jigsaw_puzzle_solving', 'image2image_retrieval', 'sketch2image_retrieval', 'table_structure_recognition', 'visual_prompt_understanding', 'facail_expression_change_recognition', 'traffic_anomaly_detection', 'point_tracking', 'writing_poetry_from_image', 'vehicle_keypoint_detection', 'painting_recognition', 'age_gender_race_recognition', 'multiple_instance_captioning', 'face_retrieval', 'waste_recognition', 'whoops', 'one_shot_detection', 'vehicle_recognition', 'video_captioning', 'handwritten_mathematical_expression_recognition', 'rotated_object_detection', 'animal_keypoint_detection', 'lesion_grading', 'navigation', 'next_img_prediction', 'micro_expression_recognition', 'color_constancy', 'pixel_localization', 'science', 'gaze_estimation', 'counting_by_visual_prompting', 'remote_sensing_object_detection', 'som_recognition', 'artwork_emotion_recognition', 'animated_character_recognition', 'body_emotion_recognition', 'scene_emotion_recognition', 'celebrity_recognition', 'image_dense_captioning', 'profession_recognition', 'image_based_action_recognition', 'plant_recognition', 'astronomical_recognition', 'film_and_television_recognition', 'religious_recognition', 'ravens_progressive_matrices', 'traffic_sign_understanding', 'sketch2code', 'geometrical_perspective', 'clock_reading', 'disease_diagnose', 'image_quality_assessment', 'gui_general', 'rock_recognition', 'small_object_detection', 'weather_recognition', 'logo_and_brand_recognition', 'social_relation_recognition', 'chart_to_table', 'medical_modality_recognition', 'temporal_anticipation', 'polygon_localization', 'sculpture_recognition', 'weapon_recognition', 'abstract_visual_recognition', 'single_object_tracking', 'doc_vqa', 'behavior_anomaly_detection', 'art_design', 'eqn2latex', 'relation_hallucination', 'fashion_recognition', 'shape_recognition', 'animals_recognition', 'object_detection', 'visual_document_information_extraction', 'deepfake_detection', 'temporal_ordering', 'human_object_interaction_recognition', 'exist_hallucination', 'mevis', 'reason_seg', 'humanitites_social_science', 'handwritten_retrieval', 'landmark_recognition', 'food_recognition', 'image_captioning', 'interactive_segmentation', 'scene_recognition', 'scene_text_recognition', 'salient_object_detection_rgb', 'image_matting', 'spot_the_similarity', 'referring_detection', 'crowd_counting', 'clothes_keypoint_detection', 'disaster_recognition', 'meme_image_understanding', 'camouflage_object_detection', 'scene_graph_recognition', 'counting_by_reasoning', 'color_assimilation', 'gesture_recognition', 'industrial_produce_anomaly_detection', 'traffic_participants_understanding', 'texture_material_recognition', 'business', 'health_medicine', 'traffic_light_understanding', 'face_mask_anomaly_dectection', 'other_biological_attributes', 'text2image_retrieval', 'color_recognition', 'spot_the_diff', 'human_interaction_understanding', 'gui_install', 'image_season_recognition', 'building_recognition', 'facial_expression_recognition', 'web_shopping', 'counting_by_category', 'instance_captioning', 'tech_engineering', 'depth_estimation', 'handwritten_text_recognition', 'helmet_anomaly_detection', 'furniture_keypoint_detection', 'image_captioning_paragraph', 'face_detection', 'vehicle_retrieval', 'electronic_object_recognition', 'multiple_image_captioning', 'pixel_recognition', 'threed_cad_recognition', 'geometrical_relativity', 'sign_language_recognition', 'meme_vedio_understanding', 'salient_object_detection_rgbd', 
'muscial_instrument_recognition', 'sports_recognition', 'temporal_localization', 'action_quality_assessment', 'lvlm_response_judgement', 'national_flag_recognition', 'chemical_apparatusn_recognition', 'temporal_sequence_understanding', 'transparent_object_detection', 'image_colorization', 'multiple_view_image_understanding', 'order_hallucination', 'font_recognition', 'human_keypoint_detection', 'attribute_hallucination', 'general_action_recognition', 'google_apps', 'anatomy_identification', 'person_reid', 'threed_indoor_recognition', 'chart_to_text', 'chart_vqa']
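If you want to print the abilities yourself, here is a minimal sketch with pandas, assuming the labels live in a `category` column of the TSV (the actual column name may differ):

```python
import pandas as pd

# Load the benchmark file; MMT-Bench files are tab-separated.
data = pd.read_csv('MMT-Bench_VAL_MI.tsv', sep='\t')

# dropna() removes missing cells, which pandas reads as float
# NaN and which cannot be sorted against string labels.
abilities = sorted(set(data['category'].dropna()))
print(abilities)
```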
Oh, I re-downloaded the file and still encounter an error, like this:
```
Traceback (most recent call last):
  File "/ssd/ljj/vlmevalkit_run.py", line 196, in <module>
    main()
  File "/ssd/ljj/vlmevalkit_run.py", line 181, in main
    eval_results = dataset.evaluate(result_file, **judge_kwargs)
  File "/ssd/ljj/vlmeval/dataset/image_mcq.py", line 184, in evaluate
    assert k in meta_q_map and data_map[k] == meta_q_map[k], (
AssertionError: eval_file should be the same as or a subset of dataset MMT-Bench_VAL_MI.
data_map[k]: Given the image, please generate detailed steps to complete the following task: put the _x0008_ bottle on the fridge.,
meta_q_map[k]: Given the image, please generate detailed steps to complete the following task: put the bottle on the fridge.,
k: 8475
```
The above is the information from my print.
You can just remove this assert. We have fixed this problem in the latest main branch.
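If updating is not convenient, a minimal workaround sketch is to normalize both questions before comparing, stripping Excel-style escapes such as `_x0008_`. The helper below is hypothetical, not code from VLMEvalKit:

```python
import re

def normalize_question(text):
    # Hypothetical helper: drop Excel-style control-character
    # escapes like _x0008_, then collapse leftover whitespace.
    text = re.sub(r'_x[0-9A-Fa-f]{4}_', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Compare normalized questions instead of raw strings, so a
# stray _x0008_ in one file no longer trips the assertion.
assert k in meta_q_map and \
    normalize_question(data_map[k]) == normalize_question(meta_q_map[k])
```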
ok!
BTW, I would like to ask: what is the difference between the four MMT-Bench variants? Which score is displayed on the official leaderboard?
We report the MMT-Bench_VAL score on our leaderboard. The ALL dataset includes both test and validation data, and MI denotes multi-image input. E.g., MMT-Bench_VAL concatenates the multiple images of a sample into a single image, while MMT-Bench_VAL_MI keeps the original multi-image input.
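To illustrate the difference, here is a minimal PIL sketch of what concatenating a multi-image sample into a single image can look like; it shows the idea, not VLMEvalKit's exact implementation:

```python
from PIL import Image

def concat_images(images):
    # Paste the images side by side on one canvas so that a
    # single-image model can still see a multi-image sample.
    height = max(im.height for im in images)
    width = sum(im.width for im in images)
    canvas = Image.new('RGB', (width, height), 'white')
    x = 0
    for im in images:
        canvas.paste(im, (x, 0))
        x += im.width
    return canvas
```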
okok, thanks for your reply!
Hi, I had problems evaluating MMT-Bench_VAL_MI, as shown below:
abilities: ['navigation', 'image_matting', nan, 'meme_vedio_understanding', 'single_object_tracking', 'counting_by_visual_prompting', 'chart_to_table', 'multiple_view_image_understanding', 'table_structure_recognition', 'vehicle_keypoint_detection', 'furniture_keypoint_detection', 'artwork_emotion_recognition', 'referring_detection', 'spot_the_diff', 'polygon_localization', 'jigsaw_puzzle_solving', 'interactive_segmentation', 'clothes_keypoint_detection', 'pixel_localization', 'chart_vqa', 'point_tracking', 'counting_by_reasoning', 'ravens_progressive_matrices', 'pixel_recognition', 'clock_reading', 'counting_by_category', 'chart_to_text', 'crowd_counting', 'micro_expression_recognition', 'image_colorization', 'body_emotion_recognition', 'reason_seg', 'doc_vqa', 'facail_expression_change_recognition', 'human_keypoint_detection', 'spot_the_similarity', 'visual_document_information_extraction', 'facial_expression_recognition', 'depth_estimation', 'one_shot_detection', 'whoops', 'meme_image_understanding', 'animal_keypoint_detection', 'scene_emotion_recognition']

```
Traceback (most recent call last):
  File "/ssd/ljj/vlmevalkit_run.py", line 196, in <module>
    main()
  File "/ssd/ljj/vlmevalkit_run.py", line 181, in main
    eval_results = dataset.evaluate(result_file, **judge_kwargs)
  File "/ssd/ljj/vlmeval/dataset/image_mcq.py", line 251, in evaluate
    acc = report_acc_MMT(data_main)
  File "/ssd/ljj/vlmeval/dataset/utils/multiple_choice.py", line 121, in report_acc_MMT
    abilities.sort()
TypeError: '<' not supported between instances of 'float' and 'str'
```
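The `nan` in that abilities list is the culprit: an empty category cell is read as a float NaN, and Python cannot order a float against string labels, so `abilities.sort()` raises the TypeError. A minimal sketch of a guard one could add before the sort (this illustrates the fix, not necessarily the repo's exact patch):

```python
# Keep only string labels; an empty category cell becomes a
# float NaN in pandas and breaks the mixed-type sort.
abilities = [a for a in abilities if isinstance(a, str)]
abilities.sort()
```

It may also be worth checking which row of your result file has an empty category, since that is where the NaN comes from.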