Open HNUZCC opened 1 month ago
Hi HNUZCC,
I'm sorry for not getting back to you sooner. I just saw this message.
In your experiments, you ran on the full dataset, which has 2,785 instances: 1,271 labelled as requiring retrieval and 1,514 labelled as not requiring retrieval. We present the results for this setting in Table 10 in the Appendix (see Appendix A.6). The 28.2 Always Retrieval match score for Model-based TinyLlama (1.1B) in Table 1, on the other hand, was evaluated only on the 1,271 questions that require retrieval.
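As a quick sanity check, the sketch below recomputes two of the percentages in your log from the raw totals. It uses only the counts quoted above and the totals from your posted score file; the variable and function names are illustrative and are not taken from the repository.

```python
# Subset sizes quoted above.
needs_retrieval = 1271
no_retrieval = 1514
total = needs_retrieval + no_retrieval

# Raw totals copied from the posted score file.
match_total = 1667
accuracy_total = 954.0


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


overall_match = percent(match_total, total)
overall_accuracy = percent(accuracy_total, total)
print(total, overall_match, overall_accuracy)
```

The recomputed values agree with the 59.9 match score and 34.3 accuracy in your log, so those figures are averages over all 2,785 instances rather than over the 1,271-question subset used in Table 1.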
As stated in Section 3.2, the match score differs from strict matching: it measures whether any gold answer is included in the model prediction. For example, if "Canada" is the gold answer and the model prediction is "The answer is Canada", the match score would be 1, but the strict match score would be 0.
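The difference can be illustrated with a minimal sketch. The two functions below are hypothetical and simplified (lowercasing only); the evaluation code in the repository may normalize text differently.

```python
def match_score(prediction, gold_answers):
    """Return 1 if any gold answer appears as a substring of the prediction."""
    pred = prediction.lower()
    return int(any(gold.lower() in pred for gold in gold_answers))


def strict_match_score(prediction, gold_answers):
    """Return 1 only if the whole prediction equals some gold answer."""
    pred = prediction.strip().lower()
    return int(any(pred == gold.strip().lower() for gold in gold_answers))


prediction = "The answer is Canada"
gold = ["Canada"]
print(match_score(prediction, gold), strict_match_score(prediction, gold))
```

A verbose model that wraps the answer in a full sentence can therefore score well on match while scoring close to zero on strict match.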
I ran bash run_lm.sh, and it shows:
======= estimate no retrieval (q) API cost: 0.017889500000000003, total tokens #: 35779 ================
======= estimate always retrieval (q+context) API cost: 0.892045, total tokens #: 1784090 ================
======= total retrieval: [2785/2785] ================
{'data_source': 'retrievalqa', 'total_data_count': 2785, 'retrieval_frequency': 2785, 'retrieval_rate': 100.0, 'match_score': 59.9, 'f1_score': 15.2, 'em_score': 0.1, 'accuracy_score': 34.3, 'match_total': 1667, 'f1_total': 424.5294026557026, 'em_total': 4.0, 'accuracy_total': 954.0, 'total_q_tokens': 35779, 'total_context_tokens': 1748311, 'total_no_retrieval_tokens': 35779, 'total_always_retrieval_tokens': 1748311, 'estimate_no_retrieval_cost': 0.017889500000000003, 'estimate_always_retrieval_cost': 0.892045, 'saved_cost_rate': 0.9799455184435762, 'args': {'openai_config_path': './openai_config.txt', 'data_source': 'retrievalqa', 'retrieval_mode': 'always_retrieval', 'input_data_path': './data/retrievalqa.jsonl', 'output_score_path': './results/always_retrieval/TinyLlama/TinyLlama-1.1B-Chat-v1.0/m=vanilla/t=0.0/score_retrievalqa_seed20.json', 'output_prediction_path': './results/always_retrieval/TinyLlama/TinyLlama-1.1B-Chat-v1.0/m=vanilla/t=0.0/predict_retrievalqa_seed20.jsonl', 'model_name': 'TinyLlama/TinyLlama-1.1B-Chat-v1.0', 'max_tokens': 100, 'batch_size': 1, 'doc_top_n': 5, 'limit_input': 0, 'prompt_method': 'vanilla', 'seed': 20, 'temperature': 0.0, 'top_p': 1.0, 'world_size': 1}} ./results/always_retrieval/TinyLlama/TinyLlama-1.1B-Chat-v1.0/m=vanilla/t=0.0 ./results/always_retrieval/TinyLlama/TinyLlama-1.1B-Chat-v1.0/m=vanilla/t=0.0
But the paper reports an Always Retrieval match score of 28.2 for Model-based TinyLlama (1.1B). What does the match score mean? My reproduced result seems inconsistent with the reported one. Is this a misunderstanding on my part, or an error in how I ran the experiment?