princeton-nlp / HELMET

The HELMET Benchmark
https://arxiv.org/abs/2410.02694
MIT License

Discrepancy in gpt-4o-mini Results on MSMARCO Compared to Reported Results #6

Open · 8188zq opened this issue 3 days ago

8188zq commented 3 days ago

Thank you for providing this excellent benchmark and sharing the evaluation results across various models!

When we tested the MSMARCO task with the evaluation code provided in this GitHub repository, we observed results that differ significantly from the numbers reported in the paper. Specifically, using the same model (gpt-4o-mini-2024-07-18), we only achieved roughly 67 (8k), 58 (16k), 46 (32k), 30 (64k), and 22 (128k) across the different input lengths.

Could you kindly advise if there's anything extra we need to do in order to reproduce the results from the paper? Currently, we are using the code exactly as provided in this repository.

Thank you!

howard-yen commented 3 days ago

Hi, thank you for your interest in our work!

Could you show me the arguments that were saved in your output file, as well as the name of the file?

8188zq commented 2 days ago

Of course! The arguments saved in the output file are as follows:

"args": {
    "config": "configs/rerank.yaml",
    "tag": "eval",
    "model_name_or_path": "gpt-4o-mini-2024-07-18",
    "use_vllm": false,
    "datasets": "msmarco_rerank_psg",
    "demo_files": "data/msmarco/test_reranking_data_k10_dep3.jsonl",
    "test_files": "data/msmarco/test_reranking_data_k1000_dep3.jsonl",
    "output_dir": "output/gpt-4o-mini-2024-07-18",
    "overwrite": false,
    "max_test_samples": 100,
    "num_workers": 4,
    "num_depths": 10,
    "popularity_threshold": 3,
    "shots": 2,
    "input_max_length": 131072,
    "do_sample": false,
    "generation_max_length": 200,
    "generation_min_length": 0,
    "temperature": 1.0,
    "top_p": 1.0,
    "stop_newline": false,
    "seed": 42,
    "no_cuda": false,
    "no_bf16": false,
    "no_torch_compile": false,
    "use_chat_template": false,
    "rope_theta": null,
    "debug": false,
    "count_tokens": false,
    "stop_new_line": true
}

And the filename is msmarco_rerank_psg_eval_test_reranking_data_k1000_dep3_in131072_size100_shots2_sampFalsemax200min0t1.0p1.0_chatFalse_42.json.
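In case it helps with debugging, one quick way to spot a mismatched setting is to diff the "args" blocks saved in two output files, e.g. with a small script like the following (the file paths are placeholders):

    import json

    # Placeholder paths: point these at your own run and a reference run.
    ours_path = "output/gpt-4o-mini-2024-07-18/your_run.json"
    ref_path = "reference/reference_run.json"

    with open(ours_path) as f:
        ours = json.load(f)["args"]
    with open(ref_path) as f:
        ref = json.load(f)["args"]

    # Print every argument whose value differs between the two runs.
    for key in sorted(set(ours) | set(ref)):
        if ours.get(key) != ref.get(key):
            print(f"{key}: ours={ours.get(key)!r} ref={ref.get(key)!r}")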

Let me know if you need further information!

howard-yen commented 2 days ago

I re-ran these experiments today and got NDCG@10 = 31.45 at 128k. The key difference in the arguments is use_chat_template, which should be True for all API models (it defaults to False for running the base models). You may refer to the script scripts/run_api.sh to reproduce the results on the API models. Note that you might still see some small differences in the results (up to 1-3 absolute points) due to the nondeterministic nature of the APIs. Hope this helps!
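For reference, NDCG@10 here is the standard normalized discounted cumulative gain over the top 10 reranked passages. A minimal sketch of the computation (the standard definition, not necessarily the repo's exact implementation):

    import math

    def ndcg_at_k(relevances, k=10):
        """NDCG@k for one query, given relevance grades in ranked order."""
        def dcg(rels):
            # Standard log2 position discount (positions are 1-indexed).
            return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
        ideal = dcg(sorted(relevances, reverse=True)[:k])
        return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

    # Example: the single relevant passage was reranked to position 3.
    print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5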

8188zq commented 2 days ago

That's strange: I used the command bash scripts/run_api.sh directly, but the results are still quite different. Could you please share your output files so I can check which step may have gone wrong?

Your help is truly appreciated!

howard-yen commented 1 day ago

You can find the result files here: https://drive.google.com/file/d/1PDRFhRXn4YcZ5IH9250gC5xe4CixXK4S/view?usp=sharing

What results are you getting from bash scripts/run_api.sh? Let me know if you end up finding the differences, thanks!

8188zq commented 7 hours ago

Thank you very much! I compared the two and found that the only difference lies in the demos. Since data.py selects the demos with a (nondeterministic) hash, this difference is expected, or an extra hash-seed setting may be needed.

Beyond that, there are some fluctuations in the API results that I can't pin down: I tried using the same inputs as you (including the demos), but the results still vary quite a bit. I suspect the API itself is the cause, but I need to check further. If I find anything else later, I'll be sure to share it with you. Thanks again!

howard-yen commented 6 hours ago

Thanks for catching this! I will update the code to use a deterministic hash function for selecting the demos.
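For anyone running into the same nondeterminism: Python's built-in hash() for strings is salted per process (controlled by PYTHONHASHSEED), so using it to pick demos is not reproducible across runs. A content-based hash avoids this; a minimal sketch, not the repo's actual code:

    import hashlib

    def stable_hash(text: str) -> int:
        # Deterministic across processes and runs, unlike the built-in
        # hash(), whose string hashing is salted per process.
        return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "little")

    # Hypothetical usage: derive a demo index from a test example's ID.
    num_demos = 100
    demo_index = stable_hash("example-id-123") % num_demos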