princeton-nlp / HELMET

The HELMET Benchmark
https://arxiv.org/abs/2410.02694
MIT License

Discrepancy in gpt-4o-mini Results on MSMARCO Compared to Reported Results #6

Open · 8188zq opened this issue 3 days ago

8188zq commented 3 days ago

Thank you for providing this excellent benchmark and sharing the evaluation results across various models!

When we tested the MSMARCO task with the evaluation code provided in this GitHub repository, we observed results that differ significantly from the numbers reported in the paper. Specifically, using the same model (gpt-4o-mini-2024-07-18), we only achieved roughly 67 (8k), 58 (16k), 46 (32k), 30 (64k), and 22 (128k) across the different input lengths.

Could you kindly advise if there's anything extra we need to do in order to reproduce the results from the paper? Currently, we are using the code exactly as provided in this repository.

Thank you!

howard-yen commented 3 days ago

Hi, thank you for your interest in our work!

Could you show me the arguments that were saved in your output file, as well as the name of the file?

8188zq commented 2 days ago

Of course! The arguments saved in the output file are as follows:

"args": {
    "config": "configs/rerank.yaml",
    "tag": "eval",
    "model_name_or_path": "gpt-4o-mini-2024-07-18",
    "use_vllm": false,
    "datasets": "msmarco_rerank_psg",
    "demo_files": "data/msmarco/test_reranking_data_k10_dep3.jsonl",
    "test_files": "data/msmarco/test_reranking_data_k1000_dep3.jsonl",
    "output_dir": "output/gpt-4o-mini-2024-07-18",
    "overwrite": false,
    "max_test_samples": 100,
    "num_workers": 4,
    "num_depths": 10,
    "popularity_threshold": 3,
    "shots": 2,
    "input_max_length": 131072,
    "do_sample": false,
    "generation_max_length": 200,
    "generation_min_length": 0,
    "temperature": 1.0,
    "top_p": 1.0,
    "stop_newline": false,
    "seed": 42,
    "no_cuda": false,
    "no_bf16": false,
    "no_torch_compile": false,
    "use_chat_template": false,
    "rope_theta": null,
    "debug": false,
    "count_tokens": false,
    "stop_new_line": true
}

And the filename is msmarco_rerank_psg_eval_test_reranking_data_k1000_dep3_in131072_size100_shots2_sampFalsemax200min0t1.0p1.0_chatFalse_42.json.
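In case it helps with debugging, one quick way to spot a mismatched setting is to diff the "args" blocks saved in two output files, e.g. with a small script like the following (the file paths are placeholders):

    import json

    # Placeholder paths: point these at your own run and a reference run.
    ours_path = "output/gpt-4o-mini-2024-07-18/your_run.json"
    ref_path = "reference/reference_run.json"

    with open(ours_path) as f:
        ours = json.load(f)["args"]
    with open(ref_path) as f:
        ref = json.load(f)["args"]

    # Print every argument whose value differs between the two runs.
    for key in sorted(set(ours) | set(ref)):
        if ours.get(key) != ref.get(key):
            print(f"{key}: ours={ours.get(key)!r} ref={ref.get(key)!r}")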

Let me know if you need further information!

howard-yen commented 2 days ago

I re-ran these experiments today and got NDCG@10 = 31.45 at 128k. The key difference in the arguments is use_chat_template, which should be True for all API models (it defaults to False for running the base models). You may refer to the script scripts/run_api.sh to reproduce the results on the API models. Note that you might still see some small differences in the results (up to 1-3 absolute points) due to the nondeterministic nature of the APIs. Hope this helps!
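For reference, NDCG@10 here is the standard normalized discounted cumulative gain over the top 10 reranked passages. A minimal sketch of the computation (the standard definition, not necessarily the repo's exact implementation):

    import math

    def ndcg_at_k(relevances, k=10):
        """NDCG@k for one query, given relevance grades in ranked order."""
        def dcg(rels):
            # Standard log2 position discount (positions are 1-indexed).
            return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))
        ideal = dcg(sorted(relevances, reverse=True)[:k])
        return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

    # Example: the single relevant passage was reranked to position 3.
    print(ndcg_at_k([0, 0, 1, 0, 0]))  # 0.5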

8188zq commented 2 days ago

That's strange: I used the command bash scripts/run_api.sh directly, but the results are still quite different. Could you please share your output files so I can check which step may have gone wrong?

Your help is truly appreciated!

howard-yen commented 1 day ago

You can find the result files here: https://drive.google.com/file/d/1PDRFhRXn4YcZ5IH9250gC5xe4CixXK4S/view?usp=sharing

What results are you getting from bash scripts/run_api.sh? Let me know if you end up finding the differences, thanks!

8188zq commented 7 hours ago

Thank you very much! I compared the two and found that the only difference lies in the demos. Since data.py selects the demos with a (nondeterministic) hash, this difference is expected, or an extra hash-seed setting may be needed.

Beyond that, there are some fluctuations in the API results that I can't pin down: I tried using the same inputs as you (including the demos), but the results still vary quite a bit. I suspect the API itself is the cause, but I need to check further. If I find anything else later, I'll be sure to share it with you. Thanks again!

howard-yen commented 6 hours ago

Thanks for catching this! I will update the code to use a deterministic hash function for selecting the demos.
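For anyone running into the same nondeterminism: Python's built-in hash() for strings is salted per process (controlled by PYTHONHASHSEED), so using it to pick demos is not reproducible across runs. A content-based hash avoids this; a minimal sketch, not the repo's actual code:

    import hashlib

    def stable_hash(text: str) -> int:
        # Deterministic across processes and runs, unlike the built-in
        # hash(), whose string hashing is salted per process.
        return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "little")

    # Hypothetical usage: derive a demo index from a test example's ID.
    num_demos = 100
    demo_index = stable_hash("example-id-123") % num_demos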