Open 8188zq opened 3 days ago
Hi, thank you for your interest in our work!
Can you please show me the arguments that were saved in your output files and the name of the output file?
Of course! The arguments saved in the output file are as follows:
"args": {
"config": "configs/rerank.yaml",
"tag": "eval",
"model_name_or_path": "gpt-4o-mini-2024-07-18",
"use_vllm": false,
"datasets": "msmarco_rerank_psg",
"demo_files": "data/msmarco/test_reranking_data_k10_dep3.jsonl",
"test_files": "data/msmarco/test_reranking_data_k1000_dep3.jsonl",
"output_dir": "output/gpt-4o-mini-2024-07-18",
"overwrite": false,
"max_test_samples": 100,
"num_workers": 4,
"num_depths": 10,
"popularity_threshold": 3,
"shots": 2,
"input_max_length": 131072,
"do_sample": false,
"generation_max_length": 200,
"generation_min_length": 0,
"temperature": 1.0,
"top_p": 1.0,
"stop_newline": false,
"seed": 42,
"no_cuda": false,
"no_bf16": false,
"no_torch_compile": false,
"use_chat_template": false,
"rope_theta": null,
"debug": false,
"count_tokens": false,
"stop_new_line": true
}
And the filename is `msmarco_rerank_psg_eval_test_reranking_data_k1000_dep3_in131072_size100_shots2_sampFalsemax200min0t1.0p1.0_chatFalse_42.json`.

Let me know if you need further information!
I re-ran these experiments today and got NDCG@10 = 31.45 on 128k. It appears that the key difference between the arguments is `use_chat_template`, which should be `True` for all API models (it is set to `False` by default for running the base models). You may refer to the script `scripts/run_api.sh` to reproduce the results on the API models.
Note that you might still see some small differences in the results (up to 1–3 absolute points) due to the nondeterministic nature of the APIs.
Hope this helps!
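For reference, here is a minimal sketch of the NDCG@10 metric discussed above, using the standard definition; the repository's own implementation may differ in details such as gain weighting:

```python
# Hedged sketch: standard NDCG@k from graded relevance labels.
# Not necessarily identical to the repo's evaluation code.
import math

def ndcg_at_k(relevances, k=10):
    """Compute NDCG@k for a ranked list of relevance labels.

    `relevances[i]` is the graded relevance of the item ranked at
    position i (0-indexed). Returns a value in [0, 1].
    """
    # Discounted cumulative gain of the given ranking.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    # Ideal DCG: same labels sorted from most to least relevant.
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

A score like the 31.45 above would correspond to `100 * ndcg_at_k(...)` averaged over queries.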
That's strange: I ran `bash scripts/run_api.sh` directly, but the results are still quite different. Could you please share your output files so I can check which step may have gone wrong? I'd really appreciate it!
You can find the result files here: https://drive.google.com/file/d/1PDRFhRXn4YcZ5IH9250gC5xe4CixXK4S/view?usp=sharing

What results are you getting from `bash scripts/run_api.sh`? Let me know if you end up finding the differences, thanks!
Thank you very much! I compared the two and found that the only difference lies in the demos. Since the code in `data.py` uses Python's built-in hash, this difference is expected (or an extra hash-seed setting may be needed). There are also some fluctuations in the API results, and I'm not exactly sure where they come from: I tried using the same inputs as you (including the demos), but the results still vary quite a bit. I suspect the API is the cause, but I need to check further. If I find anything else, I'll be sure to share it with you. Thanks again!
Thanks for catching this! I will update the code with a deterministic hash function to get the demos.
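As a hedged illustration of why the demos can differ across runs: Python's built-in `hash()` for strings is salted per process (controlled by `PYTHONHASHSEED`), while a `hashlib` digest is stable across processes and machines. The function names below are illustrative, not the repository's actual code:

```python
# Hedged sketch: replacing Python's process-salted hash() with a
# deterministic hashlib digest for demo selection. Names are hypothetical.
import hashlib

def stable_hash(text: str) -> int:
    """Process-independent integer hash of a string (first 8 MD5 bytes)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")

def pick_demo_indices(query: str, num_demos: int, pool_size: int) -> list:
    """Deterministically map a query to demo indices in the demo pool."""
    h = stable_hash(query)
    return [(h + i) % pool_size for i in range(num_demos)]
```

With `hash()` in place of `stable_hash`, the selected indices would change from run to run unless `PYTHONHASHSEED` is pinned.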
Thank you for providing this excellent benchmark and sharing the evaluation results across various models!
When we tested the MSMARCO task using the evaluation code provided in this GitHub repository, we observed that the results differ significantly from the numbers reported in the paper. Specifically, we used the same model (`gpt-4o-mini-2024-07-18`), but across the different lengths in MSMARCO we only achieved scores of 67 (8k), 58 (16k), 46 (32k), 30 (64k), and 22 (128k).
Could you kindly advise if there's anything extra we need to do in order to reproduce the results from the paper? Currently, we are using the code exactly as provided in this repository.
Thank you!