Closed: srikanthmalla closed this issue 2 months ago
I found a way to make eval_beir.sh work by changing it to pass the right arguments to the right scripts. I added the updated shell script as a pull request; please check: https://github.com/texttron/tevatron/pull/147
I also observed a strange trend in the arguana evaluation results when testing with the parameters below (a rough sketch of the two runs follows the results):
with query_max_len: 156 and passage_max_len: 32 (the parameters the model was trained with; poor results)
Results:
recall_100 all 0.1309
ndcg_cut_10 all 0.0286
with query_max_len: 512 and passage_max_len: 512 (not the parameters the model was trained with, but much better results)
Results:
recall_100 all 0.4744
ndcg_cut_10 all 0.1921
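For reference, a minimal sketch of how the two runs above might be invoked, assuming the updated eval_beir.sh forwards --query_max_len and --passage_max_len to the underlying encode step (the flag names follow the parameter names above and are an assumption, as is the placeholder checkpoint path):
./eval_beir.sh \
  --dataset arguana \
  --model_name_path <path_to_trained_mistral_checkpoint> \
  --query_max_len 156 \
  --passage_max_len 32
# second run: set both --query_max_len and --passage_max_len to 512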
with intfloat/e5-mistral-7b-instruct, on the arguana dataset with a max_len of 512 for both query and passage:
recall_100 all 0.9595
ndcg_cut_10 all 0.3668
but they reported ndcg_at_10 on MTEB ArguAna as 61.882 (both in their paper and on https://huggingface.co/intfloat/e5-mistral-7b-instruct).
I am just worried that I am doing something wrong. I ran this:
./eval_beir.sh \
--dataset arguana \
--tokenizer intfloat/e5-mistral-7b-instruct \
--model_name_path intfloat/e5-mistral-7b-instruct \
--embedding_dir beir_embedding_arguana_e5_mistral \
--query_prefix "Query: " \
--passage_prefix "Passage: "
using the updated evaluation shell script: https://github.com/texttron/tevatron/pull/147
Hi @srikanthmalla, Thank you very much for the pull request.
For running e5-mistral, could you please try the following settings?
Set the passage prefix to "" (empty).
Set the query prefix to "Instruct: Given a claim, find documents that refute the claim\nQuery: ".
Add the --normalize flag for both query and document encoding.
These are the settings from https://arxiv.org/abs/2401.00368, i.e. no passage prefix, a task-specific query prefix, and cosine similarity over normalized embeddings.
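Putting the three changes together, the invocation would look roughly like this (a sketch assuming the updated eval_beir.sh from the PR forwards --normalize to the encode step; the bash $'...' quoting is just one way to pass the literal \n inside the query prefix):
./eval_beir.sh \
  --dataset arguana \
  --tokenizer intfloat/e5-mistral-7b-instruct \
  --model_name_path intfloat/e5-mistral-7b-instruct \
  --embedding_dir beir_embedding_arguana_e5_mistral \
  --query_prefix $'Instruct: Given a claim, find documents that refute the claim\nQuery: ' \
  --passage_prefix '' \
  --normalize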
Hi @MXueguang, thank you for the response. It works now! I made those three changes and ran it.
I got this performance on arguana:
recall_100 all 0.9936
ndcg_cut_10 all 0.6204
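For anyone reproducing these numbers: the recall_100 and ndcg_cut_10 lines above are standard trec_eval output, which can be produced, for example, with pyserini's trec_eval wrapper (the qrels and run file paths below are placeholders):
python -m pyserini.eval.trec_eval -c -m recall.100 -m ndcg_cut.10 <qrels_file> <run_file>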
Hi, Thanks for making a consolidated training and evaluation repo for IR research.
I was able to train mistral-7b with PyTorch and DeepSpeed as shown in the example in the README. Now I want to evaluate on BEIR, but eval_beir.sh looks like it was moved from 1.0 to main without being updated: it doesn't work, and the faiss_retriever it calls is missing as well. Are there any plans to fix the evaluation on main, if it's simple?
Should I use 1.0 instead to retrain and evaluate? The README in 1.0 points to main, though. Would that work, or does 1.0 have a different example to run (i.e. a different set of flags or params)?
Please let me know.
Thanks, Srikanth