texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Evaluation on main branch is not working #146

Closed srikanthmalla closed 2 months ago

srikanthmalla commented 2 months ago

Hi, thanks for making a consolidated training and evaluation repo for IR research.

I was able to train Mistral-7B with PyTorch and DeepSpeed as shown in the example in the README. Now I want to evaluate on BEIR, but eval_beir.sh looks like it was carried over from 1.0 to main without being updated: it doesn't work, and the faiss_retriever module it calls no longer exists on main. Are there any plans to fix evaluation on main, if it's simple?

Should I use 1.0 instead to retrain and evaluate? But the README in 1.0 points to main. Would that work, or does 1.0 have a different example to run (meaning a different set of flags or params)?

Please let me know.

Thanks, Srikanth

srikanthmalla commented 2 months ago

I found a way to make eval_beir.sh work by changing it to pass the right arguments to the right scripts. I added the updated shell script as a pull request; please take a look: https://github.com/texttron/tevatron/pull/147
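
For reference, the core of the change is routing the retrieval step to the main-branch search driver instead of the removed tevatron.faiss_retriever module. Below is a minimal sketch of that step; the module path, flags, and embedding paths are my reading of the main-branch layout, and PR #147 has the exact script:

    # Sketch only: the retrieval step on main, replacing the old
    # "python -m tevatron.faiss_retriever" call. Embedding paths here are
    # illustrative; see PR #147 for the exact arguments.
    python -m tevatron.retriever.driver.search \
        --query_reps beir_embedding_arguana/query.arguana.pkl \
        --passage_reps 'beir_embedding_arguana/corpus.arguana.*.pkl' \
        --depth 100 \
        --batch_size 64 \
        --save_text \
        --save_ranking_to beir_embedding_arguana/rank.arguana.txt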

srikanthmalla commented 2 months ago

I also observed an odd trend in the ArguAna evaluation results when testing with the parameters below (a sketch of where these flags go in the encode step follows the list):

  1. with query_max_len: 156 and passage_max_len: 32 (the parameters used for training the model; bad results):

     Results:
     recall_100              all 0.1309
     ndcg_cut_10             all 0.0286

  2. with query_max_len: 512 and passage_max_len: 512 (not the parameters trained with, but much better results):

     Results:
     recall_100              all 0.4744
     ndcg_cut_10             all 0.1921
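
For context, the two runs differ only in the length flags passed to the encoding step of the eval script, roughly as in the sketch below. The module path and flag names are my reading of the main-branch encode driver, and the checkpoint path is a placeholder, so treat this as an illustration rather than the exact command:

    # Sketch of the query/passage encoding step for setting (2) above.
    # Flag names are assumed from the main-branch encode driver;
    # <trained-mistral-checkpoint> is a placeholder for the trained model.
    python -m tevatron.retriever.driver.encode \
        --output_dir temp \
        --model_name_or_path <trained-mistral-checkpoint> \
        --bf16 \
        --per_device_eval_batch_size 16 \
        --query_max_len 512 \
        --passage_max_len 512 \
        --dataset_name Tevatron/beir \
        --dataset_config arguana \
        --dataset_split test \
        --encode_is_query \
        --encode_output_path beir_embedding_arguana/query.arguana.pkl
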
srikanthmalla commented 2 months ago

With intfloat/e5-mistral-7b-instruct on the ArguAna dataset, with max_len of 512 for both query and passage:

recall_100              all 0.9595
ndcg_cut_10             all 0.3668

But they report ndcg_at_10 on MTEB ArguAna as 61.882 (both in their paper and on https://huggingface.co/intfloat/e5-mistral-7b-instruct).

I am worried that I am doing something wrong. I ran this:

./eval_beir.sh \
    --dataset arguana \
    --tokenizer intfloat/e5-mistral-7b-instruct \
    --model_name_path intfloat/e5-mistral-7b-instruct \
    --embedding_dir beir_embedding_arguana_e5_mistral \
    --query_prefix "Query: " \
    --passage_prefix "Passage: "

using the updated evaluation shell script from https://github.com/texttron/tevatron/pull/147.

MXueguang commented 2 months ago

Hi @srikanthmalla, Thank you very much for the pull request.

For running e5-mistral, could you please try the following settings?

Set the passage prefix to "" (empty)
Set the query prefix to "Instruct: Given a claim, find documents that refute the claim\nQuery: "
Add the --normalize flag for both query and document encoding.

These are the settings from https://arxiv.org/abs/2401.00368, i.e. no passage prefix, a task-specific query prefix, and cosine similarity for the embeddings; a sketch of the adjusted invocation is below.
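
With the script from #147, the call would look roughly like this. It is only a sketch: it assumes the wrapper forwards --normalize to both encode calls, and it uses bash $'...' quoting so that \n becomes a real newline in the query prefix:

    # Sketch, reusing the flags of the eval_beir.sh from PR #147.
    # $'...' quoting turns \n into an actual newline inside the query prefix.
    # Whether --normalize is passed through to both encode calls depends on
    # the wrapper script; it may need to be added there.
    ./eval_beir.sh \
        --dataset arguana \
        --tokenizer intfloat/e5-mistral-7b-instruct \
        --model_name_path intfloat/e5-mistral-7b-instruct \
        --embedding_dir beir_embedding_arguana_e5_mistral \
        --query_prefix $'Instruct: Given a claim, find documents that refute the claim\nQuery: ' \
        --passage_prefix "" \
        --normalize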

srikanthmalla commented 2 months ago

Hi @MXueguang, thank you for the response. It works now! I made those three changes and reran it.

I got this performance on ArguAna:

recall_100              all 0.9936
ndcg_cut_10             all 0.6204