texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Simplifying evaluation process #52

Open xhluca opened 2 years ago

xhluca commented 2 years ago

Right now, it's possible to train DPR with a single command via the tevatron.driver.train module. Evaluation, however, requires a longer series of commands (including a shell for loop over corpus shards), e.g. for DPR on NQ:

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

python -m tevatron.faiss_retriever \
  --query_reps query_emb.pkl \
  --passage_reps 'corpus_emb.*.pkl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to run.nq.test.txt

python -m tevatron.utils.format.convert_result_to_trec \
  --input run.nq.test.txt \
  --output run.nq.test.trec

pip install pyserini

python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \
  --topics dpr-nq-test \
  --index wikipedia-dpr \
  --input run.nq.test.trec \
  --output run.nq.test.json

python -m pyserini.eval.evaluate_dpr_retrieval \
  --retrieval run.nq.test.json \
  --topk 20 100

I think it would be nicer if all of this could be reduced to one or two commands:

pip install pyserini

python -m tevatron.driver.evaluate \
    --output_dir "temp" \
    --model_name_or_path "model_nq" \
    ...
    --query_dataset "Tevatron/wikipedia-nq/test" \
    --passage_dataset "Tevatron/wikipedia-nq-corpus" \
    --save_ranking_to "nq_results/test/" \
    --encode_method "faiss" \
    --save_format "trec" "pyserini_dpr"  # save in both .trec and .json

python -m pyserini.eval.evaluate_dpr_retrieval \
                --retrieval "nq_results/test/run.json" \
                --topk 20 100
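
As a rough illustration of what such a wrapper might do internally, here is a minimal sketch that only assembles the existing commands in order. The function name build_eval_commands, its parameters, and the output file names under out_dir are hypothetical and not part of Tevatron; the flags are the ones from the manual steps above.

```python
import sys


def build_eval_commands(model, corpus, queries, out_dir, num_shards=20, depth=100):
    """Return the argv lists for the manual steps, in order (hypothetical helper)."""
    py = [sys.executable, "-m"]
    # Arguments shared by the corpus and query encoding steps.
    encode = py + [
        "tevatron.driver.encode",
        "--output_dir=temp",
        "--model_name_or_path", model,
        "--fp16",
        "--per_device_eval_batch_size", "156",
    ]
    commands = []
    # One encode call per corpus shard, mirroring the shell for loop.
    for i in range(num_shards):
        shard = f"{i:02d}"
        commands.append(encode + [
            "--dataset_name", corpus,
            "--encoded_save_path", f"{out_dir}/corpus_emb.{shard}.pkl",
            "--encode_num_shard", str(num_shards),
            "--encode_shard_index", shard,
        ])
    # Query encoding.
    commands.append(encode + [
        "--dataset_name", queries,
        "--encoded_save_path", f"{out_dir}/query_emb.pkl",
        "--encode_is_qry",
    ])
    # Retrieval over all shards, then conversion to TREC format.
    commands.append(py + [
        "tevatron.faiss_retriever",
        "--query_reps", f"{out_dir}/query_emb.pkl",
        "--passage_reps", f"{out_dir}/corpus_emb.*.pkl",
        "--depth", str(depth),
        "--batch_size", "-1",
        "--save_text",
        "--save_ranking_to", f"{out_dir}/run.txt",
    ])
    commands.append(py + [
        "tevatron.utils.format.convert_result_to_trec",
        "--input", f"{out_dir}/run.txt",
        "--output", f"{out_dir}/run.trec",
    ])
    return commands
```

Each returned list could then be passed to subprocess.run(cmd, check=True) so that the pipeline stops at the first failing step. Keeping command construction separate from execution would also make it easy to print the commands for a dry run.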

Note that this uses a new tevatron.driver.evaluate module so that driver.encode stays low-level and backward compatible, while evaluate covers higher-level uses such as reproducing results. Moreover, tevatron.driver.evaluate could raise an error if pyserini is not available, e.g.:

ImportError: could not import pyserini, a library needed to save as format "pyserini_dpr". Please install with `pip install pyserini`
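
A check like this could be done before any encoding starts, so the user does not wait for a long run only to fail at the save step. The sketch below is one possible way to do it; the function check_save_formats and the OPTIONAL_DEPENDENCIES mapping are hypothetical names, not existing Tevatron code.

```python
import importlib.util

# Hypothetical mapping from a save format to the optional package it needs.
OPTIONAL_DEPENDENCIES = {"pyserini_dpr": "pyserini"}


def check_save_formats(save_formats, dependencies=OPTIONAL_DEPENDENCIES):
    """Raise ImportError if a requested format needs a package that is not installed."""
    for fmt in save_formats:
        package = dependencies.get(fmt)
        # find_spec returns None when a top-level package cannot be found.
        if package is not None and importlib.util.find_spec(package) is None:
            raise ImportError(
                f'could not import {package}, a library needed to save as '
                f'format "{fmt}". Please install with `pip install {package}`'
            )
```

Formats without an entry in the mapping (such as "trec" here) pass through without any check.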

MXueguang commented 2 years ago

Hi @xhluca, thanks for the suggestion. One reason we keep the encoding process separate is to keep it flexible with respect to tasks (e.g. NQ/MSMARCO) and GPU/RAM resources. I agree that the DPR evaluation process could be simpler; maybe we can add a simpler DPR evaluation to pyserini. I'll take a look.

Xueguang