Simplifying evaluation process

Right now, it's possible to train DPR in a single command via the tevatron.driver.train module. Evaluation, however, requires a more complex series of commands (including a shell for loop over the corpus shards), e.g. for DPR on NQ:

mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq-corpus \
  --encoded_save_path corpus_emb.$s.pkl \
  --encode_num_shard 20 \
  --encode_shard_index $s
done

python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_nq \
  --fp16 \
  --per_device_eval_batch_size 156 \
  --dataset_name Tevatron/wikipedia-nq/test \
  --encoded_save_path query_emb.pkl \
  --encode_is_qry

python -m tevatron.faiss_retriever \
  --query_reps query_emb.pkl \
  --passage_reps 'corpus_emb.*.pkl' \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to run.nq.test.txt

python -m tevatron.utils.format.convert_result_to_trec \
              --input run.nq.test.txt \
              --output run.nq.test.trec

pip install pyserini

python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run \
              --topics dpr-nq-test \
              --index wikipedia-dpr \
              --input run.nq.test.trec \
              --output run.nq.test.json

python -m pyserini.eval.evaluate_dpr_retrieval \
                --retrieval run.nq.test.json \
                --topk 20 100

I think it would be nicer if all of this could be reduced to one or two commands:

pip install pyserini

python -m tevatron.driver.evaluate \
    --output_dir "temp" \
    --model_name_or_path "model_nq" \
    ...
    --query_dataset "Tevatron/wikipedia-nq/test" \
    --passage_dataset "Tevatron/wikipedia-nq-corpus" \
    --save_ranking_to "nq_results/test/" \
    --encode_method "faiss" \
    --save_format "trec" "pyserini_dpr"  # save in both .trec and .json

python -m pyserini.eval.evaluate_dpr_retrieval \
                --retrieval "nq_results/test/run.json" \
                --topk 20 100
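
To make the idea concrete, here is a rough sketch of how such a driver could assemble the existing steps internally. It only builds the argument lists for the commands shown above; `build_eval_commands` and its parameters are hypothetical names used for illustration and are not part of Tevatron, and the output file names under `out_dir` are assumptions.

```python
import sys


def build_eval_commands(model, corpus_dataset, query_dataset, out_dir,
                        num_shards=20, depth=100, batch_size=156):
    """Return argv lists for the encode / retrieve / convert steps.

    Hypothetical helper: it mirrors the manual commands from this issue
    and does not call any Tevatron Python API.
    """
    py = [sys.executable, "-m"]
    common = [
        "--output_dir", "temp",
        "--model_name_or_path", model,
        "--fp16",
        "--per_device_eval_batch_size", str(batch_size),
    ]
    cmds = []
    # One encode command per corpus shard (replaces the shell for loop).
    for s in range(num_shards):
        cmds.append(py + ["tevatron.driver.encode"] + common + [
            "--dataset_name", corpus_dataset,
            "--encoded_save_path", f"{out_dir}/corpus_emb.{s:02d}.pkl",
            "--encode_num_shard", str(num_shards),
            "--encode_shard_index", str(s),
        ])
    # Encode the queries.
    cmds.append(py + ["tevatron.driver.encode"] + common + [
        "--dataset_name", query_dataset,
        "--encoded_save_path", f"{out_dir}/query_emb.pkl",
        "--encode_is_qry",
    ])
    # Retrieve; the glob is passed through literally, as in the quoted shell form.
    cmds.append(py + ["tevatron.faiss_retriever",
        "--query_reps", f"{out_dir}/query_emb.pkl",
        "--passage_reps", f"{out_dir}/corpus_emb.*.pkl",
        "--depth", str(depth),
        "--batch_size", "-1",
        "--save_text",
        "--save_ranking_to", f"{out_dir}/run.txt",
    ])
    # Convert the ranking to TREC format.
    cmds.append(py + ["tevatron.utils.format.convert_result_to_trec",
        "--input", f"{out_dir}/run.txt",
        "--output", f"{out_dir}/run.trec",
    ])
    return cmds
```

Each returned list could then be executed in order with `subprocess.run(cmd, check=True)`, so a failure in any step stops the pipeline.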

Adding a separate tevatron.driver.evaluate module would keep tevatron.driver.encode low-level and backward compatible, while evaluate would serve higher-level uses such as reproducing results. Moreover, tevatron.driver.evaluate could raise an error if pyserini is not available, e.g.:

ImportError: could not import pyserini, a library needed to save as format "pyserini_dpr". Please install with `pip install pyserini`
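
A minimal sketch of how that check might look, using the standard library's `importlib.util.find_spec` so the package is not actually imported until needed. The `FORMAT_REQUIREMENTS` mapping and `check_format_dependencies` function are hypothetical names for illustration, not existing Tevatron code.

```python
import importlib.util

# Hypothetical mapping from a --save_format value to the package it requires.
FORMAT_REQUIREMENTS = {"pyserini_dpr": "pyserini"}


def check_format_dependencies(save_formats, requirements=FORMAT_REQUIREMENTS):
    """Raise ImportError early if a requested format needs a missing package."""
    for fmt in save_formats:
        package = requirements.get(fmt)
        # find_spec returns None when a top-level package is not installed.
        if package is not None and importlib.util.find_spec(package) is None:
            raise ImportError(
                f"could not import {package}, a library needed to save as "
                f'format "{fmt}". Please install with `pip install {package}`'
            )
```

Calling this right after argument parsing, e.g. `check_format_dependencies(["trec", "pyserini_dpr"])`, would fail before any encoding starts rather than after hours of work.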
