texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

Question about reproducing coCondenser-nq #38

Closed Facico closed 2 years ago

Facico commented 2 years ago

Hi, @luyug.

Thanks for your awesome work and the detailed guidelines. I reproduced the model following the coCondenser-nq [README](https://github.com/texttron/tevatron/tree/main/examples/coCondenser-nq), but I got the following results (evaluated with Pyserini):

```text
Top5    accuracy: 0.3526315789473684
Top20   accuracy: 0.47700831024930745
Top100  accuracy: 0.5833795013850416
```

I think I made a mistake in one of the steps, since the results are lower than the BM25 baseline. I executed the following scripts in sequence to train the model (the co-condenser-wiki model was downloaded from Hugging Face).

```bash
#prepare_data.sh

nq_train_path="/data2/private/xxx/DPR/downloads/data/retriever/nq-train.json" #biencoder-nq-train.json
output_path="/data2/private/xxx/condenser/nq-train/bm25.bert.json"
model_path="/data2/private/xxx/model/co-condenser-wiki"
hn_path="/data2/private/xxx/condenser/hn.json"
output_hn_path="/data2/private/xxx/condenser/nq-train/hn.bert.json"
python prepare_wiki_train.py --input $nq_train_path --output $output_path --tokenizer $model_path

python prepare_wiki_train.py --input $hn_path --output $output_hn_path --tokenizer $model_path
```

```bash
#train_nq.sh
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
output_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
CUDA_VISIBLE_DEVICES=2,3,4,5 python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train \
  --output_dir $output_path \
  --model_name_or_path $CONDENSER_MODEL_NAME \
  --cache_dir $cache_path \
  --do_train \
  --save_steps 10000 \
  --train_dir $train_path \
  --fp16 \
  --per_device_train_batch_size 32 \
  --train_n_passages 2 \
  --learning_rate 5e-6 \
  --q_max_len 32 \
  --p_max_len 256 \
  --num_train_epochs 40 \
  --negatives_x_device \
  --positive_passage_no_shuffle \
  --untie_encoder \
  --grad_cache \
  --gc_p_chunk_size 24 \
  --gc_q_chunk_size 8
```

```bash
#encode_emb_passage.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model

echo $1 #  $1 is the id of GPU
for s in $(seq -f "%02g" $2 $3) # 0 - 19
do
CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --cache_dir $cache_path \
  --model_name_or_path $model_path/checkpoint-40000/passage_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --p_max_len 256 \
  --dataset_proc_num 8 \
  --encode_in_path $wiki_dir/docs$s.json \
  --encoded_save_path $emb_nq_path/$s.pt \
  --encode_num_shard 20 \
  --passage_field_separator sep_token \
  --encode_shard_index $s
done
```

```bash
#encode_emb_query.sh

OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"
CONDENSER_MODEL_NAME="/data2/private/xxx/model/co-condenser-wiki"
train_path="/data2/private/xxx/condenser/nq-train/"
model_path="/data2/private/xxx/condenser/model_nq3/"
cache_path="/data2/private/xxx/condenser/.cache/"
emb_nq_path="/data2/private/xxx/condenser/embeddings-nq/"
emb_query_path="/data2/private/xxx/condenser/embeddings-nq-queries/"
query_path="/data2/private/xxx/condenser/nq-test-queries.json"
MODEL_DIR=nq-model

# query

CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --model_name_or_path $model_path/checkpoint-40000/query_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --q_max_len 32 \
  --dataset_proc_num 2 \
  --encode_in_path $query_path \
  --encoded_save_path $emb_query_path/query.pt \
  --encode_is_qry
```

```bash
#inference.sh

ENCODE_QRY_DIR="/data2/private/xxx/condenser/embeddings-nq-queries/"
ENCODE_DIR="/data2/private/xxx/condenser/embeddings-nq/"
DEPTH=200
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
OUTDIR="./temp"
wiki_dir="/data2/private/xxx/condenser/wikipedia-corpus" 
#"/data2/private/xxx/DPR/downloads/psgs_w100.tsv"

MODEL_DIR=nq-model
python -m tevatron.faiss_retriever \
--query_reps $ENCODE_QRY_DIR/query.pt \
--passage_reps $ENCODE_DIR/'*.pt' \
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to $RUN
```

```bash
#eval.sh
RUN="/data2/private/xxx/condenser/run.nq.test.txt"
trec_out="/data2/private/xxx/condenser/run.nq.test.teIn"
json_out="/data2/private/xxx/condenser/run.nq.test.json"
python -m tevatron.utils.format.convert_result_to_trec \
    --input $RUN --output $trec_out

python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --topics dpr-nq-test \
                                                                --index wikipedia-dpr \
                                                                --input $trec_out \
                                                                --output $json_out

python -m pyserini.eval.evaluate_dpr_retrieval --retrieval $json_out \
    --topk 5 20 100
```

Is there any parameter that I set incorrectly?

Thanks!

Facico commented 2 years ago

I found that I had set up the dataset incorrectly: I was using the wikipedia-corpus.tar.gz from this [README](https://github.com/texttron/tevatron/blob/7a3a05914cbeb8158b2ab8fe4c5f9990e03ef834/examples/dpr/README.md#alternatives-train-dpr-with-our-self-contained-datasets). After replacing `--encode_in_path $wiki_dir/docs$s.json` with `--dataset_name Tevatron/wikipedia-nq-corpus`, the result is fine.
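
For reference, here is a minimal sketch of what the passage-encoding loop in encode_emb_passage.sh looks like after that change, reusing the variables from the script above (the paths and the checkpoint name are specific to my setup, so treat them as placeholders):

```bash
# Encode passages from the Tevatron/wikipedia-nq-corpus dataset on the Hugging Face hub
# instead of the local $wiki_dir/docs$s.json shards.
for s in $(seq -f "%02g" 0 19)
do
CUDA_VISIBLE_DEVICES=$1 python -m tevatron.driver.encode \
  --output_dir=$OUTDIR \
  --model_name_or_path $model_path/checkpoint-40000/passage_model \
  --tokenizer_name $model_path \
  --fp16 \
  --per_device_eval_batch_size 64 \
  --p_max_len 256 \
  --dataset_name Tevatron/wikipedia-nq-corpus \
  --encoded_save_path $emb_nq_path/$s.pt \
  --encode_num_shard 20 \
  --encode_shard_index $s
done
```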

tecmry commented 2 years ago

Hi, I would like to ask about your experimental results on the NQ dataset using BM25 negative samples.

Facico commented 2 years ago

> Hi, I would like to ask about your experimental results on the NQ dataset using BM25 negative samples.

Sorry, my experiment uses both BM25 negatives and hard negatives. The results are similar to those reported in the paper (ours on the NQ test set: R@20 84.9, R@100 89.5).

tecmry commented 2 years ago

Thanks for your reply. Is that result obtained by following the steps of using BM25 negatives in the first stage and hard negatives in the second stage?

Facico commented 2 years ago

Yes, my settings are basically the same as here: https://github.com/texttron/tevatron/tree/main/examples/coCondenser-nq

tecmry commented 2 years ago

Hi, I found that the numbers of BM25 negative samples and hard negative samples provided in the NQ example are different. Is this reasonable?

Facico commented 2 years ago

> Hi, I found that the numbers of BM25 negative samples and hard negative samples provided in the NQ example are different. Is this reasonable?

Of course. A hard negative is a passage that scores highly against the query, but such passages may not exist for some queries (the model already performs well on those queries, so they don't need hard negative samples).

tecmry commented 2 years ago

> > Hi, I found that the numbers of BM25 negative samples and hard negative samples provided in the NQ example are different. Is this reasonable?
>
> Of course. A hard negative is a passage that scores highly against the query, but such passages may not exist for some queries (the model already performs well on those queries, so they don't need hard negative samples).

Thank you for your answer, but I found that biencoder-nq-train.json has 58880 questions while hn.json has 70076 questions. I don't know how those extra questions were generated.
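
For reference, the counts can be checked with something like this (assuming both files are JSON arrays of question records in the DPR format; adjust the paths to wherever the files live):

```bash
# Count the number of question entries in each file
# (assumes each file is a single JSON array; paths are placeholders).
python -c "import json; print(len(json.load(open('biencoder-nq-train.json'))))"
python -c "import json; print(len(json.load(open('hn.json'))))"
```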