nyu-dl / dl4marco-bert

BSD 3-Clause "New" or "Revised" License

Passage Re-ranking with BERT

Introduction

***** Most of the code in this repository was copied from the original BERT repository.*****

This repository contains the code to reproduce our entry to the MSMARCO passage ranking task, which placed first by a large margin over the second-place entry. It also contains the code to reproduce our result on the TREC-CAR dataset, which is ~22 MAP points higher than both the best entry from 2017 and a well-tuned BM25 baseline.

| MSMARCO Passage Re-Ranking Leaderboard (Jan 8th 2019) | Eval MRR@10 | Dev MRR@10 |
|---|---|---|
| 1st Place - BERT (this code) | 35.87 | 36.53 |
| 2nd Place - IRNet | 28.06 | 27.80 |
| 3rd Place - Conv-KNRM | 27.12 | 29.02 |

| TREC-CAR Test Set (Automatic Annotations) | MAP |
|---|---|
| BERT (this code) | 33.5 |
| BM25 Anserini | 15.6 |
| MacAvaney et al., 2017 (TREC-CAR 2017 Best Entry) | 14.8 |

The paper describing our implementation is here.

Data

We have made the following data available:

| File | Description | Size | MD5 |
|---|---|---|---|
| BERT_Large_trained_on_MSMARCO.zip | BERT-large trained on MS MARCO | 3.4 GB | 2616f874cdabadafc55626035c8ff8e8 |
| BERT_Base_trained_on_MSMARCO.zip | BERT-base trained on MS MARCO | 1.1 GB | 7a8c621e01c127b55dbe511812c34910 |
| MSMARCO_tfrecord.tar.gz | MS MARCO TF Records | 9.1 GB | c15d80fe9a56a2fb54eb7d94e2cfa4ef |
| BERT_Large_dev_run.tsv | BERT-large run dev set (~6980 queries x 1000 docs per query) | 121 MB | bcbbe19bcb2549dea3f26168c2bc445b |
| BERT_Large_test_run.tsv | BERT-large run test set (~6836 queries x 1000 docs per query) | 119 MB | 9779903606e5b545f491132d8c2cf292 |
| BERT_Large_trained_on_TREC_CAR.tar.gz | BERT-large trained on TREC-CAR | 3.4 GB | 8baedd876935093bfd2bdfa66f2279bc |
| BERT_Large_pretrained_on_TREC_CAR... | BERT-large pretrained on TREC-CAR's training set for 1M iterations | 3.4 GB | 9c6f2f8dbf9825899ee460ee52423b84 |
| treccar_files.tar.gz | TREC-CAR queries, qrels, runs, and TF Records | 4.0 GB | 4e6b5580e0b2f2c709d76ac9c7e7f362 |
| bert_predictions_test.run.tar.gz | TREC-CAR 2017 Automatic Run reranked by BERT-Large | 71 MB | d5c135c6cf5a6d25199bba29d43b58ba |
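
Since several of these files are multiple gigabytes, it can be worth checking a download against the MD5 listed above before extracting it. The following is a minimal sketch using only the Python standard library; the helper names and the local path `data/BERT_Base_trained_on_MSMARCO.zip` are assumptions made for illustration and are not part of this repository.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.lower()


if __name__ == "__main__":
    # Hypothetical local path; the digest is copied from the table above.
    archive = Path("data/BERT_Base_trained_on_MSMARCO.zip")
    if archive.exists():
        ok = verify_download(archive, "7a8c621e01c127b55dbe511812c34910")
        print("checksum OK" if ok else "checksum MISMATCH")
```

A mismatch usually indicates an interrupted or corrupted download, in which case the file should be fetched again.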

MS MARCO

Download and extract the data

First, we need to download and extract MS MARCO and BERT files:

DATA_DIR=./data
mkdir ${DATA_DIR}

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/top1000.eval.tar.gz -P ${DATA_DIR}
wget https://msmarco.blob.core.windows.net/msmarcoranking/qrels.dev.small.tsv -P ${DATA_DIR}
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-24_H-1024_A-16.zip -P ${DATA_DIR}

tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.dev.tar.gz -C ${DATA_DIR}
tar -xvf ${DATA_DIR}/top1000.eval.tar.gz -C ${DATA_DIR}
unzip ${DATA_DIR}/uncased_L-24_H-1024_A-16.zip -d ${DATA_DIR}

Convert MS MARCO to TFRecord format

Next, we need to convert MS MARCO train, dev, and eval files to TFRecord files, which will be later consumed by BERT.

mkdir ${DATA_DIR}/tfrecord
python convert_msmarco_to_tfrecord.py \
  --output_folder=${DATA_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --train_dataset_path=${DATA_DIR}/triples.train.small.tsv \
  --dev_dataset_path=${DATA_DIR}/top1000.dev.tsv \
  --eval_dataset_path=${DATA_DIR}/top1000.eval.tsv \
  --dev_qrels_path=${DATA_DIR}/qrels.dev.small.tsv \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_eval_docs=1000

This conversion takes 30-40 hours. Alternatively, you may download the TFRecord files here (~23GB).
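
To give a feel for what the two length flags control, here is a rough sketch of how a tokenized query and passage could be packed into one BERT input. It assumes the usual BERT sentence-pair layout (`[CLS] query [SEP] passage [SEP]`) and that `max_query_length` counts the special tokens; `pack_pair` is a hypothetical helper, and the actual conversion script may differ in these details.

```python
def pack_pair(query_tokens, passage_tokens,
              max_query_length=64, max_seq_length=512):
    """Build [CLS] query [SEP] passage [SEP] within the two length limits."""
    # Query part: [CLS] + query + [SEP], capped at max_query_length tokens
    # (assumption: the cap includes the two special tokens).
    query_part = ["[CLS]"] + query_tokens[: max_query_length - 2] + ["[SEP]"]
    # Passage part: fill the remaining room, reserving one slot for [SEP].
    room = max_seq_length - len(query_part) - 1
    passage_part = passage_tokens[:room] + ["[SEP]"]
    tokens = query_part + passage_part
    # Segment ids: 0 for the query part, 1 for the passage part.
    segment_ids = [0] * len(query_part) + [1] * len(passage_part)
    return tokens, segment_ids


if __name__ == "__main__":
    tokens, segments = pack_pair(["what", "is", "bert"], ["bert", "is", "a", "model"])
    print(tokens)
    print(segments)
```

With the values used in the command above, a long query is cut to 64 positions and the passage receives whatever remains of the 512-token budget.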

Training

We can now start training. We highly recommend using the free TPUs in our Google Colab: a modern V100 GPU with 16GB of memory cannot fit even a batch size of 2 when training a BERT Large model.

If you opt not to use the Colab, here is the command line to start training:

python run_msmarco.py \
  --data_dir=${DATA_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --output_dir=${DATA_DIR}/output \
  --msmarco_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --train_batch_size=128 \
  --eval_batch_size=128 \
  --learning_rate=3e-6

Training for 100k iterations takes approximately 30 hours on a TPU v3. Alternatively, you can download the trained model used in our submission here (~3.4GB).
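
The warmup and step flags above interact with the learning rate. As a rough illustration, the sketch below reproduces the schedule used by the original BERT optimizer (linear warmup followed by linear decay to zero). It assumes this repository keeps that schedule unchanged, which the command line alone does not confirm; `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(step, init_lr=3e-6,
                     num_train_steps=100_000, num_warmup_steps=10_000):
    """Learning rate at a given global step under warmup + linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to init_lr.
        return init_lr * step / num_warmup_steps
    # Decay: linear from init_lr at step 0 down to 0 at num_train_steps,
    # evaluated at the current step (clamped to the end of training).
    progress = min(step, num_train_steps) / num_train_steps
    return init_lr * (1.0 - progress)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 50_000, 100_000):
        print(step, learning_rate_at(step))
```

Under this assumption, the rate peaks just below 3e-6 at the end of warmup and reaches zero at step 100,000.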

You can also download a BERT Base model trained on MS MARCO here. Its MRR@10 is ~2 points lower (34.7), but it is faster to train and evaluate, and it fits on a single 12GB GPU.

TREC-CAR

We describe in the next sections how to reproduce our results on the TREC-CAR dataset.

Downloading qrels, run and TFRecord files

The next steps (Indexing, Retrieval, and TFRecord conversion) take many hours. Alternatively, you can skip them and download the files needed for training and evaluation here (~4.0GB), namely the TREC-CAR queries, qrels, runs, and TFRecord files.

After downloading, you need to extract them to the TRECCAR_DIR folder:

TRECCAR_DIR=./treccar/
mkdir -p ${TRECCAR_DIR}
tar -xf treccar_files.tar.gz --directory ${TRECCAR_DIR}

You are then ready to go to the Training/Evaluating section.

Downloading and Extracting the data

If you decide to index, retrieve, and convert to the TFRecord format yourself, you first need to download and extract the TREC-CAR data:

TRECCAR_DIR=./treccar/
DATA_DIR=./data
mkdir -p ${DATA_DIR} ${TRECCAR_DIR}

wget http://trec-car.cs.unh.edu/datareleases/v2.0/paragraphCorpus.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/train.v2.0.tar.xz -P ${TRECCAR_DIR}
wget http://trec-car.cs.unh.edu/datareleases/v2.0/benchmarkY1-test.v2.0.tar.xz -P ${TRECCAR_DIR}
wget https://storage.googleapis.com/bert_treccar_data/pretrained_models/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -P ${DATA_DIR}

tar -xf ${TRECCAR_DIR}/paragraphCorpus.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/train.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xf ${TRECCAR_DIR}/benchmarkY1-test.v2.0.tar.xz -C ${TRECCAR_DIR}
tar -xzf ${DATA_DIR}/BERT_Large_pretrained_on_TREC_CAR_training_set_1M_iterations.tar.gz -C ${DATA_DIR}

Indexing TREC-CAR

We need to index the corpus and retrieve documents using the BM25 algorithm for each query so we have query-document pairs for training.

We index the TREC-CAR corpus using Anserini, an excellent toolkit for information retrieval research.

First, we need to install Maven, and clone and compile Anserini's repository:

sudo apt-get install maven
git clone --recurse-submodules https://github.com/castorini/Anserini.git
cd Anserini
mvn clean package appassembler:assemble
tar xvfz tools/eval/trec_eval.9.0.4.tar.gz -C tools/eval/ && cd tools/eval/trec_eval.9.0.4 && make
cd ../ndeval && make

Now we can index the corpus (.cbor files):

sh Anserini/target/appassembler/bin/IndexCollection -collection CarCollection \
  -generator DefaultLuceneDocumentGenerator -threads 40 -input ./paragraphCorpus.v2.0 \
  -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs \
  -storePositions -storeDocvectors -storeRawDocs

You should see a message like this after it finishes:

2019-01-15 20:26:28,742 INFO  [main] index.IndexCollection (IndexCollection.java:578) - Total 29,794,689 documents indexed in 03:20:35

Retrieving pairs of query-candidate document

We now retrieve candidate documents for each query using the BM25 algorithm. But first, we need to convert the TREC-CAR files to a format that Anserini can consume.

We start by merging qrels folds 0, 1, 2, and 3 into a single file for training. Fold 4 will be the dev set.

for f in ${TRECCAR_DIR}/train/fold-[0-3]-base.train.cbor-hierarchical.qrels; do (cat "${f}"; echo); done >${TRECCAR_DIR}/train.qrels
cp ${TRECCAR_DIR}/train/fold-4-base.train.cbor-hierarchical.qrels ${TRECCAR_DIR}/dev.qrels
cp ${TRECCAR_DIR}/benchmarkY1/benchmarkY1-test/test.pages.cbor-hierarchical.qrels ${TRECCAR_DIR}/test.qrels

We need to extract the queries (first column in the space-separated files):

cat ${TRECCAR_DIR}/train.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/train.topics
cat ${TRECCAR_DIR}/dev.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/dev.topics
cat ${TRECCAR_DIR}/test.qrels | cut -d' ' -f1 > ${TRECCAR_DIR}/test.topics

And remove all duplicated queries:

sort -u -o ${TRECCAR_DIR}/train.topics ${TRECCAR_DIR}/train.topics
sort -u -o ${TRECCAR_DIR}/dev.topics ${TRECCAR_DIR}/dev.topics
sort -u -o ${TRECCAR_DIR}/test.topics ${TRECCAR_DIR}/test.topics
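
For readers who prefer to do the extraction and de-duplication in one pass, the two shell steps above can be sketched in Python as follows. `extract_topics` is a hypothetical helper, not part of this repository. It splits on any whitespace rather than a single space, skips blank lines (such as those the merge step leaves between folds), and sorts by code point, so the order may differ from `sort -u` under some locales.

```python
def extract_topics(qrels_lines):
    """Return the sorted, de-duplicated query ids (first column) of a qrels file."""
    topics = set()
    for line in qrels_lines:
        fields = line.split()
        if fields:  # skip blank lines, e.g. those left between merged folds
            topics.add(fields[0])
    return sorted(topics)


if __name__ == "__main__":
    # Toy qrels lines in "query_id 0 doc_id relevance" form (made-up ids).
    sample = [
        "enwiki:B/History 0 doc3 1\n",
        "\n",
        "enwiki:A/Overview 0 doc1 1\n",
        "enwiki:B/History 0 doc7 1\n",
    ]
    for topic in extract_topics(sample):
        print(topic)
```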

We now retrieve the top-10 documents per query for the training and development sets.

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/train.topics -output ${TRECCAR_DIR}/train.run -hits 10 -bm25 &

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/dev.topics -output ${TRECCAR_DIR}/dev.run -hits 10 -bm25 &

And we retrieve top-1,000 documents per query for the test set.

nohup target/appassembler/bin/SearchCollection -topicreader Car -index ${TRECCAR_DIR}/lucene-index.car17.pos+docvectors+rawdocs -topics ${TRECCAR_DIR}/test.topics -output ${TRECCAR_DIR}/test.run -hits 1000 -bm25 &

After it finishes, you should see an output message like this:

(SearchCollection.java:166) - [Finished] Ranking with similarity: BM25(k1=0.9,b=0.4)
2019-01-16 23:40:56,538 INFO  [pool-2-thread-1] search.SearchCollection$SearcherThread (SearchCollection.java:167) - Run 2254 topics searched in 01:53:32
2019-01-16 23:40:56,922 INFO  [main] search.SearchCollection (SearchCollection.java:499) - Total run time: 01:53:36

This retrieval step takes 40-80 hours for the training set. We can speed it up by increasing the number of threads (e.g., -threads 6) and loading the index into memory (-inmem option).

Measuring BM25 Performance (optional)

To be sure that indexing and retrieval worked correctly, we can measure the performance of the list of documents retrieved with BM25:

eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/test.run

It is important to use the -c option as it assigns a score of zero to queries that had no passage returned. The output should be like this:

map                     all 0.1528
recip_rank              all 0.2294
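
To illustrate why the -c option matters, the sketch below computes MAP and mean reciprocal rank from in-memory qrels and run dictionaries, either averaging over every query in the qrels (the behaviour described above for -c) or only over queries that appear in the run. It is a simplified stand-in for trec_eval, not a replacement: the function names are hypothetical, and it assumes every query has at least one relevant document.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked_docs, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(qrels, run, complete=True):
    """qrels: qid -> set of relevant ids; run: qid -> ranked list of ids.

    complete=True averages over all qrels queries, scoring queries that are
    missing from the run as zero; complete=False averages only over queries
    present in the run.
    """
    queries = list(qrels) if complete else [q for q in qrels if q in run]
    if not queries:
        return {"map": 0.0, "recip_rank": 0.0}
    ap = [average_precision(run.get(q, []), qrels[q]) for q in queries]
    rr = [reciprocal_rank(run.get(q, []), qrels[q]) for q in queries]
    return {"map": sum(ap) / len(queries), "recip_rank": sum(rr) / len(queries)}


if __name__ == "__main__":
    qrels = {"q1": {"a"}, "q2": {"b"}}
    run = {"q1": ["x", "a"]}  # q2 returned no passages
    print(evaluate(qrels, run, complete=True))
    print(evaluate(qrels, run, complete=False))
```

In the toy example, leaving out the query with no results doubles both scores, which is the kind of inflation the -c option avoids.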

Converting TREC-CAR to TFRecord

We can now convert the qrels (query-relevant document pairs), the runs (query-candidate document pairs), and the corpus into training, dev, and test TFRecord files that will be consumed by BERT. This requires the CBOR package (pip install cbor).

python convert_treccar_to_tfrecord.py \
  --output_folder=${TRECCAR_DIR}/tfrecord \
  --vocab_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/vocab.txt \
  --corpus=${TRECCAR_DIR}/paragraphCorpus/dedup.articles-paragraphs.cbor \
  --qrels_train=${TRECCAR_DIR}/train.qrels \
  --qrels_dev=${TRECCAR_DIR}/dev.qrels \
  --qrels_test=${TRECCAR_DIR}/test.qrels \
  --run_train=${TRECCAR_DIR}/train.run \
  --run_dev=${TRECCAR_DIR}/dev.run \
  --run_test=${TRECCAR_DIR}/test.run \
  --max_query_length=64 \
  --max_seq_length=512 \
  --num_train_docs=10 \
  --num_dev_docs=10 \
  --num_test_docs=1000

This step requires at least 64GB of RAM as we load the entire corpus into memory.

Training/Evaluating

Before you start training, you need to download a BERT Large model pretrained on the training set of TREC-CAR. This pretraining was necessary because the official pre-trained BERT models were pre-trained on the full Wikipedia, and therefore they have seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. Thus, to avoid this leak of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR’s training set.

As with MS MARCO training, we made this Google Colab available to train and evaluate on TREC-CAR.

If you opt not to use the Colab, here is the command line to start training:

python run_treccar.py \
  --data_dir=${TRECCAR_DIR}/tfrecord \
  --bert_config_file=${DATA_DIR}/uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=${DATA_DIR}/pretrained_models_exp898_model.ckpt-1000000 \
  --output_dir=${TRECCAR_DIR}/output \
  --trec_output=True \
  --do_train=True \
  --do_eval=True \
  --num_train_steps=400000 \
  --num_warmup_steps=40000 \
  --train_batch_size=32 \
  --eval_batch_size=32 \
  --learning_rate=1e-6 \
  --max_dev_examples=3000 \
  --num_dev_docs=10 \
  --max_test_examples=None \
  --num_test_docs=1000

Because trec_output is set to True, this script will produce a TREC-formatted run file "bert_predictions_test.run". We can evaluate the final performance of our BERT model using the official TREC eval tool, which is included in Anserini:

eval/trec_eval.9.0.4/trec_eval -m map -m recip_rank -c ${TRECCAR_DIR}/test.qrels ${TRECCAR_DIR}/output/bert_predictions_test.run

And the output should be:

map                     all 0.3356
recip_rank              all 0.4787
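
For reference, trec_eval reads run files with six whitespace-separated columns: query id, the literal Q0, document id, rank, score, and a run tag. The sketch below shows how per-query scores could be written in that layout. It illustrates the standard format only and is not the code that run_treccar.py uses; `to_trec_run` and the tag "BERT" are hypothetical.

```python
def to_trec_run(scores, run_tag="BERT"):
    """Convert qid -> {doc_id: score} into TREC run lines.

    Each line has the form: qid Q0 doc_id rank score run_tag
    """
    lines = []
    for qid in sorted(scores):
        # Rank documents by descending score; ranks start at 1.
        ranked = sorted(scores[qid].items(), key=lambda item: item[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
    return lines


if __name__ == "__main__":
    toy_scores = {"q1": {"d1": 0.2, "d2": 0.9}}  # made-up ids and scores
    print("\n".join(to_trec_run(toy_scores)))
```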

Our run file is available here.

Trained models

You can download our BERT Large trained on TREC-CAR here.

How do I cite this work?

@article{nogueira2019passage,
  title={Passage Re-ranking with BERT},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  journal={arXiv preprint arXiv:1901.04085},
  year={2019}
}