Reproduction of the results

Hi,

For the TREC-CAR part I have re-run your implementation on Colab.

First of all, I noticed that the Colab implementation uses 10k warmup steps and 100k total training steps, whereas the original implementation in this repository uses 40k/400k.

When I ran the code with the 10k/100k setting, I obtained a MAP score of 0.331. I then ran the same Colab code with the 40k/400k setting and obtained 0.339. The expected MAP score is 0.336.

After that, I ran the official evaluation tool (Anserini toolkit) on the bert_predictions_test.run file generated with the 40k/400k setting. This gave a MAP score of 0.333.
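To make the comparison concrete, here is a minimal sketch of how MAP can be computed from a run file and a qrels file. It is not the Colab evaluation code and not Anserini's or trec_eval's implementation. It assumes the standard six-column TREC run format and four-column qrels format, and it orders documents by the rank column. All function names and the `average_over` parameter are my own, for illustration only. One thing it shows is that the score changes depending on whether queries with relevant documents but no retrieved results are included in the average, which might be one source of a gap between two evaluation tools (I have not verified that this is the cause here).

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines of the form: 'query_id 0 doc_id relevance'."""
    qrels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, doc_id, rel = parts
        if int(rel) > 0:
            qrels[qid].add(doc_id)
    return dict(qrels)


def parse_run(lines):
    """Parse run lines of the form: 'query_id Q0 doc_id rank score tag'."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, doc_id, rank, _, _ = parts
        run[qid].append((int(rank), doc_id))
    # Assumption: order by the rank column (other tools may order by score).
    return {qid: [doc for _, doc in sorted(docs)] for qid, docs in run.items()}


def average_precision(ranked_docs, relevant_docs):
    """Average precision of one ranked list against a set of relevant docs."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant_docs)


def mean_average_precision(run, qrels, average_over="qrels"):
    """MAP over all judged queries ('qrels') or only retrieved ones ('run')."""
    if average_over == "qrels":
        query_ids = [qid for qid in qrels if qrels[qid]]
    else:
        query_ids = [qid for qid in run if qrels.get(qid)]
    if not query_ids:
        return 0.0
    scores = [average_precision(run.get(qid, []), qrels[qid]) for qid in query_ids]
    return sum(scores) / len(scores)
```

With real files, the same functions can be called as `parse_run(open("bert_predictions_test.run"))` and `parse_qrels(open(qrels_path))`, where `qrels_path` is a placeholder for wherever the test qrels are stored.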

My question is: is this deviation normal? I used the pre-trained model shared in this repository as the initial model, so there shouldn't be any randomness in the initialization. Did you also observe this much deviation in the MAP score?

Thanks in advance,

nyu-dl / dl4marco-bert
