Neural IR: Cross-Architecture Knowledge Distillation

Welcome 🙌 to the hub-repo of our paper:

Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan and Allan Hanbury

https://arxiv.org/abs/2010.02666

tl;dr We utilize an ensemble of BERT_CAT models (the vanilla BERT passage re-ranking model) to teach & improve a range of other more efficient architectures for (re-)ranking with a Margin-MSE loss. We publish the teacher training files for everyone to use here 🎉 We are sure the community can do very cool things with these training files 😊

If you have any questions, suggestions, or want to collaborate please don't hesitate to get in contact with us via Twitter or mail to s.hofstaetter@tuwien.ac.at

The knowledge distillation workflow; we provide the "Result Store" in this repo.

Please cite our work as:

@misc{hofstaetter2020_crossarchitecture_kd,
      title={Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation}, 
      author={Sebastian Hofst{\"a}tter and Sophia Althammer and Michael Schr{\"o}der and Mete Sertkan and Allan Hanbury},
      year={2020},
      eprint={2010.02666},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Pre-Trained Models

We provide the following full-trained 6 layer DistilBERT-based models (trained with Margin-MSE using a 3 teacher BERT_CAT ensemble (T2 in the paper) on MSMARCO-Passage) via the HuggingFace model hub:

BERT_DOT (re-ranking and dense-retrieval model): https://huggingface.co/sebastian-hofstaetter/distilbert-dot-margin_mse-T2-msmarco

The DistilBERT architecture should work as plug 'n play everywhere you use the transformer library. For a minimal usage example of the BERT_DOT model check out our notebook minimal_bert_dot_usage_example.ipynb here or
ColBERT (re-ranking and dense-retrieval model): https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco

For a minimal usage example of our ColBERT checkpoint check out our notebook minimal_colbert_usage_example.ipynb here or
BERT_CAT (re-ranking model): https://huggingface.co/sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco

For a minimal usage example of the BERT_CAT model check out our notebook minimal_bert_cat_usage_example.ipynb here or
PreTTR (re-ranking model): https://huggingface.co/sebastian-hofstaetter/prettr-distilbert-split_at_3-margin_mse-T2-msmarco

For a minimal usage example of our PreTTR checkpoint check out our notebook minimal_prettr_usage_example.ipynb here or

If you have a specific request for a pre-trained model from the paper, please create an issue here :)

Teacher Training Files (MSMARCO-Passage)

We publish the training files without the text content instead using the ids from MSMARCO; for the text content please download the files from the MSMARCO Github Page and use the helper script (teacher_id_to_text.py) in this repo to expand the id files with the fulltext, if necessary.

The teacher files (using the data from "Train Triples Small" with ~40 million triples) with the format pos_score neg_score query_id pos_passage_id neg_passage_id (with tab separation) are:

(T1) BERT-Base_CAT
(T2) Mean-Ensemble of BERT-Base_CAT + BERT-Large_CAT + ALBERT-Large_CAT

available at Zenodo: https://zenodo.org/record/4068216

Source Code

The full source-code for our paper is here, as part of our matchmaker library: https://github.com/sebastian-hofstaetter/matchmaker

We have getting started guides for training teachers & students, as well as a range of other possibilities surrounding the Margin-MSE loss.

Cross Architecture Knowledge Distillation

The latency of neural ranking models at query time is largely dependent on the architecture and deliberate choices by their designers to trade-off effectiveness for higher efficiency. This focus on low query latency of a rising number of efficient ranking architectures make them feasible for production deployment. In machine learning an increasingly common approach to close the effectiveness gap of more efficient models is to apply knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores in different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin focused loss (Margin-MSE), that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT ranking architectures. We apply the teachable information as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure of distilling knowledge from state-of-the-art concatenated BERT models to four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot product model). We show that across our evaluated architectures our Margin-MSE knowledge distillation significantly improves their effectiveness without compromising their efficiency.

Efficiency vs. Effectiveness Results from the paper Efficiency vs. effectiveness on two query sets and their main effectiveness metrics, T1 and T2 are the knowledge distillation trained models with BERT-Base_CAT (T1) and the ensemble (T2)

sebastian-hofstaetter / neural-ranking-kd

readme

Neural IR: Cross-Architecture Knowledge Distillation

Pre-Trained Models

Teacher Training Files (MSMARCO-Passage)

Source Code

Cross Architecture Knowledge Distillation