stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

Trained DistilBERT-based Checkpoint #25

Closed · sebastian-hofstaetter closed this 3 years ago

sebastian-hofstaetter commented 3 years ago

Hi,

Thanks for this great model 🎉!

I just published a knowledge-distilled ColBERT checkpoint: https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco. It's based on a 6-layer DistilBERT and trained with our Margin-MSE distillation (https://arxiv.org/abs/2010.02666). It reaches up to .375 MRR@10 on MSMARCO-DEV and .744 NDCG@10 on TREC-DL'19 when re-ranking the top-1K BM25 results.

The model definition & training code we used (https://github.com/sebastian-hofstaetter/neural-ranking-kd/blob/main/minimal_colbert_usage_example.ipynb) is slightly different than what's in this repo, but if you are interested, maybe we could add our definition as another option so the checkpoint is easy to use? A rough sketch of the scoring idea is below.
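For anyone who wants to try the checkpoint before any merge, here is a rough sketch of ColBERT-style MaxSim scoring over per-token DistilBERT embeddings. This is my simplification for illustration only: it skips the query augmentation and punctuation masking of the original ColBERT, and loading the actual checkpoint may require the model class from the notebook linked above rather than plain `AutoModel`.

```python
# Rough sketch of ColBERT-style MaxSim scoring; a simplification, not the
# exact model definition from either repo.
import torch
from transformers import AutoModel, AutoTokenizer

class MaxSimScorer(torch.nn.Module):
    def __init__(self, base="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base)  # 6-layer DistilBERT

    def embed(self, batch):
        # Per-token embeddings, L2-normalized so dot products are cosine sims.
        emb = self.encoder(**batch).last_hidden_state
        return torch.nn.functional.normalize(emb, dim=-1)

    def forward(self, query_batch, doc_batch):
        Q = self.embed(query_batch)                # [B, |q|, 768]
        D = self.embed(doc_batch)                  # [B, |d|, 768]
        sims = Q @ D.transpose(1, 2)               # all token-token similarities
        # MaxSim: each query token keeps only its best-matching doc token.
        return sims.max(dim=-1).values.sum(dim=-1)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
q = tok(["what is neural search"], return_tensors="pt")
d = tok(["ColBERT is a late-interaction neural search model."], return_tensors="pt")
print(MaxSimScorer()(q, d))                        # one score per query-doc pair
```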

Best, Sebastian

okhat commented 3 years ago

Hey Sebastian!

Margin-MSE is awesome work---thanks for applying it to ColBERT and releasing the checkpoint! The results are impressive.

I have two thoughts. First, it would be great to test this out with end-to-end retrieval, but the d=768 embeddings enlarge the index by a factor of six compared with our default d=128. Second, we have aggressive quantization for ColBERT coming out very soon, which represents each vector with just 32 bytes, so maybe that will ease this a bit.
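For a sense of what 32 bytes per vector can look like, here's a small product-quantization sketch with FAISS. The 32 × 8-bit sub-quantizer layout is just one way to hit that budget, not necessarily what our quantization branch does:

```python
# One way to hit 32 bytes per vector: product quantization in FAISS.
# M=32 sub-quantizers x 8 bits each is an assumption for illustration.
import faiss
import numpy as np

d = 128                                  # ColBERT's default embedding dim
pq = faiss.ProductQuantizer(d, 32, 8)    # 32 codes x 1 byte = 32 bytes/vector

xb = np.random.rand(50_000, d).astype("float32")  # stand-in embeddings
pq.train(xb)
codes = pq.compute_codes(xb)
print(codes.shape, codes.nbytes // len(xb))       # (50000, 32) 32
```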

I will take a look at your links. A merge will be really cool!

sebastian-hofstaetter commented 3 years ago

Great :) Yes, I did try end-to-end retrieval, but FAISS did not like the 1 TB index, even on the largest server I have access to. Does your quantization work on an existing checkpoint? If not, I could also retrain a model with Margin-MSE to compress the output vectors to a smaller dimension.

okhat commented 3 years ago

I'm guessing you used FAISS with a large index type, maybe FlatL2 or HNSW.

For ColBERT, we use IVFPQ, which decreases the index size dramatically, and I've faced no issues with very large indexes (e.g., the full-document version of MS MARCO).
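As a concrete point of reference, a minimal IVFPQ index in FAISS looks like the sketch below; the nlist / M / nprobe values are placeholders I picked for illustration, not ColBERT's actual settings:

```python
# Minimal IVFPQ index in FAISS. nlist / M / nprobe here are illustrative
# placeholders, not the settings ColBERT ships with.
import faiss
import numpy as np

d = 128
nlist, M, nbits = 4096, 16, 8            # coarse cells; 16-byte PQ codes
quantizer = faiss.IndexFlatL2(d)         # coarse assignment by L2 distance
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)

xb = np.random.rand(200_000, d).astype("float32")
index.train(xb)                          # learns centroids + PQ codebooks
index.add(xb)

index.nprobe = 16                        # cells visited per query
scores, ids = index.search(xb[:5], 10)   # top-10 neighbors for 5 queries
```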

The only challenge I see is how to reconcile the two model definitions, since there are a couple of differences in the base model (DistilBERT) and in masking.

okhat commented 3 years ago

Hey @sebastian-hofstaetter !

I thought you may be interested to know about our new quantization branch. By default, it represents each vector in just 32 bytes. I generally get very similar results with this as with the full 128-dim embeddings, which take 256 bytes each.
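For the arithmetic behind those numbers (my reading; float16 storage for the full embeddings and 2 bits per dimension for the compressed ones are assumptions):

```python
# Byte math for the sizes above; the storage formats are assumptions.
dim = 128
full_bytes = dim * 2            # 128 dims x 2 bytes (float16) -> 256
quantized_bytes = dim * 2 // 8  # 128 dims x 2 bits            -> 32
print(full_bytes, quantized_bytes)
```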

littlewine commented 3 years ago

Maybe I am missing something, but is there a checkpoint of the original work's pretrained encoder (trained on MS MARCO) somewhere in this repo? Or do we have to train from scratch to use the model on a different collection?

On Hugging Face, I found https://huggingface.co/sebastian-hofstaetter/colbert-distilbert-margin_mse-T2-msmarco (thanks @sebastian-hofstaetter!) and also https://huggingface.co/vespa-engine/colbert-medium, which seems to be what I wanted (though not from the original authors). Maybe it would be good to add these to the README, @okhat?