stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Training ColBERT via regression rather than classification #82

Closed: paul-chelarescu closed this issue 2 years ago

paul-chelarescu commented 2 years ago

Hi @okhat, thank you for open-sourcing ColBERT! I have a question about the training regime and the dataset used in this model.

As I understand it, ColBERT is trained via classification on the MS MARCO triples: sparse samples of query-passage positive pairs marked by a human judge, each paired with a negative passage that is sampled at random (or sampled more cleverly, as a hard negative meant to look like a positive). Going by the design decisions behind the MS MARCO dataset, the choice of a classification target rather than a regression target seems to stem from the fact that only very sparse human relevance judgments were available.
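To make sure I'm describing the same objective, here is a minimal sketch of the triple-based loss I have in mind. `colbert_score` is a placeholder of my own for the model's late-interaction scorer, not the actual API in this repo:

```python
import torch
import torch.nn.functional as F

def triple_loss(colbert_score, query, pos_doc, neg_doc):
    # colbert_score(query, docs) is assumed to return a (batch,) tensor of
    # late-interaction relevance scores, one per query-document pair.
    scores = torch.stack([colbert_score(query, pos_doc),
                          colbert_score(query, neg_doc)], dim=-1)  # (batch, 2)
    # Treat each triple as a 2-way classification where index 0 (the positive)
    # should receive the higher score.
    labels = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores, labels)
```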

https://github.com/microsoft/MSMARCO-Document-Ranking#generation In particular, the passage stating "understanding that this ranking data may not be useful to train a Deep Learning style system we build the triples files" eludes me; the "ranking data" here seems to refer to binary relevance labels, not an actual list of documents ranked by their relevance to the query. Why wouldn't a dense ranking dataset (machine-generated rather than human-generated) be useful for training a DL model? From the description of the dataset creation, I understand that the system's goal is to rank each of these binarily labeled passages as highly as possible. Why wouldn't a training dataset that resembles the evaluation task be an appropriate dataset?

In contrast to the MS MARCO Document Ranking dataset, what if I had access to the entire ranking order for each query, generated not by a human judge but by an expensive ranker such as Word Mover's Distance or a cross-encoder? It seems that ColBERT was trained as a classification task because human judges were only asked to binarily label a sparse set of passages as relevant or not, but that wouldn't preclude someone from training ColBERT as a regression task given an adequate dataset. Is my assumption correct?

For instance, say I take a corpus of 1m documents and, for each document in a uniform 10% sample (100k documents), I generate a list of its 10 most similar documents, with a similarity score produced by Word Mover's Distance from which I can extract a ranking order. I would end up with 1m triples of the form <doc_id1, doc_id2, similarity_score>. I could clearly reduce this dataset to a classification task similar to the way MS MARCO was created, by thresholding the similarity score to stand in for "is_selected" and forming triples of the form <doc_id1, doc_id2 positive, doc_id3 negative>, but wouldn't I lose the useful information held in the similarity score? Posed this way, I would be asking ColBERT to perform a regression from the pair (doc_id1, doc_id2) to the similarity_score. Would this be feasible for ColBERT, or even a good idea to pursue? (A minimal sketch of what I have in mind follows below.)
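Concretely, the regression variant I am imagining would look something like this. Again, `colbert_score` is a placeholder scorer of my own, and I am assuming the WMD-derived similarity has been rescaled to roughly the range of the model's scores:

```python
import torch.nn.functional as F

def regression_loss(colbert_score, doc1, doc2, target_similarity):
    # target_similarity: (batch,) tensor of WMD-derived similarities, ideally
    # pre-normalized so its range is comparable to the model's MaxSim scores.
    predicted = colbert_score(doc1, doc2)
    return F.mse_loss(predicted, target_similarity)
```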

Right now I am going by the assumption that I can generate high-quality similarity scores between documents using Word Mover's Distance (an expensive O(n^2) operation), and that I could train a (faster than O(n^2)) transformer model like ColBERT to emulate WMD while also leveraging the knowledge baked into a pre-trained language model. But perhaps my assumption that WMD makes a good regression label is wrong, in which case I would need to rethink how I approach my problem, which ultimately is computing document similarity over large (>20m) corpora.
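For reference, I would generate the WMD labels offline along these lines; this sketch uses gensim's `KeyedVectors.wmdistance`, which is just one possible implementation (and requires an optimal-transport backend such as POT, depending on the gensim version):

```python
import gensim.downloader as api

# Any pre-trained word vectors would do; this model name is just an example.
vectors = api.load("glove-wiki-gigaword-50")

def wmd_similarity(doc1_tokens, doc2_tokens):
    # wmdistance returns a distance (lower = more similar); invert it so that
    # a larger value means "more similar", which is easier to use as a target.
    distance = vectors.wmdistance(doc1_tokens, doc2_tokens)
    return 1.0 / (1.0 + distance)
```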

Would you have any advice or thoughts on the above? Thank you very much!

okhat commented 2 years ago

Hi @paul-chelarescu ! Apologies for the slow response.

Your understanding of typical triples-based training is correct. So is your intuition that training with denser labels should help. The key challenge is how to obtain such dense labels, and indeed using a strong cross-encoder for score-distribution distillation (e.g., with KL-divergence, as in RocketQAv2) is very effective. I strongly doubt this would carry over to Word Mover's Distance, though.
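Roughly, the kind of distillation I mean looks like the following sketch (placeholder names, not code from this repo): the teacher is a cross-encoder, the student is the retriever, and both score the same set of candidate passages per query.

```python
import torch.nn.functional as F

def kl_distillation_loss(student_scores, teacher_scores):
    # Both tensors have shape (num_queries, num_passages_per_query).
    # The student's score distribution over each query's candidates is pushed
    # toward the teacher's distribution via KL-divergence.
    student_logprobs = F.log_softmax(student_scores, dim=-1)
    teacher_probs = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```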

Feel free to check out the supervision section of our ColBERTv2 paper for the approach we take there. Margin-MSE and RocketQA are two early distillation-based systems that are quite useful in this respect.
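For completeness, the Margin-MSE idea is to match the student's positive-negative score margin to the teacher's, roughly like this sketch (again just an illustration, not ColBERT code):

```python
import torch.nn.functional as F

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    # Each argument is a tensor of scores for the same (query, passage) pairs.
    # The loss matches margins rather than raw scores, so the student does not
    # need to reproduce the teacher's score scale.
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return F.mse_loss(student_margin, teacher_margin)
```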