Closed gm0616 closed 3 years ago
Hi @gm0616,
Actually, the code is almost exactly the same, with a couple of additional short scripts. To start, you just need to split the long documents into passages and create triples for supervision. "MaxP" just means MaxPassage, that is, we assign each document the score of its highest-scoring passage.
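The MaxP aggregation described above can be sketched in a few lines. This is just an illustration of the idea, not code from this repo; the function and variable names are made up:

```python
# MaxP (MaxPassage) sketch: a document's score is the score of its
# highest-scoring passage. `passage_scores` is a list of (doc_id, score)
# pairs, one per scored passage; names here are illustrative.
def maxp_score(passage_scores):
    doc_scores = {}
    for doc_id, score in passage_scores:
        # keep the maximum passage score seen so far for each document
        doc_scores[doc_id] = max(score, doc_scores.get(doc_id, float("-inf")))
    return doc_scores

scores = maxp_score([("d1", 0.2), ("d1", 0.9), ("d2", 0.5)])
# d1 gets 0.9 (its best passage), d2 gets 0.5
```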
I agree releasing the extra instructions and scripts will be useful for others. I will update here on this.
Thanks for the detailed reply. I think I've got the idea of the process.
To start, you just need to split the long documents into passages and create triples for supervision.
And what are your experiment settings here?
The maximum length was 450 BERT tokens. I applied no hyperparameter tuning over this, however, so it's possible that other choices work too. The doc stride was also untuned. For this long-document task, I think it's around 60 BERT tokens, IIRC.
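A minimal sketch of how those two numbers could be used to split a tokenized document into overlapping passages. This is an assumption about the splitting logic, not the actual ColBERT code; here "stride" is read as the step between consecutive window starts (the `doc_stride` convention from BERT's SQuAD scripts), though other readings are possible:

```python
# Sliding-window passage splitter over BERT tokens (illustrative).
# window=450 and stride=60 match the numbers mentioned above.
def split_into_passages(tokens, window=450, stride=60):
    passages = []
    start = 0
    while start < len(tokens):
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already covers the end of the document
        start += stride
    return passages
```

With this reading, consecutive windows overlap heavily (by `window - stride` tokens), which is what the later comment about the same token appearing in multiple segments is getting at.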
To get initial training triples, we use zero-shot transfer of ColBERT trained on the passage task of MS MARCO. I suspect you can also start from BM25. We then divide the top-1000 passages retrieved into buckets: positive, negative, ignored. Negatives are passages that come from negative documents, positives are the best (one or more) passages that come from the positive document, and the rest are ignored (they are technically very weak positives; hence, not used in training).
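The bucketing above can be sketched as follows. The function and argument names are hypothetical, and this simplification keeps only the top passage(s) of the positive document as positives, as described:

```python
# Hedged sketch of building training buckets from retrieved passages.
# `retrieved` is a list of (passage_id, doc_id, score) sorted by score
# descending; `positive_doc_ids` is the set of relevant document IDs.
def bucket_passages(retrieved, positive_doc_ids, num_positives=1):
    positives, negatives = [], []
    for pid, doc_id, score in retrieved:
        if doc_id in positive_doc_ids:
            if len(positives) < num_positives:
                positives.append(pid)  # best passage(s) of the positive doc
            # other passages of the positive doc are ignored
            # (technically very weak positives, so not used in training)
        else:
            negatives.append(pid)      # any passage from a negative doc
    return positives, negatives

# Training triples then pair the query with one positive and one negative:
# (query, pos, neg) for pos in positives, neg in negatives
```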
Thank you for the fast response! The method of constructing triplets sounds like a great idea, I will try that as well. Thanks!!
By the way, let me know if you need the ColBERT ranking output on this task (e.g., if you'd like to re-rank it). We're happy to share/release it.
I think it's around 60 BERT tokens
I wonder whether using a sliding-window technique hinders the retrieval of lower-ranked documents in any way.
In a sliding-window implementation, the same token will appear in multiple segments. So during the first stage of retrieval, many of the top-k' token embeddings will essentially be slight variations of the same token. In effect, this could prevent tokens from other documents from being retrieved.
Is this something one needs to be wary of, @okhat?
Hi, when it comes to long-document ranking, how can ColBERT be used to solve the problem? I see that your team has submitted a "ColBERT MaxP end-to-end" model to the MS MARCO Document Ranking leaderboard; would you mind releasing the code and updating this repository?