stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

long document ranking #11

Closed. gm0616 closed this issue 3 years ago.

gm0616 commented 3 years ago

Hi, when it comes to long document ranking, how can ColBERT be used to solve the problem? I see that your team has submitted a "ColBERT MaxP end-to-end" model to the MS MARCO Document Ranking Leaderboard. Would you mind releasing the code and updating this repository?

okhat commented 3 years ago

Hi @gm0616,

Actually, the code is almost exactly the same, with a couple of additional short scripts. To start, you just need to split the long documents into passages and create triples for supervision. "MaxP" just means MaxPassage, that is, we assign each document the score of its highest-scoring passage.
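
For illustration, here is a minimal sketch of that MaxP aggregation, assuming you already have one ColBERT score per passage along with the id of the document it came from (the function and variable names below are just placeholders, not code from this repo):

```python
from collections import defaultdict

def maxp_document_scores(passage_scores):
    """Collapse passage-level scores into document-level scores (MaxP).

    `passage_scores` is assumed to be an iterable of (doc_id, score) pairs,
    one per passage of a split document. Each document receives the score
    of its highest-scoring passage.
    """
    doc_scores = defaultdict(lambda: float("-inf"))
    for doc_id, score in passage_scores:
        if score > doc_scores[doc_id]:
            doc_scores[doc_id] = score
    return dict(doc_scores)

# Example: two documents, each with two scored passages.
scores = [("D1", 31.2), ("D1", 28.7), ("D2", 25.4), ("D2", 33.9)]
print(maxp_document_scores(scores))  # {'D1': 31.2, 'D2': 33.9}
```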

I agree that releasing the extra instructions and scripts would be useful for others. I will post an update here on this.

gm0616 commented 3 years ago

Thanks for the detailed reply. I think I've got the general idea of the process.

> To start, you just need to split the long documents into passages and create triples for supervision.

What were your experimental settings here?

okhat commented 3 years ago

The maximum passage length was 450 BERT tokens. I didn't tune this hyperparameter, however, so it's possible that other choices work too. The doc stride was also untuned; for this long-document task, I think it's around 60 BERT tokens, IIRC.
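
For concreteness, a passage splitter along these lines could look like the sketch below. The window size and stride simply mirror the numbers above, and `split_into_passages` is a purely illustrative helper, not the actual script behind the leaderboard submission:

```python
from transformers import AutoTokenizer

def split_into_passages(text, tokenizer, max_tokens=450, stride=60):
    """Split one long document into overlapping passages.

    The window holds up to `max_tokens` BERT wordpieces and advances by
    `stride` tokens, so consecutive passages overlap heavily. The defaults
    mirror the numbers above, but neither value was tuned.
    """
    tokens = tokenizer.tokenize(text)
    passages = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + max_tokens]
        passages.append(tokenizer.convert_tokens_to_string(window))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
    return passages

# Usage (hypothetical): build one passage list per long MS MARCO document.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
passages = split_into_passages("some very long document text ...", tokenizer)
```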

To get initial training triples, we use zero-shot transfer of ColBERT trained on the passage task of MS MARCO. I suspect you could also start from BM25. We then divide the top-1000 retrieved passages into three buckets: positive, negative, and ignored. Negatives are passages that come from negative documents, positives are the best (one or more) passages that come from the positive document, and the rest are ignored (they are technically very weak positives; hence, not used in training).
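
Roughly, that bucketing could be sketched as follows, assuming each retrieved passage is tagged with its source document id and we know which document is labeled positive for the query (the data layout and names here are hypothetical):

```python
def bucket_passages(ranked_passages, positive_doc_id, num_positives=1):
    """Bucket the top-k retrieved passages for one query.

    `ranked_passages` is assumed to be a list of (passage_id, doc_id) pairs
    sorted by descending retrieval score. Passages from negative documents
    become negatives; the best `num_positives` passages from the positive
    document become positives; its remaining passages are ignored, since
    they are only very weak positives.
    """
    positives, negatives, ignored = [], [], []
    for passage_id, doc_id in ranked_passages:
        if doc_id != positive_doc_id:
            negatives.append(passage_id)
        elif len(positives) < num_positives:
            positives.append(passage_id)
        else:
            ignored.append(passage_id)
    return positives, negatives, ignored

# Training triples (query, positive passage, negative passage) are then
# drawn from the positives and negatives buckets only.
```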

gm0616 commented 3 years ago

Thank you for the fast response! The method of constructing triples sounds like a great idea; I will try that as well. Thanks!

okhat commented 3 years ago

By the way, let me know if you need the ColBERT ranking output on this task (e.g., if you'd like to re-rank it). We're happy to share/release it.

ashokrajab commented 1 year ago

> I think it's around 60 BERT tokens

I wonder whether using a sliding-window technique hinders the retrieval of lower-ranked documents in any way.

In a sliding-window implementation, the same token will appear in multiple segments. So during the first stage of retrieval, many of the top-k' token embeddings will essentially be slight variations of the same token, which would in effect prevent tokens from other documents from being retrieved.

Is this something one needs to be wary of, @okhat?