stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.67k stars 355 forks

Returning passage ID in addition to passage index #326

Open jenhsia opened 3 months ago

jenhsia commented 3 months ago

In the original repo, the index corpus TSV file requires the pid to be an integer, but in some cases we want to use a passage ID (string) instead of a passage index (int). These commits allow the pid to be a non-integer and give easy access to the passage IDs after passage ranking.

If we save the passage-index-to-passage-id list (pid_list) in searcher.collection, we can use it to look up the passage_id after ranking as follows.

for query_id in ranking.data:
    for (passage_index, rank, score) in ranking.data[query_id]:
        passage_id = searcher.collection.pid_list[passage_index]
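
For illustration, here is a minimal, self-contained sketch of the same idea outside of ColBERT. It assumes a two-column TSV (`passage_id<TAB>text`) and a ranking dict shaped like `ranking.data` above. The helper names `load_collection_with_ids` and `attach_passage_ids` are hypothetical and are not part of the ColBERT API or of this PR.

```python
import csv
import io


def load_collection_with_ids(tsv_file):
    """Read `passage_id<TAB>text` rows and return (pid_list, passages).

    pid_list[i] is the original (string) passage ID of the passage at
    integer index i, so the index can still be used internally.
    """
    pid_list, passages = [], []
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip blank lines
            continue
        passage_id, text = row[0], row[1]
        pid_list.append(passage_id)  # kept as a string, no int() cast
        passages.append(text)
    return pid_list, passages


def attach_passage_ids(ranking_data, pid_list):
    """Map every (passage_index, rank, score) to (passage_id, rank, score)."""
    return {
        query_id: [(pid_list[idx], rank, score) for idx, rank, score in hits]
        for query_id, hits in ranking_data.items()
    }


# Toy data standing in for a real collection file and a real ranking.
sample_tsv = io.StringIO("doc-a\tfirst passage\ndoc-b\tsecond passage\n")
pid_list, passages = load_collection_with_ids(sample_tsv)

ranking_data = {"q1": [(1, 1, 0.9), (0, 2, 0.4)]}
resolved = attach_passage_ids(ranking_data, pid_list)
```

Keeping the integer index internally and translating only at the end means the indexing and search code does not need to know about string IDs.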

timbmg commented 2 months ago

Thanks @jenhsia! This is also something that would be very helpful to me. Would be great if one of the maintainers could check this? 😇 @santhnm2 @okhat

timbmg commented 2 months ago

BTW, it would also be good to remove the requirement for qids to be integers. @jenhsia, maybe you could amend your PR and also comment out this line in evaluation/loaders.py:

qid = int(qid)
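
As a rough sketch of what dropping that cast could look like, the loader below keeps each qid as the string found in the file. It assumes a `qid<TAB>query` TSV. The function name `load_queries_str_ids` is hypothetical and this is not the actual code in evaluation/loaders.py.

```python
import io
from collections import OrderedDict


def load_queries_str_ids(tsv_file):
    """Load `qid<TAB>query` lines, keeping qid as a string."""
    queries = OrderedDict()
    for line in tsv_file:
        line = line.rstrip("\n")
        if not line:  # skip blank lines
            continue
        qid, query, *_ = line.split("\t")
        # Intentionally no `qid = int(qid)` here: any string is accepted.
        if qid in queries:
            raise ValueError(f"Duplicate query ID: {qid!r}")
        queries[qid] = query
    return queries


# Toy input mixing a non-numeric and a numeric-looking ID.
queries = load_queries_str_ids(
    io.StringIO("q-1\twhat is colbert?\n2\tlate interaction\n")
)
```

Note that numeric-looking IDs such as `2` stay strings too, so any downstream code that compares qids against integers would need the same change.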