stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

question about unordered.tsv #81

Closed puzzlecollector closed 2 years ago

puzzlecollector commented 2 years ago

@okhat

Because I have lots of queries that I want to process, I wanted to run retrieval in batches, so I used the following command:

!python -m colbert.retrieve --amp --doc_maxlen 512 --query_maxlen 512 --bsize 1 \
--queries small_test_queries.tsv --partitions 65536 --index_root ./experiments/indexes --index_name large_train_index \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
--depth 10000 --batch --retrieve_only

Doing so creates a file "unordered.tsv", but the results in the file look weird:

[Screenshot: head of unordered.tsv, showing -1 in the rank column]

From my understanding, the columns are (query id, document id, rank), but the rank column is filled with -1.

When I run validation on a single query using ColBERT on the fly, it produces pretty good results, though of course it is slow because I have not done the necessary preprocessing. (I believe this suggests that my model has been trained properly, so the issue probably does not lie with BERT itself.)

okhat commented 2 years ago

This file is not "ordered"; it is just a bag of candidates per query. You must then use colbert.rerank --batch to rerank these candidates.

This happens only when using batch mode.
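
For illustration, a few hypothetical lines of unordered.tsv would look like this (made-up qids and pids; the -1 just means no rank has been assigned yet):

10	4821	-1
10	117	-1
10	9930	-1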

puzzlecollector commented 2 years ago

@okhat I am planning to use batch mode, because if I do not use it I run into a "CUDA out of memory" error. So, with batch mode, once I obtain the generated unordered.tsv file, can I run the rerank process using the following command (does it look right)?

!python -m colbert.rerank --batch --log-scores --topk 500 \
--query_maxlen 512 --doc_maxlen 512 --mask-punctuation \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
--amp --queries small_test_queries.tsv \
--collection ./experiments/dirty/retrieve.py/2021-12-08_02.15.14/unordered.tsv \
--index_root ./experiments/indexes --index_name large_train_index

okhat commented 2 years ago

It looks right, yes, but batch mode will not really help you with CUDA OOM.

It's fine to use batch mode, though. You might have to modify the BSIZE in the ranking code to handle fewer documents at a time, since your doc_maxlen and query_maxlen are extremely large by typical standards. I'd suggest 256x256, but even that is fairly large.

puzzlecollector commented 2 years ago

@okhat

The bsize parameter can be changed in the file colbert/ranking/retrieval.py, right?

So this file? https://github.com/stanford-futuredata/ColBERT/blob/abb5b684e4cda297f3ff58b52e70d5b90270e900/colbert/ranking/retrieval.py

okhat commented 2 years ago

Keep using batch mode. And modify this line:

https://github.com/stanford-futuredata/ColBERT/blob/abb5b684e4cda297f3ff58b52e70d5b90270e900/colbert/ranking/index_ranker.py#L11

Reduce it to 1 << 9. This should work!
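
Assuming that line currently defines the constant as something like BSIZE = 1 << 14 (the exact value may differ across commits), the change is a one-liner:

# colbert/ranking/index_ranker.py, line 11
BSIZE = 1 << 9   # i.e., score 512 documents at a time instead of the default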

okhat commented 2 years ago

This only affects re-ranking. You don't need to re-run the retrieve step.

puzzlecollector commented 2 years ago

Can I reduce the batch size in that file and use retrieval only (not batch mode)?

okhat commented 2 years ago

It can be made to work, but it might require some code changes. You'd basically need to do the work in a loop over batches.

What I suggested is easier.
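
For reference, the general shape of that loop (a minimal sketch with hypothetical names, not ColBERT's actual API) would be:

import torch

def score_in_chunks(score_fn, query, doc_ids, chunk_size=512):
    # Score candidate documents a chunk at a time so the intermediate
    # tensors fit in GPU memory; score_fn stands in for whatever the
    # ranking code uses to compute per-document scores.
    scores = []
    for i in range(0, len(doc_ids), chunk_size):
        with torch.no_grad():
            scores.append(score_fn(query, doc_ids[i:i + chunk_size]))
    return torch.cat(scores)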

puzzlecollector commented 2 years ago

I see. After batch retrieval (passing --batch --retrieve_only to colbert.retrieve), I get an unordered.tsv. If I want to rerank it, I need to pass unordered.tsv as an argument to colbert.rerank. Should the argument for unordered.tsv be --topk or --collection?

puzzlecollector commented 2 years ago

@okhat Okay, so I am guessing the --topk argument takes the path to unordered.tsv, because the reranking step reranks the top-k document candidates that were retrieved during the retrieval step?

okhat commented 2 years ago

--topk

okhat commented 2 years ago

Yes indeed

puzzlecollector commented 2 years ago

@okhat Another quick question: it seems like the ranking.tsv output (after the rerank command) reports final results with the document index + 1?

puzzlecollector commented 2 years ago

[Screenshot: ranking.tsv results for query id 10]

This is an example. For query id 10, the relevant document is document id 10, but document 11 is printed at rank 1 instead.

[Screenshot: ranking.tsv results for query id 11]

This is another example. For query id 11, the relevant documents are 12266 and 16549, but documents 12267 and 16550 are ranked top 1 and top 3, respectively.

puzzlecollector commented 2 years ago

And of course I assume ranking.tsv contains (query id, document id, rank, score) as columns.
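
If so, a made-up excerpt (the scores and the rank-2 row are invented) would look like:

11	12267	1	75.4
11	4501	2	72.0
11	16550	3	71.3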

okhat commented 2 years ago

Can I see the head of your collection.tsv and queries.tsv? These are the qids and pids used, assuming the code wasn't changed.
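
For reference, both files are expected to be plain TSVs with 0-based integer ids in the first column, along these lines:

collection.tsv (pid, tab, passage text):
0	<first passage text>
1	<second passage text>

queries.tsv (qid, tab, query text):
0	<first query text>
1	<second query text>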

puzzlecollector commented 2 years ago

@okhat

test_collection.tsv

[Screenshot: head of test_collection.tsv]

puzzlecollector commented 2 years ago

small test queries

[Screenshot: head of small_test_queries.tsv]

puzzlecollector commented 2 years ago

[Screenshot: query and document counts]

There are about 38,000 queries and about 78,000 candidate documents, but just for testing purposes I used 3 query samples.

puzzlecollector commented 2 years ago

This is the command used for retrieval

!python -m colbert.retrieve --amp --doc_maxlen 512 --query_maxlen 512 --bsize 256 \
--queries small_test_queries.tsv --partitions 65536 --index_root ./experiments/indexes --index_name large_train_index \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn --batch --retrieve_only 

and this is the command used for reranking

!python -m colbert.rerank --batch --log-scores --topk ./experiments/dirty/retrieve.py/2021-12-08_02.15.14/unordered.tsv \
--query_maxlen 512 --doc_maxlen 512 --mask-punctuation \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
--amp --queries small_test_queries.tsv \
--index_root ./experiments/indexes --index_name large_train_index --bsize 16

okhat commented 2 years ago

Okay, which passage starts with "A sign language"? Are you saying it's passage 10?

puzzlecollector commented 2 years ago

Yes that's right. It's passage with id 10.

okhat commented 2 years ago

Are you sure? The code will not add +1 to passage IDs.

puzzlecollector commented 2 years ago

That is odd... then it would mean that the relevant document did not appear in the top 1000, but it seems too much of a coincidence for the top-ranking documents to be off by exactly one index from the documents that are actually relevant to the query.

okhat commented 2 years ago

Definitely. This is not a coincidence for both queries. It's finding the right document, but something in the setup pushes it off by one. The code doesn't do that, though.

okhat commented 2 years ago

You might wanna check your indexing script, just to be sure.

puzzlecollector commented 2 years ago

I will look into this issue and let you know if I find something.

This is an unrelated question, but how does batch retrieval actually work? Is it like first-stage retrieval, where it retrieves a bunch of relevant candidate documents from the entire corpus? Does it involve calculating scores as well (i.e., does it calculate scores to roughly retrieve candidate documents but not rank them, leaving that for the reranking step)?

okhat commented 2 years ago

The first step just finds any document that has at least one embedding near at least one query vector.

This doesn't involve scoring the full document.
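
In code terms, here is a rough, self-contained sketch of that first stage, with toy data and a flat index standing in for ColBERT's actual compressed FAISS index:

import numpy as np
import faiss

dim, n_doc_embeddings, n_query_tokens, k = 128, 10000, 32, 10

# Toy stand-ins: ColBERT stores one embedding per document token.
doc_embeddings = np.random.randn(n_doc_embeddings, dim).astype('float32')
query_embeddings = np.random.randn(n_query_tokens, dim).astype('float32')
emb_to_pid = np.random.randint(0, 500, size=n_doc_embeddings)  # token embedding -> passage id

index = faiss.IndexFlatIP(dim)  # stand-in; the real index is an IVFPQ index
index.add(doc_embeddings)

# Each query token embedding pulls its k nearest document token embeddings;
# the passages that own them form the unordered candidate bag.
# No full MaxSim scoring of whole documents happens at this stage.
_, ids = index.search(query_embeddings, k)
candidates = set(emb_to_pid[i] for i in ids.flatten())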

puzzlecollector commented 2 years ago

I see. Oh, and I think I might know why that +1 thing happened: I remember changing the code where an assert statement checks the pid, something to do with the line index.

As I remember it: the first row of my file is the column header (id, query), and the assert checks whether pid is 'id', so the header passes without a problem. But for the next row, line_idx (which is 1 at that point) does not match the id (which starts from zero), so I changed it to something like:

assert pid == 'id' or pid == line_idx - 1
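
For context, the original check (paraphrasing colbert/indexing/encoder.py; the exact code may differ) looks roughly like this:

for line_idx, line in enumerate(open('collection.tsv')):
    pid, passage, *rest = line.strip().split('\t')
    # pids must equal the 0-based line index, because downstream
    # the passages are identified by their position in the file
    assert pid == 'id' or int(pid) == line_idx

With a header row occupying line_idx 0, every passage sits one line below its own id, so relaxing the assert lets indexing proceed while the passages remain identified by line_idx, i.e. shifted by +1.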

puzzlecollector commented 2 years ago

I think that is probably why. So I guess I should have made all ids start from 1 instead of 0.

puzzlecollector commented 2 years ago

[Screenshot: the modified assert in colbert/indexing/encoder.py]

@okhat Okay, so here it is: this is what I changed in the colbert/indexing/encoder.py file.

okhat commented 2 years ago

Cool, looks like you figured it out!