This file is not "ordered". This is just a bag of candidates per query. You must then use colbert.rerank --batch to rerank these candidates.
This happens only when using batch mode.
@okhat I am planning to use batch mode, because if I do not use batch mode I run into a "CUDA out of memory" error. So once batch mode produces the unordered.tsv file, can I run the reranking step using the following command (does it look right)?
!python -m colbert.rerank --batch --log-scores --topk 500 \
--query_maxlen 512 --doc_maxlen 512 --mask-punctuation \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
--amp --queries small_test_queries.tsv \
--collection ./experiments/dirty/retrieve.py/2021-12-08_02.15.14/unordered.tsv \
--index_root ./experiments/indexes --index_name large_train_index
It looks right, yes, but batch mode will not really help you with CUDA OOM.
It's fine to use batch mode, though. You might have to modify the BSIZE in the ranking code so it handles fewer documents at a time, since your doc_maxlen and query_maxlen are extremely large by typical standards. I'd suggest 256x256, but even that is fairly large.
@okhat
The bsize parameter can be changed in the colbert/ranking/retrieval.py file, right?
So this file? https://github.com/stanford-futuredata/ColBERT/blob/abb5b684e4cda297f3ff58b52e70d5b90270e900/colbert/ranking/retrieval.py
Keep using batch mode, and modify the BSIZE line: reduce it to 1 << 9. This should work!
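For reference, the change amounts to something along these lines (a sketch only; the exact file and original value depend on your checkout of the ranking code):

# BSIZE = 1 << 14   # (assumed) original value -- whatever power of two your checkout uses
BSIZE = 1 << 9      # score 512 candidate documents per batch; smaller batches use less GPU memory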
This only affects re-ranking. You don't need to re-run the retrieve step.
Can I reduce the batch size in that file and use plain retrieval (not batch mode)?
It can be made to work, but it might require some code changes. You'll need to basically do the work in a loop over batches.
What I suggested is easier.
I see. After batch retrieval (passing --batch --retrieve_only to colbert.retrieve), I get an unordered.tsv file. If I want to rerank it, I need to pass unordered.tsv as an argument to colbert.rerank. Should it go to --topk or to --collection?
@okhat Okay, so I am guessing the --topk argument takes the path to the unordered.tsv file, because the reranking step reranks the top-k document candidates that were retrieved during the retrieval step?
--topk
Yes indeed
@okhat Another quick question - it seems like the final results in ranking.tsv (the output of the rerank command) have document ids offset by +1?
Here is an example: for query id 10, the similar document is document id 10, but document 11 is printed at rank 1 instead.
Here is another example: for query id 11, the similar documents are 12266 and 16549, but documents 12267 and 16550 are ranked at positions 1 and 3 respectively.
and of course I assume the ranking.tsv contains (query id, document id, rank, score) as columns
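To make the comparison concrete, a check along these lines can be used (a sketch only; it assumes ranking.tsv has the (query id, document id, rank, score) columns mentioned above, that the collection file has a header row, and that the paths are adjusted to the actual output locations):

import csv

# Map pid -> passage text from the collection (skipping the header row).
collection = {}
with open("test_collection.tsv") as f:
    rows = csv.reader(f, delimiter="\t")
    next(rows)
    for pid, text, *_ in rows:
        collection[pid] = text

# Print the top-ranked passage for each query to eyeball the +1 shift.
with open("ranking.tsv") as f:
    for qid, pid, rank, *_ in csv.reader(f, delimiter="\t"):
        if rank == "1":
            print(qid, pid, collection.get(pid, "<pid not in collection>")[:80])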
Can I see the head of your collection.tsv and queries.tsv? These are where the qids and pids come from, assuming the code wasn't changed.
@okhat
test_collection.tsv
small test queries
There are about 38,000 queries and about 78,000 candidate documents, but just for testing purposes I used 3 sample queries.
This is the command used for retrieval
!python -m colbert.retrieve --amp --doc_maxlen 512 --query_maxlen 512 --bsize 256 \
--queries small_test_queries.tsv --partitions 65536 --index_root ./experiments/indexes --index_name large_train_index \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn --batch --retrieve_only
and this is the command used for reranking
!python -m colbert.rerank --batch --log-scores --topk ./experiments/dirty/retrieve.py/2021-12-08_02.15.14/unordered.tsv \
--query_maxlen 512 --doc_maxlen 512 --mask-punctuation \
--checkpoint ./experiments/dirty/train.py/2021-12-06_08.01.48/checkpoints/colbert-32000.dnn \
--amp --queries small_test_queries.tsv \
--index_root ./experiments/indexes --index_name large_train_index --bsize 16
Okay, which passage starts with "A sign language"? Are you saying it's passage 10?
Yes that's right. It's passage with id 10.
Are you sure? The code will not add +1 to passage IDs
That is odd... then it means the relevant document did not appear in the top 1000, but it seems like too much of a coincidence that the top-ranked documents are exactly one index off from the documents that are actually relevant to the query.
Definitely. This is not a coincidence, given it happens for both queries. It's finding the right document, but something in the setup pushes it off by one. The code doesn't do that, though.
You might wanna check your indexing code, just to be sure
script*, not code
I will look into this issue and let you know if I find something.
This is an unrelated question, but how does batch retrieval actually work? Is it like first-stage retrieval, where it retrieves a set of candidate documents from the entire corpus? Does it involve calculating scores as well (i.e., does it compute rough scores to pull in candidates but leave the ranking to the reranking step)?
The first step just finds any document that has at least one embedding that's nearby to at least one query vector.
This doesn't involve scoring the full document.
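Conceptually (this is not the actual ColBERT code, just a sketch of the idea), the step amounts to a nearest-neighbour lookup for each query token embedding, followed by mapping the matched embeddings back to passage ids:

def candidate_pids(query_token_embs, ann_index, emb2pid, k=1024):
    # query_token_embs: one vector per query token (e.g. a float32 numpy array).
    # ann_index: an approximate nearest-neighbour index (e.g. FAISS) built over
    #            every passage-token embedding in the corpus.
    # emb2pid: maps an embedding's position in that index back to its passage id.
    _, emb_ids = ann_index.search(query_token_embs, k)   # nearest embeddings per query token
    pids = {emb2pid[i] for row in emb_ids for i in row}  # union over all query tokens
    return pids  # an unordered candidate set -- no document-level scores yet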
I see. Oh, and I think I might know why that +1 thing happened - I remember fixing the code in a place where an assert statement checks the pid, something to do with the line index.
As I remember it: for the first row, which is the header (id, query), the assert checks that pid is 'id', so it passes without a problem. But for the next row, line_idx (which is 1 at that point) does not match the id (which starts from zero), so I changed it to something like
assert pid == 'id' or pid == line_idx - 1
I think that is probably why. So I guess I had to make all ids start from 1 instead of 0.
@okhat Okay, so here it is: this is what I changed in the colbert/indexing/encoder.py file.
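Roughly, in the loop that reads the collection line by line, the change was along these lines (reconstructed from the description above, with an int cast so the comparison is between numbers; the exact surrounding code in encoder.py may differ):

# inside the loop over collection.tsv lines, where line_idx counts lines from 0
pid, passage, *rest = line.strip().split('\t')

# original check (as I understand it): the header row has pid == 'id', and each
# data row's pid is expected to equal its line index, i.e. ids starting from 1
# assert pid == 'id' or int(pid) == line_idx

# relaxed check that accepts my 0-based ids (the data row at line_idx 1 has pid 0, etc.)
assert pid == 'id' or int(pid) == line_idx - 1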
Cool looks like you figured it out!
@okhat
Because I have lots of queries that I want to process, I wanted to retrieve in batches, so I used the colbert.retrieve command shown above (with --batch --retrieve_only).
Doing so creates a file "unordered.tsv", but the results in the file look weird:
From my understanding, the columns are (query id, document id, rank), but the rank column is filled up with -1.
When I run validation on a single query using ColBERT on the fly, it produces pretty good results, but of course it is slow because I have not done the necessary preprocessing. (I believe this suggests that my model has been trained properly, so the issue probably does not lie with the BERT model itself.)
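For what it's worth, a quick way to inspect that file is something like this (a sketch only; it assumes the tab-separated (query id, document id, rank) layout described above, and the path should point to the actual unordered.tsv produced by the retrieve run):

import csv
from collections import Counter

candidates_per_query = Counter()
with open("unordered.tsv") as f:
    for qid, pid, rank in csv.reader(f, delimiter="\t"):
        candidates_per_query[qid] += 1        # rank is just a -1 placeholder at this stage

for qid, n in sorted(candidates_per_query.items()):
    print(f"query {qid}: {n} unranked candidates")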