Closed JamesDeAntonis closed 3 years ago
You just need to use batch-mode retrieval and ranking!
Just keep in mind it's two steps, not one. There are some instructions in the README. Let me know if you face issues using them.
Batch retrieval loads only the compressed FAISS index and retrieves the initial (unsorted) set of passages. Batch re-ranking then streams over the index one part at a time, so at any point it holds only a tiny fraction of the index in memory.
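The part-at-a-time idea can be sketched in plain NumPy. This is not the repo's implementation, just a minimal illustration of why memory stays bounded: each shard of embeddings is scored and discarded, and only a small top-k heap survives across shards (the shard layout and `stream_rerank` helper below are hypothetical).

```python
import heapq
import numpy as np

def stream_rerank(query, shards, k=3):
    """Score shards one at a time, keeping only the global top-k.

    `shards` is an iterable of (passage_id_offset, embedding_matrix)
    pairs; only one shard's embeddings are resident at any point.
    """
    heap = []  # min-heap of (score, passage_id); smallest kept score on top
    for pid_offset, shard in shards:
        scores = shard @ query  # dot-product relevance, one score per passage
        for i, s in enumerate(scores):
            item = (float(s), pid_offset + i)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    # Best-first order for the final ranking.
    return sorted(heap, reverse=True)

# Hypothetical on-disk shards, simulated in memory: 5 shards of 100 passages.
rng = np.random.default_rng(0)
dim = 8
shards = [(i * 100, rng.standard_normal((100, dim))) for i in range(5)]
query = rng.standard_normal(dim)
top = stream_rerank(query, shards, k=3)
```

In the real system each shard would be a file loaded from disk inside the loop, so peak memory is one shard plus the heap, regardless of total index size.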
Very cool!
By two-step, are you referring to how the second (re-ranking) step of end-to-end isn't implemented yet, as suggested here?
The second step is implemented. You just need to use a different script: `colbert.retrieve`, then `colbert.rerank` (giving it the output topk).
What isn't implemented is two steps from one script, which would be nice to have eventually. But this shouldn't affect your goals above!
Yeah, to clarify, I meant that we can't fully do end-to-end in one shot; instead we currently have to call retrieve and then rerank (I think that's what you said).
Precisely! Give it a run. It should be really fast and smooth, I hope :D
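For reference, the two-step pipeline looks roughly like this on the command line. The flag names below are illustrative assumptions, not the exact interface; check the README for the arguments your version of the repo expects.

```bash
# Step 1 (assumed invocation): batch retrieval over the compressed FAISS index,
# producing an initial unsorted top-k file.
python -m colbert.retrieve \
    --queries queries.tsv \
    --index_root indexes/ --index_name wiki.index \
    --checkpoint colbert.dnn

# Step 2 (assumed invocation): batch re-ranking, streaming the index
# one part at a time; pass it the topk output from step 1.
python -m colbert.rerank \
    --topk output.ranking.tsv \
    --index_root indexes/ --index_name wiki.index \
    --checkpoint colbert.dnn
```

Because step 2 streams index parts, this pipeline should fit well under the memory budget discussed above.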
This seems to be working properly!
I am also having some pain trying to use Hugging Face's model. I noticed that the paper says the output dimension used is 128, and that is the default in this repo, but the HF pretrained model uses 768. I plan to use 128 because I don't have space for 768, so I'll probably nix Hugging Face entirely, outside of how it's used in this repo.
Do you have the dim 128 model saved anywhere, as used in the paper?
Not sure if we corresponded about this by email, but as I mentioned to some other folks, I'm happy to share a checkpoint with you if you reach out by email!
What's the easiest way to use ColBERT without loading the full index into memory? We are building an index off of the `wiki_dpr` dataset (and eventually more), so we have about 21 million passages and counting. The full index is about 630 GB on disk and we have 230 GB of memory to work with (hopefully not needing nearly the full 230). I understand that FAISS allows for this type of search (only metadata gets loaded into memory and the actual vectors stay on disk), so I'm curious whether you support this in the current repo.

When running the retrieval script in the README, I run into memory issues once I start building `IndexPart` here. Should I be doing something with the `index_part` param? Any insight would be greatly appreciated. Thanks!
(UPDATE: fyi, the retrieve script runs properly on a tiny dev subset of the dataset)