stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.89k stars 374 forks

Expected size of index? #143

Closed lboesen closed 1 year ago

lboesen commented 1 year ago

Hi,

First of all, great work!

Do you have an idea of how large the resulting index is when using the MS MARCO (v1) passage ranking dataset?

danielfleischer commented 1 year ago

Not the author, but we can report that when using the MS MARCO checkpoint with a collection of ~20M passages, 128 dimensions, and 2-bit quantization, we get an index size of 77GB.

lboesen commented 1 year ago

Thanks for your comment! I'm guessing, then, that the MS MARCO collection of ~8.8M passages with similar parameters will come to around 40GB.
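
For reference, the guess above can be checked with a simple linear extrapolation from the reported 77GB / ~20M-passage index. This is only a minimal sketch: the function name is made up for illustration, and it assumes index size grows in proportion to passage count, which holds only if the two collections have a similar average passage length.

```python
def scale_index_size(known_size_gb: float, known_passages: int, target_passages: int) -> float:
    """Linearly extrapolate index size from a known data point.

    Assumption (not from the thread): size is proportional to the number
    of passages, i.e. both collections have similar passage lengths.
    """
    return known_size_gb * target_passages / known_passages


# Data point reported in this thread: 77GB for ~20M passages (128d, 2-bit).
estimate = scale_index_size(77, 20_000_000, 8_800_000)
print(f"Estimated index size: {estimate:.1f} GB")
```

Under that assumption the extrapolation lands somewhat below 40GB, so the guess is in the right range but on the high side.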

okhat commented 1 year ago

@lboesen Which branch are you using?

The v1 index sizes are in https://arxiv.org/abs/2004.12832 (MS MARCO: ~140GB, iirc)

The v2 (main) index sizes are in https://arxiv.org/abs/2205.09707 (MS MARCO with nbits=2: ~22GB)

Hope this helps. Closing.
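
For anyone wanting a back-of-the-envelope estimate for their own collection, here is a rough sketch of how the v2 size scales with `nbits`. It assumes each token embedding is stored as a 4-byte centroid ID plus `dim * nbits / 8` bytes of quantized residual, and it ignores other index files, so real indexes will be somewhat larger. The helper names and the tokens-per-passage value are hypothetical, not taken from the ColBERT codebase; measure the token count on your own data.

```python
def bytes_per_embedding(dim: int = 128, nbits: int = 2, centroid_id_bytes: int = 4) -> int:
    """Storage per token embedding: centroid ID plus quantized residual.

    Assumption: the residual uses `nbits` bits per dimension and the
    centroid ID is a 4-byte integer.
    """
    return centroid_id_bytes + dim * nbits // 8


def estimate_index_gb(num_passages: int, avg_tokens_per_passage: float,
                      dim: int = 128, nbits: int = 2) -> float:
    """Rough index size in GB (10^9 bytes), embeddings only."""
    num_embeddings = num_passages * avg_tokens_per_passage
    return num_embeddings * bytes_per_embedding(dim, nbits) / 1e9


# Hypothetical input: ~8.8M passages; 70 tokens per passage is an assumed
# average for illustration, not a measured value.
size_gb = estimate_index_gb(8_800_000, avg_tokens_per_passage=70, nbits=2)
print(f"Approximate embedding storage: {size_gb:.1f} GB")
```

The estimate is linear in both the number of passages and the average passage length, so longer passages can easily account for differences between collections of similar size.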