stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Customize batch size for document encoding in ColBERT indexing #303

Closed · Diegi97 closed 4 months ago

Diegi97 commented 4 months ago

This pull request addresses issue #300 by allowing customization of the batch size used for document encoding during indexing in ColBERT.

Previously, the batch size for encoding documents during indexing was fixed at 64. Now you can specify it in the ColBERTConfig like this: `config = ColBERTConfig(root="/path/to/experiments", index_bsize=16)`
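
For context, here is a minimal end-to-end sketch of how this option might be used when indexing (the paths, checkpoint, and experiment/index names below are placeholders, not from this PR):

```python
from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig

# Sketch only: paths and names are illustrative placeholders.
with Run().context(RunConfig(nranks=1, experiment="my_experiment")):
    config = ColBERTConfig(
        root="/path/to/experiments",
        index_bsize=16,  # batch size for document encoding during indexing
    )
    indexer = Indexer(checkpoint="/path/to/colbert-checkpoint", config=config)
    indexer.index(name="my_index", collection="/path/to/collection.tsv")
```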

okhat commented 4 months ago

Looks perfect. Do you mind just adding a note here:

https://github.com/stanford-futuredata/ColBERT/blob/0bb135845c2b237a0b5f11b8f9a20a838924d395/colbert/indexer.py#L64

So we note that the 64 there is probably ignored internally, and tell the user that they can supply index_bsize instead.
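
For example, something roughly like this (just a sketch of the idea; the exact wording is up to you):

```python
# Note: the bsize=64 set here is likely overridden internally during indexing;
# users can control the encoding batch size via ColBERTConfig(index_bsize=...).
self.configure(bsize=64, partitions=None)
```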

Diegi97 commented 4 months ago

Should I just remove it and leave the call as self.configure(partitions=None)? To avoid confusion, it might be better to remove it than to leave a note.

detaos commented 4 months ago

I'd take it out since you're setting the default in settings.py.

okhat commented 4 months ago

I see your point, but keeping it guarantees the current behavior doesn't break. The bsize may technically affect other things we haven't considered. A note is enough, imo.

Diegi97 commented 4 months ago

Note added, thanks for your feedback!

detaos commented 4 months ago

Hi Diego, are you seeing a speed increase from using a larger batch size? I'm not:

Indexing speed comparison: 526,064 passages on 3× NVIDIA V100 (32 GB) GPUs.

| Batch size | Time          |
| ---------- | ------------- |
| 64         | 24m53.029s    |
| 128        | 23m52.150s    |
| 256        | 24m30.420s    |
| 512        | 24m01.362s    |
| 1024       | Crashed (OOM) |

Diegi97 commented 4 months ago

I did it mainly because CPU encoding is faster with lower batch sizes.
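
For instance, on a CPU-only machine one might lower the encoding batch size (the value below is just an illustration, not a tuned recommendation):

```python
# Sketch: smaller encoding batches can be faster when indexing on CPU.
config = ColBERTConfig(root="/path/to/experiments", index_bsize=8)
```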