Closed Diegi97 closed 4 months ago
Looks perfect. Do you mind just adding a note here, so we know that the 64 is probably ignored internally, and to tell the user that they can supply `index_bsize`?
Should I remove that and leave it as `self.configure(partitions=None)`? To avoid confusion, it may be better to remove it than to leave a note.
I'd take it out since you're setting the default in settings.py.
I see your point, but keeping it guarantees the current behavior doesn't break; the bsize could technically affect other things we didn't consider. A note is enough imo.
Note added, thanks for your feedback!
Hi Diego, are you seeing a speed increase from using a larger batch size? I'm not:
Indexing speed comparison, indexing 526,064 passages with 3x Nvidia V100 (32GB)
| Batch Size | Time          |
| ---------- | ------------- |
| 64         | 24m53.029s    |
| 128        | 23m52.150s    |
| 256        | 24m30.420s    |
| 512        | 24m01.362s    |
| 1024       | Crashed (OOM) |
I did it mainly because the CPU encoding is faster with lower batch sizes.
This pull request addresses issue #300, allowing customization of the batch size used for encoding during indexing in ColBERT.
Previously, the batch size for encoding documents when indexing was fixed at 64. Now you can specify it in the `ColBERTConfig` like this:
```python
config = ColBERTConfig(
    root="/path/to/experiments",
    index_bsize=16,
)
```
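For intuition, here is a minimal, self-contained sketch of what a configurable `index_bsize` controls: splitting the collection into encode batches. The `chunk` helper and the hard-coded 64 default are illustrative only, mirroring the old fixed behavior described above, not ColBERT's actual internals.

```python
def chunk(passages, index_bsize=64):
    """Yield successive batches of at most index_bsize passages.

    index_bsize=64 mirrors the previously hard-coded default; a smaller
    value can help when encoding on CPU, as noted in the discussion.
    """
    for i in range(0, len(passages), index_bsize):
        yield passages[i:i + index_bsize]

passages = [f"passage {i}" for i in range(150)]
batches = list(chunk(passages, index_bsize=16))
print(len(batches))      # 10 batches of up to 16 passages
print(len(batches[-1]))  # the last batch holds the remainder: 6
```

As the benchmark above suggests, on GPU the end-to-end indexing time is largely insensitive to this value until the batch no longer fits in memory (OOM at 1024).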