stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

troubleshooting encoding performance #301

Open jbellis opened 5 months ago

jbellis commented 5 months ago

I'm trying to do low-level encoding so I can add the vectors to my own index:

        from colbert.infra import ColBERTConfig
        from colbert.modeling.checkpoint import Checkpoint
        from colbert.indexing.collection_encoder import CollectionEncoder

        cf = ColBERTConfig(checkpoint='checkpoints/colbertv2.0')
        cp = Checkpoint(cf.checkpoint, colbert_config=cf)
        encoder = CollectionEncoder(cf, cp)
        passages = ...
        encoder.encode_passages(passages)

This works, but it is slow, and nvidia-smi reports the GPU as almost entirely idle (1%-5% utilization), even if I spin up multiple threads (each with its own encoder, of course). Is this expected?

I do see

>>> torch.cuda.is_available()
True

but that's about the extent of my troubleshooting knowledge.
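Beyond `torch.cuda.is_available()`, a few more checks can confirm which device PyTorch sees and that work actually lands on it (a sketch; device index 0 is assumed, and the block degrades gracefully on a CPU-only machine):

```python
import torch

# True only means a CUDA device is visible to PyTorch; it does not
# prove your workload is actually running on it.
available = torch.cuda.is_available()
print(available)

if available:
    # Which GPU the driver reports.
    print(torch.cuda.get_device_name(0))

    # Force a small kernel onto the GPU and wait for it to finish,
    # so nvidia-smi utilization reflects real work.
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
    torch.cuda.synchronize()
    print(y.device)  # cuda:0
```

If this shows high utilization while your encoding loop does not, the bottleneck is likely on the host side (tokenization or data transfer) rather than the model itself.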

devinbost commented 4 months ago

A few questions:

  1. Have you tried the PyTorch profiler? https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html I'd start there.

  2. How are you loading the data? It looks like your dataset is loaded from memory, but I want to confirm there's not an issue with the loading step. PyTorch has dedicated classes for this (`Dataset` and `DataLoader`).

  3. What value are you setting for `index_bsize`? You probably want to increase it until it breaks, then back it off. If data transfers are frequently going back and forth between the CPU and GPU, that will bottleneck GPU processing.
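The profiler step above can be sketched like this: wrap the encoding call in `torch.profiler.profile` and sort the summary by CPU time to see whether the host side dominates. The `encode_step` function here is a hypothetical stand-in (a CPU matmul) for the real `encoder.encode_passages(passages)` call:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def encode_step():
    # Placeholder workload; replace with encoder.encode_passages(passages)
    # when profiling the actual ColBERT encoding path.
    x = torch.randn(256, 256)
    return x @ x

# Record CPU activity always, CUDA activity only when a GPU is visible.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    encode_step()

# Sorting by total CPU time highlights host-side bottlenecks
# (tokenization, Python overhead, CPU<->GPU transfers) as opposed
# to time spent inside GPU kernels.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)
```

If ops like tensor copies or tokenization dominate the table while CUDA kernels are cheap, raising the batch size (e.g. `index_bsize` on `ColBERTConfig`) and batching transfers is the usual fix.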