nomic-ai / contrastors

Train Models Contrastively in Pytorch
Apache License 2.0

How to implement "fill the entire batch with samples from that single source" #39

Closed allhailzzq closed 3 months ago

allhailzzq commented 4 months ago

Hi there, thanks for sharing this great repo!

From your paper, I notice a paragraph says

"During training, we sample pairs from one data source at a time and fill the entire batch with samples from that single source to discourage the model from learning source-specific shortcuts."

However, reading src/contrastors/dataset/torch_loader.py, I did not find a corresponding setting, so I am wondering if I missed anything. Could you help me go through (or point out) the part of the script that implements this batching strategy? Thanks a lot!

zanussbaum commented 4 months ago

hey, thanks for the great question! I implemented a fairly custom dataloader so we could stream large amounts of data from the cloud without storing everything on disk. The bulk of the logic is here: https://github.com/nomic-ai/contrastors/blob/main/src/contrastors/dataset/torch_loader.py#L288-L294 We sample a path from the list of datasets, stream in the next part of the batch, and then update the pointer so that next time we know where to pick up from.
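For anyone skimming this thread, the sample-a-source / fill-the-batch / advance-the-pointer loop described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the repo's actual streaming dataloader: the function and variable names (`make_single_source_batches`, `pointers`, etc.) are hypothetical, and real usage streams shards from the cloud rather than indexing Python lists.

```python
import random


def make_single_source_batches(sources, batch_size, num_batches, seed=0):
    """Yield batches where every sample comes from exactly one source.

    `sources` maps a source name to its list of examples (a stand-in for
    a streamed shard). One read pointer per source tracks where the next
    batch for that source should resume.
    """
    rng = random.Random(seed)
    names = list(sources)
    pointers = {name: 0 for name in sources}  # per-source resume position

    for _ in range(num_batches):
        name = rng.choice(names)              # sample one data source
        data = sources[name]
        start = pointers[name]
        if start + batch_size > len(data):    # source exhausted: wrap around
            start = 0
        batch = data[start:start + batch_size]
        pointers[name] = start + batch_size   # update the pointer
        yield name, batch
```

Because every batch is drawn from a single source, the in-batch negatives all share that source's distribution, which is what discourages the model from using source-specific shortcuts to separate positives from negatives.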