Closed by allhailzzq 3 months ago
Hey, thanks for the great question! I implemented a fairly custom dataloader so we could stream large amounts of data from the cloud without storing everything on disk. The bulk of the logic is here: https://github.com/nomic-ai/contrastors/blob/main/src/contrastors/dataset/torch_loader.py#L288-L294. We sample a path from the list of datasets, stream in the next part of the batch, and then update the pointer so that next time we know where to pick up from.
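The sampling loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual contrastors implementation: the class name, the per-dataset pointer dictionary, and the toy in-memory "datasets" are all assumptions standing in for the real streaming logic.

```python
import random

class SingleSourceBatcher:
    """Sketch: fill each batch entirely from one sampled data source."""

    def __init__(self, datasets, batch_size, seed=0):
        # `datasets` maps a dataset name to its list of examples
        # (in the real loader these would be streamed cloud shards).
        self.datasets = datasets
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        # Per-dataset pointer: where to resume reading next time.
        self.offsets = {name: 0 for name in datasets}

    def next_batch(self):
        # Sample one data source; the whole batch comes from it,
        # which discourages source-specific shortcuts.
        name = self.rng.choice(list(self.datasets))
        examples = self.datasets[name]
        start = self.offsets[name]
        batch = [examples[(start + i) % len(examples)]
                 for i in range(self.batch_size)]
        # Advance the pointer so the next draw from this source
        # picks up where this batch left off.
        self.offsets[name] = (start + self.batch_size) % len(examples)
        return name, batch
```

The key property is that `next_batch` never mixes sources: it commits to one dataset first, then fills the entire batch from that dataset's stream before updating its resume pointer.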
Hi there, thanks for sharing this great repo!
From your paper, I noticed a paragraph that says:
"During training, we sample pairs from one data source at a time and fill the entire batch with samples from that single source to discourage the model from learning source-specific shortcuts."
However, when reading src/contrastors/dataset/torch_loader.py, I could not find a corresponding setting. I am just wondering if I missed something. Could you help me walk through (or point out) the part of the script that implements this batching strategy? Thanks a lot!