pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Very large dataset (BookCorpus) 12 GB in Torchtext #1008

Open thak123 opened 4 years ago

thak123 commented 4 years ago

❓ Questions and Help

Description: Hi, I am training a quick-thoughts task, dot(sent1, sent2) and dot(sent2, sent2), but my dataset is 12 GB and training throws a memory error once memory usage crosses 575 GB. I am training the system on a cluster, but it doesn't allow more than that to be used.

Is there an example of loading a dataset into memory in slices, rather than all at once?

zhangguanheng66 commented 4 years ago

Here is the function (link) that I used to load the BookCorpus dataset (we don't distribute BookCorpus because there is no publicly available link). I believe you can just load the number of files (extracted_files) that fits in your memory. And if you use DDP, you can check out a different set of files for each machine/dataloader and train the model.
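A minimal sketch of that idea (not the torchtext API): stream the extracted BookCorpus files one at a time with an `IterableDataset` so only a single file is resident in memory, and slice the file list by DDP rank so each worker trains on a different set of files. The class name `ShardedBookCorpus`, the `data_dir` argument, and the `*.txt` on-disk layout are assumptions for illustration.

```python
import glob
import os

import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset


class ShardedBookCorpus(IterableDataset):
    def __init__(self, data_dir, rank=0, world_size=1):
        # Keep only the file *paths* here; file contents are read lazily.
        files = sorted(glob.glob(os.path.join(data_dir, "*.txt")))
        # Each DDP rank sees every world_size-th file, so ranks train
        # on disjoint slices of the corpus.
        self.files = files[rank::world_size]

    def __iter__(self):
        for path in self.files:
            # One file in memory at a time; yield its sentences and move on.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield line


if __name__ == "__main__":
    rank = dist.get_rank() if dist.is_initialized() else 0
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    dataset = ShardedBookCorpus("bookcorpus/extracted", rank, world_size)
    loader = DataLoader(dataset, batch_size=32, collate_fn=list)
    for batch in loader:
        pass  # tokenize / train on `batch` here
```

With this layout, peak memory is bounded by the largest single file plus one batch, and growing the cluster only changes how the file list is split across ranks.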

thak123 commented 3 years ago

Thanks for the link