Open afiaka87 opened 3 years ago
Maybe we should convert our datasets to tar.gz and work with WebDataset loaders? That might be a faster approach than optimizing reading from text and image folders, wouldn't it? But it's cool to get some extra speed out of the current approach anyway! (Y)
I agree. WebDataset is perfect for our usecase.
Added early beta support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280
Added full support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280 Does anyone want to try out the new feature or review the changes?
Edit: @robvanvolt is right
WebDataset is perfect for us - any dataset already in the format expected by the `TextImageDataset` we have now can easily be converted to a WebDataset by splitting it into ~512 MiB subdirectories and tarring up each one. Each tar is then considered a WebDataset "shard", which can be loaded efficiently. Everything in WebDataset communicates via HTTPS - even on localhost - so there is little distinction between a list of URLs pointing at archives and a list of local paths to them. This has a number of benefits for distributed training, and for dealing with massive datasets that can't possibly fit on disk all at once - e.g. Previews (6 TiB) - or at least not for most people.
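The conversion described above can be sketched with nothing but the standard library: pack each image/caption pair into a tar, starting a new shard whenever the current one would exceed the size limit. The `shard_dataset` helper, the key/`.jpg`/`.txt` naming, and the exact shard-size threshold are illustrative assumptions, not the code from the PR - but the convention of giving both members of a sample the same basename is what WebDataset-style loaders use to pair them back up.

```python
import io
import os
import tarfile
import tempfile

# Assumed shard size (~512 MiB, as suggested in the thread); tune as needed.
MAX_SHARD_BYTES = 512 * 1024 * 1024

def shard_dataset(samples, out_dir, max_shard_bytes=MAX_SHARD_BYTES):
    """Pack (key, image_bytes, caption) triples into sequential .tar shards.

    Each sample becomes two tar members sharing a basename (key.jpg /
    key.txt) - the pairing convention WebDataset-style loaders expect.
    Returns the list of shard paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    shards, shard_idx = [], 0
    tar, shard_size = None, 0

    def open_shard():
        nonlocal tar, shard_size
        path = os.path.join(out_dir, f"shard-{shard_idx:06d}.tar")
        shards.append(path)
        tar = tarfile.open(path, "w")
        shard_size = 0

    open_shard()
    for key, image_bytes, caption in samples:
        caption_bytes = caption.encode("utf-8")
        sample_size = len(image_bytes) + len(caption_bytes)
        # Roll over to a new shard once the current one would exceed the cap.
        if shard_size > 0 and shard_size + sample_size > max_shard_bytes:
            tar.close()
            shard_idx += 1
            open_shard()
        for suffix, payload in ((".jpg", image_bytes), (".txt", caption_bytes)):
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        shard_size += sample_size
    tar.close()
    return shards

# Tiny demo with fake 1 KB "images" and a small shard cap.
with tempfile.TemporaryDirectory() as tmp:
    fake = [(f"{i:05d}", b"\xff" * 1000, f"caption {i}") for i in range(10)]
    shard_paths = shard_dataset(fake, tmp, max_shard_bytes=3000)
    with tarfile.open(shard_paths[0]) as t:
        first_names = t.getnames()
```

The resulting `.tar` files can then be handed to a WebDataset loader as either local paths or URLs, since the format is identical in both cases.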