Open afiaka87 opened 3 years ago
Maybe we should convert our datasets to tar.gz and work with WebDataset loaders? That might be a faster approach than optimizing reading from text and image folders, wouldn't it? But it's cool to get some extra speed out of the current approach anyway! (Y)
I agree. WebDataset is perfect for our usecase.
Added early beta support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280
Added full support for WebDatasets: https://github.com/lucidrains/DALLE-pytorch/pull/280 Does anyone want to try out the new feature or review the changes?
Edit: @robvanvolt is right
WebDataset is perfect for us - any dataset already in the format expected by the `TextImageDataset` we have now can easily be converted to a WebDataset by splitting it into ~512 MiB subdirectories and tarring up each one. Each tar is then considered a WebDataset "shard", which can be loaded efficiently. Everything in WebDataset communicates via HTTPS - even on localhost - so there is little distinction between a list of URLs pointing at archives and a list of local paths to them. This has a number of benefits for distributed training, and for dealing with massive datasets that can't possibly fit on disk all at once - e.g. Previews (6 TiB) - or at least not for most people.
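The conversion described above can be sketched with nothing but the standard library: pack each image/caption pair into a tar, starting a new shard whenever the current one would exceed the size limit. The `shard_dataset` helper, the key/`.jpg`/`.txt` naming, and the exact shard-size threshold are illustrative assumptions, not the code from the PR - but the convention of giving both members of a sample the same basename is what WebDataset-style loaders use to pair them back up.

```python
import io
import os
import tarfile
import tempfile

# Assumed shard size (~512 MiB, as suggested in the thread); tune as needed.
MAX_SHARD_BYTES = 512 * 1024 * 1024

def shard_dataset(samples, out_dir, max_shard_bytes=MAX_SHARD_BYTES):
    """Pack (key, image_bytes, caption) triples into sequential .tar shards.

    Each sample becomes two tar members sharing a basename (key.jpg /
    key.txt) - the pairing convention WebDataset-style loaders expect.
    Returns the list of shard paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    shards, shard_idx = [], 0
    tar, shard_size = None, 0

    def open_shard():
        nonlocal tar, shard_size
        path = os.path.join(out_dir, f"shard-{shard_idx:06d}.tar")
        shards.append(path)
        tar = tarfile.open(path, "w")
        shard_size = 0

    open_shard()
    for key, image_bytes, caption in samples:
        caption_bytes = caption.encode("utf-8")
        sample_size = len(image_bytes) + len(caption_bytes)
        # Roll over to a new shard once the current one would exceed the cap.
        if shard_size > 0 and shard_size + sample_size > max_shard_bytes:
            tar.close()
            shard_idx += 1
            open_shard()
        for suffix, payload in ((".jpg", image_bytes), (".txt", caption_bytes)):
            info = tarfile.TarInfo(name=key + suffix)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        shard_size += sample_size
    tar.close()
    return shards

# Tiny demo with fake 1 KB "images" and a small shard cap.
with tempfile.TemporaryDirectory() as tmp:
    fake = [(f"{i:05d}", b"\xff" * 1000, f"caption {i}") for i in range(10)]
    shard_paths = shard_dataset(fake, tmp, max_shard_bytes=3000)
    with tarfile.open(shard_paths[0]) as t:
        first_names = t.getnames()
```

The resulting `.tar` files can then be handed to a WebDataset loader as either local paths or URLs, since the format is identical in both cases.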