Open D-X-Y opened 5 years ago
I think that if you have already worked on building an LMDB out of the individual ImageNet files, then writing a custom dataset for it should be straightforward, right?
If we already have an LMDB file, yes. Would it be possible to integrate building the LMDB into the constructor of an ImageNet-LMDB dataset class?
We could.
But LMDB also has some downsides, like https://github.com/pytorch/vision/issues/619. I'm not sure we want to encourage its use, at least not for now.
@fmassa Thanks for your reply. Do you have any recommendations for a storage format for ImageNet?
Doesn't the current ImageNet format (unzipped images) work for you?
I'm using a system that cannot handle many small files (such as 1M PNG images). Therefore I cannot use the raw ImageNet images; I have to store the whole dataset in one or a few large files.
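The "one large file instead of a million small ones" idea doesn't strictly require LMDB. As a minimal stdlib sketch (the names `pack_images` and `ZipImageReader` are illustrative, not torchvision API), many small image files can be packed into a single ZIP archive and read back by index:

```python
import zipfile

def pack_images(archive_path, images):
    """images: dict mapping file name -> raw bytes."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        for name, data in images.items():
            zf.writestr(name, data)

class ZipImageReader:
    """Random access to images stored in one archive file."""
    def __init__(self, archive_path):
        self.zf = zipfile.ZipFile(archive_path, "r")
        # namelist() preserves the order in which members were written
        self.names = self.zf.namelist()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        # Returns the raw file bytes; decoding (e.g. PIL) is up to the caller
        return self.zf.read(self.names[index])
```

Such a reader can then be wrapped in a `torch.utils.data.Dataset` whose `__getitem__` decodes the bytes into an image.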
Something that helped me with ImageNet data loading was using TensorFlow Dataset's from_numpy method, along with the TFDS prepackaged ImageNet dataset in TFRecord format. It wasn't ideal and there was room for further optimization for PyTorch ingestion, but it sped up data loading a lot on a machine with only an HDD available.
I came across the same problem: too many small files. I need to find a way to speed up data loading. I tried LMDB, but the lsun.py LMDB example isn't working for me. @RicCu could you share more tips on combining TFDS with a DataLoader?
Hi, I haven't worked much more on getting tfds to play well with DataLoaders, but you might want to take a look at @vahidk's tfrecords reader. It has nice interop with PyTorch's DataLoader without depending on TF.
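For intuition, a TFRecord file is essentially a flat stream of length-prefixed records, which is why a single file can hold the whole dataset. The sketch below is a simplified stdlib illustration of that layout, not the actual TFRecord format (real TFRecords also carry CRC32C checksums on the length and the payload, omitted here):

```python
import struct

def write_records(path, records):
    """Append each record as an 8-byte little-endian length prefix + payload."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack("<Q", len(rec)))
            f.write(rec)

def read_records(path):
    """Scan the file sequentially, recovering each payload from its prefix."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            (length,) = struct.unpack("<Q", header)
            out.append(f.read(length))
    return out
```

The sequential layout is what makes record files fast on HDDs: one large contiguous read replaces millions of seeks.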
Is it possible to support LMDB for ImageNet, as in https://github.com/pytorch/vision/blob/master/torchvision/datasets/lsun.py#L22? One benefit is that we would not need to store 1.28M small images on disk; instead, we could save the whole of ImageNet in one single file (or a few files).