pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[Feature Request] LMDB Dataset for ImageNet #915

Open D-X-Y opened 5 years ago

D-X-Y commented 5 years ago

Would it be possible to support LMDB for ImageNet, as is done for LSUN in https://github.com/pytorch/vision/blob/master/torchvision/datasets/lsun.py#L22? One benefit is that we would not need to store 1.28 M small images on disk; instead, we could save the whole of ImageNet in a single file (or perhaps a few files).

fmassa commented 5 years ago

I think that if you have already done the work of building an LMDB out of the individual ImageNet files, then writing a custom dataset for it should be straightforward, right?

D-X-Y commented 5 years ago

If we already have an LMDB file, yes. Would it be possible to build the LMDB inside the initializer (`__init__`) of an ImageNet-LMDB dataset class?
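A minimal sketch of what such a class might look like, assuming the third-party `lmdb` package is installed. `ImageNetLMDB`, `encode_key`, `pack_record`, and `unpack_record` are hypothetical names chosen for illustration and are not part of torchvision; the record layout (a 4-byte label followed by the encoded image bytes) is likewise an assumption.

```python
import os
import struct


def encode_key(index: int) -> bytes:
    """Zero-padded ASCII key, so LMDB's lexicographic key order matches index order."""
    return f"{index:08d}".encode("ascii")


def pack_record(label: int, image_bytes: bytes) -> bytes:
    """Assumed layout: 4-byte little-endian label, then the encoded image as-is."""
    return struct.pack("<i", label) + image_bytes


def unpack_record(record: bytes):
    (label,) = struct.unpack_from("<i", record, 0)
    return label, record[4:]


class ImageNetLMDB:
    """Hypothetical map-style dataset that builds its LMDB file on first use."""

    def __init__(self, samples, lmdb_path, map_size=1 << 40):
        # samples: list of (image_path, label) pairs,
        # e.g. the .samples attribute of torchvision's ImageFolder.
        import lmdb  # imported lazily, so the package is only needed when used

        self.lmdb_path = lmdb_path
        if not os.path.exists(lmdb_path):
            # map_size is an upper bound on the database size; 1 TiB is an arbitrary choice.
            env = lmdb.open(lmdb_path, map_size=map_size, subdir=False)
            # One write transaction for simplicity; a real build would likely
            # commit in batches and guard against interrupted builds.
            with env.begin(write=True) as txn:
                for index, (image_path, label) in enumerate(samples):
                    with open(image_path, "rb") as f:
                        txn.put(encode_key(index), pack_record(label, f.read()))
            env.close()

        env = lmdb.open(lmdb_path, subdir=False, readonly=True, lock=False)
        self.length = env.stat()["entries"]
        env.close()
        self.env = None  # opened lazily so each worker process gets its own handle

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if self.env is None:
            import lmdb

            self.env = lmdb.open(
                self.lmdb_path, subdir=False, readonly=True, lock=False, readahead=False
            )
        with self.env.begin() as txn:
            record = txn.get(encode_key(index))
        label, image_bytes = unpack_record(record)
        return image_bytes, label  # still-encoded image bytes plus integer label
```

The sketch returns the still-encoded image bytes; decoding (for example with PIL via `Image.open(io.BytesIO(image_bytes))`) and any transforms would go in `__getitem__` before the object is handed to a `DataLoader`.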

fmassa commented 5 years ago

We could.

But LMDB also has some downsides, like https://github.com/pytorch/vision/issues/619. I'm not sure if we would like to encourage its use, at least not as of now.

D-X-Y commented 5 years ago

@fmassa Thanks for your reply. Do you have any recommendations for a database format for ImageNet?

fmassa commented 5 years ago

Doesn't the current ImageNet format (unzipped images) work for you?

D-X-Y commented 5 years ago

I'm using a system that cannot handle many small files (such as 1M PNG images). Therefore, I cannot use the raw ImageNet images and have to store the whole dataset in one or a few large files.
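For a filesystem that struggles with many small files, the "one or a few large files" idea does not strictly require a database. A minimal standard-library sketch, assuming a flat layout of one binary blob plus a JSON index of offsets; `pack_files` and `PackedFileReader` are hypothetical names for illustration, not an existing API:

```python
import json
import os


def pack_files(paths, blob_path, index_path):
    """Concatenate the files into one blob and record where each one starts."""
    index = []
    with open(blob_path, "wb") as blob:
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            index.append(
                {"name": os.path.basename(path), "offset": blob.tell(), "size": len(data)}
            )
            blob.write(data)
    with open(index_path, "w") as f:
        json.dump(index, f)


class PackedFileReader:
    """Random access to the packed files by integer position."""

    def __init__(self, blob_path, index_path):
        with open(index_path) as f:
            self.index = json.load(f)
        self.blob_path = blob_path
        self._blob = None  # opened lazily, so each worker process opens its own handle

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        if self._blob is None:
            self._blob = open(self.blob_path, "rb")
        entry = self.index[i]
        self._blob.seek(entry["offset"])
        return entry["name"], self._blob.read(entry["size"])
```

Because the reader only needs `__len__` and `__getitem__`, it could be wrapped by a map-style dataset; labels would have to be stored in the index as well, which this sketch leaves out.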

RicCu commented 5 years ago

Something that kind of helped me with ImageNet data loading was using TensorFlow Datasets' `as_numpy` function, along with the TFDS prepackaged ImageNet dataset in TFRecord format. It wasn't ideal, and there was room for further optimization for PyTorch ingestion, but it did speed up data loading a ton on a machine with only HDDs available.

charles-loomai commented 4 years ago

I ran into the same problem: too many small files. I need to find a way to speed up data loading. I tried LMDB, but the LMDB example in lsun.py did not work for me. @RicCu could you share more tips on combining TFDS with a DataLoader?

RicCu commented 4 years ago

Hi, I haven't worked much more on getting TFDS to work well with DataLoaders, but you might want to take a look at @vahidk's TFRecord reader. It has nice interop with PyTorch's DataLoader without depending on TensorFlow.