pbamotra / basicml

Throwing a 'what' to everything machine learning
https://basicml.com
MIT License

performance/2019/05/18/efficiently-storing-and-retrieving-image-datasets #4

Open utterances-bot opened 4 years ago

utterances-bot commented 4 years ago

Efficiently processing large image datasets in Python | Basic Machine Learning

I have been working on Computer Vision projects for some time now, and moving from the NLP domain the first thing I realized was that image datasets are yuge! I typically process 500GiB to 1TB of data at a time while training deep learning models. Out of the box, I rely on the ImageFolder class of PyTorch, but disk reads are so slow (innit?). I was reading through open source projects to see how people efficiently process large image datasets like Places. That's how I stumbled into the LMDB store, which is the focus of this post. The tagline on the official project page justifies the benefits of using LMDB:

https://www.basicml.com/performance/2019/05/18/efficiently-storing-and-retrieving-image-datasets.html
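For readers skimming the thread: a minimal sketch, not the post's actual code, of what packing a folder of images into an LMDB store looks like with the py-lmdb bindings. `image_dir` and `db_path` are placeholder names.

```python
import os
import lmdb

image_dir = "images/"      # placeholder: folder of encoded images (e.g. JPEGs)
db_path = "images.lmdb"    # placeholder: output database path

# map_size is an upper bound on the database size; reserving 1 TiB of
# address space costs nothing until pages are actually written.
env = lmdb.open(db_path, map_size=2**40)

with env.begin(write=True) as txn:
    for i, fname in enumerate(sorted(os.listdir(image_dir))):
        with open(os.path.join(image_dir, fname), "rb") as f:
            # Store the raw encoded bytes; decode lazily at read time.
            txn.put(f"{i:08d}".encode("ascii"), f.read())

env.close()
```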

RomainSabathe commented 4 years ago

Hey, thanks for the nice tutorial. I wonder what your thoughts are on wrapper tools like https://github.com/vicolab/ml-pyxis ? They seem to avoid the boilerplate of manually handling the transactions. Also, have you found a best practice to split the data into train/val/test? The options I am considering are:

pbamotra commented 4 years ago

Thanks for pointing out ml-pyxis. I've not yet seen this repo, but in general for large datasets the most promising approach is lazy loading; LMDB is just one of the tools to achieve that. One additional thing to consider is that for practical purposes you would want to shuffle the data after every epoch (not necessarily the whole dataset though), so the implementation and associated data store should allow that. My training pipeline typically looks like: clean data -> transform data -> chunk -> create batches -> train.
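To make the lazy-loading point concrete, here is a rough sketch, mine rather than the post's, of an LMDB-backed PyTorch Dataset where per-epoch shuffling is delegated to the DataLoader. It assumes fixed-size images stored under zero-padded integer keys, as in the writer sketch above.

```python
import io
import lmdb
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor

class LMDBImageDataset(Dataset):
    def __init__(self, db_path):
        # Read-only, no lock: safe for concurrent readers.
        self.env = lmdb.open(db_path, readonly=True, lock=False)
        with self.env.begin() as txn:
            self.length = txn.stat()["entries"]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Lazy read: only the requested record is pulled from the store.
        with self.env.begin() as txn:
            buf = txn.get(f"{idx:08d}".encode("ascii"))
        return to_tensor(Image.open(io.BytesIO(buf)).convert("RGB"))

# shuffle=True reshuffles indices at the start of every epoch, which is
# exactly the per-epoch shuffling mentioned above. num_workers=0 keeps
# things simple; with workers, each worker should reopen the environment.
loader = DataLoader(LMDBImageDataset("images.lmdb"),
                    batch_size=32, shuffle=True, num_workers=0)
```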

tmoorecan commented 4 years ago

Thanks for the great tutorial. What's the license on this code in case I want to directly use some of it?

pbamotra commented 4 years ago

Added the MIT License to the post.

Ushk commented 4 years ago

Sorry for resurrecting this post a year after you wrote it!

After reading this tutorial I am trying to reconcile it with this issue: https://github.com/jnwatson/py-lmdb/issues/195 . I am finding that when I use this code for a small dataset (~1GB), the txn.get time is ~0.1x that of simply reading the data from disk (in my case images, so cv2.imread), while for a large dataset (~100GB) it is 2-5x slower, with plain read times remaining constant.

Is this something you've seen as well, and can you offer any pointers on addressing it if so?
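In case it helps to pin down the numbers, a hedged micro-benchmark sketch of the comparison described above; function and path names are mine. Note that cv2.imread decodes the image while txn.get returns raw bytes, so the two timings measure slightly different work.

```python
import time
import lmdb
import cv2

def time_lmdb_reads(db_path, keys):
    """Total seconds to fetch raw bytes for the given keys from LMDB."""
    env = lmdb.open(db_path, readonly=True, lock=False)
    start = time.perf_counter()
    with env.begin() as txn:
        for key in keys:
            _ = txn.get(key)
    env.close()
    return time.perf_counter() - start

def time_disk_reads(paths):
    """Total seconds to read and decode the same images from disk."""
    start = time.perf_counter()
    for p in paths:
        _ = cv2.imread(p)
    return time.perf_counter() - start
```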

pbamotra commented 4 years ago

Please see - https://github.com/jnwatson/py-lmdb/issues/195#issuecomment-502542464
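For anyone landing here later, and with the caveat that this is my paraphrase of the linked thread rather than a quote: py-lmdb exposes a readahead flag on lmdb.open, and disabling OS readahead is a commonly suggested fix for slow random reads once the database no longer fits in RAM. A sketch:

```python
import lmdb

# readahead=False disables OS readahead, which mostly prefetches pages
# you will never touch when access is random and the DB exceeds RAM.
env = lmdb.open("images.lmdb", readonly=True, lock=False, readahead=False)
```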