monniert / docExtractor

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
https://www.tmonnier.com/docExtractor
MIT License

[suggestion] loading the data on the fly #8

Closed — seekingdeep closed this 3 years ago

seekingdeep commented 3 years ago

A good suggestion would be an option to load the data on the fly: instead of loading all the images for training/prediction into memory at once, load the data in portions. Even though this option might increase processing time, it is clearly beneficial when dealing with a huge number of images for training/prediction, since it makes very large datasets tractable while keeping RAM usage bounded.

Examples (see the sketch below):

- `--train_on_fly_images 100`: 100 images to be loaded into RAM at a single time
- `--train_on_fly_json 10`: 10 complete JSON files to be loaded into RAM at a single time

Note: `--train_on_fly_json` would be used when training with multiple .json files; a single .json file can contain multiple images.
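A rough sketch of the chunked JSON loading this suggestion describes. The flags above and the `iter_json_chunks` helper here are hypothetical, not part of docExtractor:

```python
import json

def iter_json_chunks(json_paths, chunk_size=10):
    """Yield the parsed contents of `chunk_size` JSON files at a time,
    so at most that many files are decoded in RAM simultaneously."""
    for start in range(0, len(json_paths), chunk_size):
        chunk = []
        for path in json_paths[start:start + chunk_size]:
            with open(path) as f:
                chunk.append(json.load(f))
        yield chunk  # the previous chunk becomes garbage-collectible here
```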

monniert commented 3 years ago

Thanks for the suggestion, but I am not sure I understand: images are already loaded on the fly. During dataset initialization, only the image paths are loaded into memory (`src.datasets.segmentation`, line 45). Images are then loaded on the fly at each `__getitem__` call.
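For readers unfamiliar with this pattern, a minimal sketch of such a lazy dataset. The class name, directory glob, and fixed resize are illustrative, not the repo's actual code:

```python
import glob
from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor

class LazyImageDataset(Dataset):
    """Keeps only file paths in memory; pixels are decoded per sample."""

    def __init__(self, img_dir):
        # Only the path strings live in RAM, not the image data
        self.paths = sorted(glob.glob(f"{img_dir}/*.jpg"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The image file is read from disk only when this index is requested
        img = Image.open(self.paths[idx]).convert("RGB").resize((512, 512))
        return to_tensor(img)
```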

seekingdeep commented 3 years ago

So does it load a batch of files and train, then flush them from memory, and then load another batch? This is important because I might have a huge dataset, and how memory is utilized is very important.

monniert commented 3 years ago

Yes, that's what it does, following standard PyTorch DataLoader routines. So I'll close the issue for now; please reopen in case you meant something else.
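For reference, a minimal sketch of that routine, reusing the illustrative `LazyImageDataset` above (batch size, worker count, and the image directory are arbitrary assumptions):

```python
from torch.utils.data import DataLoader

# Each worker reads from disk only the samples needed for upcoming batches;
# a consumed batch is freed as soon as nothing references it anymore.
loader = DataLoader(LazyImageDataset("path/to/images"),
                    batch_size=8, num_workers=4, shuffle=True)

for batch in loader:  # batch: tensor of shape (batch_size, 3, 512, 512)
    pass              # training step goes here; the previous batch is freed
```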