uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

In-memory cache #555

Closed abditag2 closed 3 years ago

abditag2 commented 4 years ago

This PR adds an in-memory cache to the Petastorm loader. This enables very fast data reads and removes the I/O/network bottleneck.

codecov[bot] commented 4 years ago

Codecov Report

Merging #555 (8f00aaf) into master (7377bb7) will increase coverage by 0.03%. The diff coverage is 89.28%.


@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   85.32%   85.35%   +0.03%     
==========================================
  Files          85       85              
  Lines        4933     4978      +45     
  Branches      783      790       +7     
==========================================
+ Hits         4209     4249      +40     
- Misses        584      589       +5     
  Partials      140      140              
| Impacted Files | Coverage Δ |
|---|---|
| petastorm/reader_impl/pytorch_shuffling_buffer.py | 92.80% <61.53%> (-3.63%) ↓ |
| petastorm/reader.py | 89.47% <87.50%> (+0.15%) ↑ |
| petastorm/pytorch.py | 92.22% <100.00%> (+1.49%) ↑ |


fps7806 commented 4 years ago

Can you explain what exactly is being cached? From a quick glance I couldn't figure out the difference between the two dataloaders.

abditag2 commented 4 years ago

@fps7806 Caching works by keeping the values loaded into the ShufflingBuffer in main or GPU memory instead of evicting them. This way, if the in-memory cache is enabled, the values are read only once from disk/network into memory. Each worker caches only its own shard of the data, not the entire dataset.
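
The idea can be sketched roughly like this (a minimal illustration only; the class and method names here are hypothetical and are not the actual Petastorm `ShufflingBuffer` API):

```python
import random


class InMemCacheShufflingBuffer:
    """Illustrative sketch: a shuffling buffer that retains every item it
    has seen instead of discarding items after yielding them. After the
    first pass over the data source, later epochs are served entirely
    from the in-memory cache, with no further disk/network reads."""

    def __init__(self, seed=None):
        self._cache = []                  # items are retained here, never evicted
        self._rng = random.Random(seed)   # shuffle order still varies per epoch

    def add_many(self, items):
        # First-epoch path: items arrive from disk/network and are cached.
        self._cache.extend(items)

    def retrieve_epoch(self):
        # Shuffle the cached items and yield them; no I/O is needed.
        order = list(range(len(self._cache)))
        self._rng.shuffle(order)
        for i in order:
            yield self._cache[i]


# First epoch: this worker's shard is read once from the (simulated) source.
buf = InMemCacheShufflingBuffer(seed=0)
buf.add_many([10, 20, 30, 40])

# Subsequent epochs reuse the cache: same items, reshuffled order.
epoch1 = list(buf.retrieve_epoch())
epoch2 = list(buf.retrieve_epoch())
assert sorted(epoch1) == sorted(epoch2) == [10, 20, 30, 40]
```

Note that, as described above, each worker would hold only its own shard in such a cache, so total memory use is split across workers rather than duplicated.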