uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0

In-memory cache #555

Closed abditag2 closed 3 years ago

abditag2 commented 4 years ago

This PR adds an in-memory cache to the Petastorm loader. This enables very fast data reads and removes the I/O/network bottleneck.

codecov[bot] commented 4 years ago

Codecov Report

Merging #555 (8f00aaf) into master (7377bb7) will increase coverage by 0.03%. The diff coverage is 89.28%.


@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   85.32%   85.35%   +0.03%     
==========================================
  Files          85       85              
  Lines        4933     4978      +45     
  Branches      783      790       +7     
==========================================
+ Hits         4209     4249      +40     
- Misses        584      589       +5     
  Partials      140      140              
| Impacted Files | Coverage Δ |
|---|---|
| petastorm/reader_impl/pytorch_shuffling_buffer.py | 92.80% <61.53%> (-3.63%) ↓ |
| petastorm/reader.py | 89.47% <87.50%> (+0.15%) ↑ |
| petastorm/pytorch.py | 92.22% <100.00%> (+1.49%) ↑ |


fps7806 commented 4 years ago

Can you explain what exactly is being cached? From a quick glance I couldn't figure out the difference between the two dataloaders.

abditag2 commented 4 years ago

@fps7806 Caching works by keeping the values loaded into the ShufflingBuffer in main or GPU memory instead of evicting them. This way, if the in-memory cache is enabled, the values are read only once from disk/network into memory. Each worker caches only its own shard of the data, not the entire dataset.
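
The idea can be sketched roughly like this (a minimal illustration only; the class and method names here are hypothetical and are not the actual Petastorm `ShufflingBuffer` API):

```python
import random


class InMemCacheShufflingBuffer:
    """Illustrative sketch: a shuffling buffer that retains every item it
    has seen instead of discarding items after yielding them. After the
    first pass over the data source, later epochs are served entirely
    from the in-memory cache, with no further disk/network reads."""

    def __init__(self, seed=None):
        self._cache = []                  # items are retained here, never evicted
        self._rng = random.Random(seed)   # shuffle order still varies per epoch

    def add_many(self, items):
        # First-epoch path: items arrive from disk/network and are cached.
        self._cache.extend(items)

    def retrieve_epoch(self):
        # Shuffle the cached items and yield them; no I/O is needed.
        order = list(range(len(self._cache)))
        self._rng.shuffle(order)
        for i in order:
            yield self._cache[i]


# First epoch: this worker's shard is read once from the (simulated) source.
buf = InMemCacheShufflingBuffer(seed=0)
buf.add_many([10, 20, 30, 40])

# Subsequent epochs reuse the cache: same items, reshuffled order.
epoch1 = list(buf.retrieve_epoch())
epoch2 = list(buf.retrieve_epoch())
assert sorted(epoch1) == sorted(epoch2) == [10, 20, 30, 40]
```

Note that, as described above, each worker would hold only its own shard in such a cache, so total memory use is split across workers rather than duplicated.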