uber / petastorm

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Apache License 2.0
1.78k stars 285 forks source link

Use highest available pickle protocol when serializing #737

Closed rbetz closed 2 years ago

rbetz commented 2 years ago

Pickle protocol 5 allows for zero-copy pickling of numpy arrays, but protocol version 4 remains the default for Python 3.8 onwards. When loading datasets with large numpy arrays and reader_pool_type="process", using protocol 5 significantly improves serialization time and overall performance.

Protocol 5 is only available in Python 3.8 and onwards, so for backwards and forwards compatibility with future pickle protocol improvements we specify pickle.HIGHEST_PROTOCOL.

Small test case

import time
import numpy as np
from petastorm.reader_impl.pickle_serializer import PickleSerializer

s = PickleSerializer()
data = [np.random.random((1000, 1000)) for _ in range(10)]
t = time.time()
for x in range(100):
  _ = s.deserialize(s.serialize(data))
print(time.time() - t)

master: 11.40 sec

This PR: 8.78 sec

CLAassistant commented 2 years ago

CLA assistant check
All committers have signed the CLA.

selitvin commented 2 years ago

The change looks good to me. Can you please update docs/release-notes.rst and accept the CLA? Once we land it, I can cut a new release.

codecov[bot] commented 2 years ago

Codecov Report

Merging #737 (5a73d49) into master (76a36fe) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #737   +/-   ##
=======================================
  Coverage   86.27%   86.27%           
=======================================
  Files          85       85           
  Lines        5084     5084           
  Branches      787      787           
=======================================
  Hits         4386     4386           
  Misses        559      559           
  Partials      139      139           
Impacted Files Coverage Δ
petastorm/reader_impl/pickle_serializer.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 76a36fe...5a73d49. Read the comment docs.

rbetz commented 2 years ago

bump... Is there something I need to do to get the unit tests to run?

selitvin commented 2 years ago

bump... Is there something I need to do to get the unit tests to run?

Sorry. Did not notice you were blocked on the tests. I had to approve the run for a user's first time. Should be ok going forward. Also, note there is a flaky test (some test that has to do with reading from a partitioned dataset). Feel free to rerunning the build if you bump into it.

rbetz commented 2 years ago

@selitvin I think I need permission to run the tests again. They did all pass, except for the ones you mentioned were flaky.

selitvin commented 2 years ago

That's weird. I thought it I am expected to approve running tests only for the first time. Will see if I can reconfigure this.

rbetz commented 2 years ago

Thanks for your help. Looks like the tests passed this time, so it's ready to merge :)

selitvin commented 2 years ago

Beautiful! Thank you for the PR.