uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Parallelize encoding of a single row #546

Open selitvin opened 4 years ago

selitvin commented 4 years ago

When data is written into a petastorm dataset, fields containing data that is not natively supported by the Parquet format, such as numpy arrays, are serialized into byte arrays before a pyspark sql.Row object is created. Images may be compressed using png or jpeg compression.

Serializing fields on a thread pool speeds up this process in some cases (e.g. a row contains multiple images).
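The idea can be sketched as follows. This is a minimal illustration and not the code in this pull request: `encode_field` and `encode_row` are hypothetical names, and `zlib` compression stands in for the png or jpeg codecs. It assumes the per-field encoder spends most of its time in native code that releases the GIL, which is where a thread pool is likely to help.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def encode_field(value: bytes) -> bytes:
    """Hypothetical per-field codec; zlib stands in for png/jpeg compression."""
    return zlib.compress(value, 6)


def encode_row(row: dict, pool: ThreadPoolExecutor) -> dict:
    """Encode all fields of a single row concurrently on the given pool."""
    # Submit one task per field, then collect results keyed by field name.
    futures = {name: pool.submit(encode_field, value) for name, value in row.items()}
    return {name: future.result() for name, future in futures.items()}


# A row with two large binary fields, e.g. two images.
row = {
    "image_left": bytes(1_000_000),
    "image_right": bytes(range(256)) * 4096,
}

with ThreadPoolExecutor(max_workers=4) as pool:
    encoded = encode_row(row, pool)
```

With only one expensive field per row there is nothing to overlap, so any gain would be limited to rows that carry several such fields.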

codecov[bot] commented 4 years ago

Codecov Report

Base: 82.88% // Head: 85.99% // Increases project coverage by +3.11% :tada:

Coverage data is based on head (3fe68d4) compared to base (83a02df). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   82.88%   85.99%   +3.11%
==========================================
  Files          85       87       +2
  Lines        4721     4935     +214
  Branches      744      783      +39
==========================================
+ Hits         3913     4244     +331
+ Misses        678      568     -110
+ Partials      130      123       -7
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| petastorm/unischema.py | `96.91% <100.00%> (+1.12%)` | :arrow_up: |
| petastorm/reader\_impl/pytorch\_shuffling\_buffer.py | `96.42% <0.00%> (ø)` | |
| petastorm/benchmark/dummy\_reader.py | `0.00% <0.00%> (ø)` | |
| petastorm/py\_dict\_reader\_worker.py | `95.23% <0.00%> (+0.79%)` | :arrow_up: |
| petastorm/spark/spark\_dataset\_converter.py | `91.76% <0.00%> (+1.49%)` | :arrow_up: |
| petastorm/pytorch.py | `94.21% <0.00%> (+1.53%)` | :arrow_up: |
| petastorm/arrow\_reader\_worker.py | `92.00% <0.00%> (+2.00%)` | :arrow_up: |
| petastorm/compat.py | `100.00% <0.00%> (+39.02%)` | :arrow_up: |
| ...\_dataset\_converter/tests/test\_converter\_example.py | `100.00% <0.00%> (+46.66%)` | :arrow_up: |
| examples/spark\_dataset\_converter/utils.py | `100.00% <0.00%> (+62.50%)` | :arrow_up: |
| ... and 2 more | | |


CLAassistant commented 1 year ago

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.