uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Make `make_spark_converter` support creating a converter from a saved dataframe path #787

Closed: WeichenXu123 closed this 1 year ago

WeichenXu123 commented 1 year ago

Signed-off-by: Weichen Xu weichen.xu@databricks.com

Make `make_spark_converter` support creating a converter from a saved dataframe path. In this case, we can skip materializing the Spark dataframe, a step that might be slow.
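To illustrate the idea in the description, here is a minimal sketch of the dispatch it implies: a string input is treated as the location of an already-saved dataset and used directly, while any other input has to be written out first. Every name here (`ConverterSketch`, `make_converter_sketch`, `fake_materialize`, the example URLs) is hypothetical and chosen for illustration; this is not Petastorm's implementation and it does not call Petastorm or Spark.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ConverterSketch:
    """Hypothetical stand-in for a converter: records where the Parquet data lives."""
    dataset_url: str
    materialized: bool  # True if the data had to be written out first


def make_converter_sketch(source: Any, materialize: Callable[[Any], str]) -> ConverterSketch:
    """Build a converter from a saved dataset path or from a dataframe-like object.

    `materialize` is a hypothetical callback that writes the object to Parquet
    and returns the resulting dataset URL.
    """
    if isinstance(source, str):
        # Already saved: reuse the path and skip the (possibly slow) write.
        return ConverterSketch(dataset_url=source, materialized=False)
    # In-memory data: it must be written out before it can be read back.
    return ConverterSketch(dataset_url=materialize(source), materialized=True)


# Usage with a fake materializer that only records that it was called.
calls: List[Any] = []


def fake_materialize(df: Any) -> str:
    calls.append(df)
    return "file:///tmp/cache/df-0"  # hypothetical cache location


from_path = make_converter_sketch("file:///data/saved_df", fake_materialize)
from_df = make_converter_sketch(object(), fake_materialize)
```

In this sketch the materializer runs only for the non-string input, which is the saving the description refers to.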

codecov[bot] commented 1 year ago

Codecov Report

Base: 86.25% // Head: 86.24% // Decreases project coverage by -0.01% :warning:

Coverage data is based on head (194ed0e) compared to base (42f4af9). Patch coverage: 77.77% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #787      +/-   ##
==========================================
- Coverage   86.25%   86.24%   -0.01%
==========================================
  Files          84       84
  Lines        5078     5090      +12
  Branches      785      790       +5
==========================================
+ Hits         4380     4390      +10
- Misses        559      560       +1
- Partials      139      140       +1
```

| [Impacted Files](https://codecov.io/gh/uber/petastorm/pull/787?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber) | Coverage Δ | |
|---|---|---|
| [petastorm/spark/spark\_dataset\_converter.py](https://codecov.io/gh/uber/petastorm/pull/787?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber#diff-cGV0YXN0b3JtL3NwYXJrL3NwYXJrX2RhdGFzZXRfY29udmVydGVyLnB5) | `91.00% <77.77%> (-0.34%)` | :arrow_down: |

Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=uber)


WeichenXu123 commented 1 year ago

CC @selitvin Could you take a look? Thank you!

WeichenXu123 commented 1 year ago

CC @selitvin Would you take a look? Thank you!

WeichenXu123 commented 1 year ago

@selitvin PR updated. Could you take another look? Thank you!

selitvin commented 1 year ago

@WeichenXu123: When would you like to cut a release?

WeichenXu123 commented 1 year ago

We might have follow-up updates about delta. Let me ask our team members and I will reply to you later.

WeichenXu123 commented 1 year ago

@selitvin We could have a release now. Follow-up updates are planned for Q1, after which we will need another release.