Closed filipeo2-mck closed 3 weeks ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 83.15%. Comparing base (4df61da) to head (f011ea7). Report is 73 commits behind head on main.
~Hey @cosmicBboy , not sure why the CI broke. I don't have permissions to restart it from the failed one:~
@filipeo2-mck yeah I can manually restart these. Need to figure out why the hashes don't match... I see this fairly often with other PRs.
@NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!
LGTM!
Sorry @cosmicBboy, I missed the GitHub notification about this message. I will try to review it soon. Thanks
This looks great, @filipeo2-mck. As discussed, please add negative and positive tests with dummy data to explain the situation you are fixing.
Hi @NeerajMalhotra-QB ! The suggested test cases were added, showing the Pandera usage I'm trying to enable with this PR, along with negative test cases. A screenshot was also added to this PR description. Thanks for your suggestions 👍
Hello @cosmicBboy! Approvals were granted, happy if you can evaluate and/or merge it :) Thank you!
hey @filipeo2-mck would you mind rebasing on `main`? It should address the failing unit test
Done, @cosmicBboy , I hope that everything is OK now :) Thank you!
Looks like test is failing: https://github.com/unionai-oss/pandera/actions/runs/8815578745/job/24197845359?pr=1570#step:15:1540
You can test this locally by running the nox test:
nox -db mamba --envdir .nox-mamba -s "tests(extra='pyspark', pydantic='2.3.0', python='3.10', pandas='2.2.0')"
(You need `nox` and `mamba` installed)
Hi @cosmicBboy ! Sorry for the delay. I took a look at the CI run:
It looks like it's happening only with the Windows runners (Linux and macOS ran fine with this config):
This Windows task hung for over an hour and ended with a `HADOOP_HOME unset` error, probably an issue with the Spark installation:
I don't have a Windows machine to test it locally and, as I'm using `pytest`'s `tmp_dir` functionality to save the temporary file, I don't see what could be wrong with the PR code.
Would you mind rerunning the CI from the start once, just to make sure it's not a transient issue with the GH Windows runners, please?
Thanks for the contribution @filipeo2-mck !
Using Pandera schemas for data quality checks is great, but we need to get the PySpark DataFrame correctly loaded first, with the column types we expect to validate later. Relying on Spark's automatic `inferSchema = True` when loading data files (CSV and parquet, for example) is not reliable, so this PR tries to address this by allowing the extraction of a PySpark schema from existing Pandera schemas/models, in two ways:
- `StructType` object
- A more compact/simple DDL-like schema:
Both extractions above can be used to create or read files in Spark, as in these examples:
Creating a dataframe:
Reading an existing file:
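Specific tests for these were added, representing the most common scenarios/datatypes used. The output of the unit test `test_pyspark_read` shows the difference between reading a sample CSV file with schema inference (non-deterministic) and using the approach enabled by this PR (deterministic):
This PR tries to address both open issues: #1327 and #1434.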
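As a rough illustration of what a DDL-like schema string looks like, here is a minimal pure-Python sketch. It does not use Pandera or the methods added by this PR: the helper `columns_to_ddl`, the column names, and the types are all hypothetical, chosen only to show the shape of the string that PySpark accepts as an explicit schema.

```python
from typing import Mapping


def columns_to_ddl(columns: Mapping[str, str]) -> str:
    """Join column-name / Spark-type pairs into a DDL-style schema string.

    Hypothetical helper for illustration only; it is not part of Pandera
    or PySpark and does no quoting or validation of names and types.
    """
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


# Hypothetical columns, chosen only for illustration.
product_columns = {"id": "INT", "product": "STRING", "price": "DECIMAL(20,5)"}
ddl = columns_to_ddl(product_columns)
print(ddl)

# With an active SparkSession named `spark` (not created here), a string like
# this can be passed wherever PySpark accepts a DDL schema, for example:
#   spark.read.schema(ddl).csv("products.csv", header=True)
#   spark.createDataFrame(rows, schema=ddl)
```

Passing an explicit schema this way means the column types no longer depend on what Spark infers from the file contents, which is the determinism the PR description is after.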
shows the default behavior between reading a sample CSV file with schema inference (non-deterministic) and using the approach enabled by this PR (deterministic):This PR tries to address both open issues: #1327 and #1434.