Add StructType and DDL extraction from Pandera schemas

filipeo2-mck commented 1 month ago

Using Pandera schemas for data quality checks is great, but we first need to load the PySpark DataFrame correctly, with the column types we expect to validate later. Relying on Spark's automatic inferSchema = True when loading data files (CSV and Parquet, for example) is not reliable, so this PR addresses the problem by allowing a PySpark schema to be extracted from existing Pandera schemas/models, in two ways:

A StructType object
A more compact/simple DDL-like schema:
```
binary BINARY,byte TINYINT,text STRING
```
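To make the DDL-like format concrete, here is a minimal pure-Python sketch of how such a string is laid out: comma-separated `name TYPE` pairs. The `to_ddl` helper and the `columns` mapping are hypothetical names used only for illustration; they are not the API added by this PR.

```python
# Hypothetical helper for illustration only; not part of Pandera or PySpark.
def to_ddl(columns: dict[str, str]) -> str:
    """Join column names and Spark SQL type names into a DDL-like string."""
    return ",".join(
        f"{name} {sql_type.upper()}" for name, sql_type in columns.items()
    )


# Column name -> Spark SQL type name (assumed sample data).
columns = {"binary": "binary", "byte": "tinyint", "text": "string"}
ddl = to_ddl(columns)
print(ddl)  # binary BINARY,byte TINYINT,text STRING
```

A string in this shape is what gets passed wherever Spark accepts a schema given as text.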

Both extractions above can be used to create or read files in Spark, as in these examples:

Creating a dataframe:

spark.createDataFrame([], schema)  # `schema` can be a StructType or a DDL-like string

Reading an existing file:

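from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Explicit schema: column name, column type, nullable flag
customSchema = StructType([
    StructField("IDGC", StringType(), True),
    StructField("SEARCHNAME", StringType(), True),
    StructField("PRICE", DoubleType(), True)
])
df = spark.read.load('/file.csv', format="csv", schema=customSchema)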

Specific tests for these were added, covering the most common scenarios and datatypes. The output of the unit test test_pyspark_read shows the difference between reading a sample CSV file with schema inference (non-deterministic) and using the approach enabled by this PR (deterministic).
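As a rough illustration of why an explicit schema is safer than inference, the sketch below parses a small CSV twice in plain Python. The `naive_infer` function is a toy stand-in and does not reproduce Spark's actual inference algorithm; the sample data and the `EXPLICIT` mapping are assumptions that reuse the column names from the example above.

```python
import csv
import io

# Assumed sample data; IDGC is an identifier with leading zeros.
SAMPLE = "IDGC,SEARCHNAME,PRICE\n00123,widget,9.50\n00456,gadget,12\n"


def naive_infer(value: str):
    """Toy inference: try int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


# Explicit schema: column name -> converter, playing the role of a StructType.
EXPLICIT = {"IDGC": str, "SEARCHNAME": str, "PRICE": float}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inferred = [{k: naive_infer(v) for k, v in row.items()} for row in rows]
explicit = [{k: EXPLICIT[k](v) for k, v in row.items()} for row in rows]

print(inferred[0]["IDGC"])  # 123 (leading zeros lost)
print(explicit[0]["IDGC"])  # 00123 (kept as a string)
```

With inference, the resulting types depend on whatever values happen to be in the file; with the explicit mapping, every load yields the same types.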

This PR tries to address both open issues: #1327 and #1434.

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 83.15%. Comparing base (4df61da) to head (f011ea7). Report is 73 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main    #1570       +/-   ##
===========================================
- Coverage   94.29%   83.15%   -11.14%
===========================================
  Files          91      114       +23
  Lines        7024     8505     +1481
===========================================
+ Hits         6623     7072      +449
- Misses        401     1433     +1032
```


filipeo2-mck commented 1 month ago

~Hey @cosmicBboy , not sure why the CI broke. I don't have permissions to restart it from the failed one:~

cosmicBboy commented 1 month ago

@filipeo2-mck yeah I can manually restart these. Need to figure out why the hashes don't match... I see this fairly often with other PRs.

cosmicBboy commented 1 month ago

@NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

jaskaransinghsidana commented 1 month ago

> @NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

LGTM!

NeerajMalhotra-QB commented 1 month ago

> @NeerajMalhotra-QB @jaskaransinghsidana would appreciate your review on this PR!

Sorry @cosmicBboy, I missed the GitHub notification about this message. I will try to review it soon. Thanks

NeerajMalhotra-QB commented 1 month ago

This looks great, @filipeo2-mck. As discussed, please add negative and positive tests with dummy data to illustrate the situation you are fixing.

filipeo2-mck commented 1 month ago

Hi @NeerajMalhotra-QB! The suggested test cases were added, showing the Pandera usage I'm trying to enable with this PR, along with negative test cases. A screenshot was also added to this PR description. Thanks for your suggestions 👍

filipeo2-mck commented 4 weeks ago

Hello @cosmicBboy! Approvals were granted, happy if you can evaluate and/or merge it :) Thank you!

cosmicBboy commented 3 weeks ago

hey @filipeo2-mck would you mind rebasing on main? It should address the failing unit test

filipeo2-mck commented 3 weeks ago

> hey @filipeo2-mck would you mind rebasing on main? It should address the failing unit test

Done, @cosmicBboy , I hope that everything is OK now :) Thank you!

cosmicBboy commented 3 weeks ago

Looks like a test is failing: https://github.com/unionai-oss/pandera/actions/runs/8815578745/job/24197845359?pr=1570#step:15:1540

You can test this locally by running the nox test:

nox -db mamba --envdir .nox-mamba -s "tests(extra='pyspark', pydantic='2.3.0', python='3.10', pandas='2.2.0')"

(You need nox and mamba installed)

filipeo2-mck commented 3 weeks ago

> Looks like a test is failing: https://github.com/unionai-oss/pandera/actions/runs/8815578745/job/24197845359?pr=1570#step:15:1540
>
> You can test this locally by running the nox test:
>
> nox -db mamba --envdir .nox-mamba -s "tests(extra='pyspark', pydantic='2.3.0', python='3.10', pandas='2.2.0')"
>
> (You need nox and mamba installed)

Hi @cosmicBboy! Sorry for the delay. I took a look at the CI run:

It looks like it's happening only with the Windows runners (Linux and macOS ran fine with this config):
This Windows task hung for over an hour and ended with a HADOOP_HOME unset error, probably an issue with the Spark installation:
I don't have a Windows machine to test it locally and, as I'm using pytest's tmp_dir functionality to save the temporary file, I don't see what could be wrong with the PR code.

Would you mind rerunning the CI from the start once, just to make sure it's not a transient issue with the GH Windows runners, please?
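If the Windows failure turns out to be environmental, one possible workaround is to guard the file-reading test instead of changing the PR code. The sketch below is only an assumption about how such a guard could look; `can_run_spark_file_io` is a hypothetical helper and is not part of this PR or of Pandera.

```python
import os
import sys


def can_run_spark_file_io(platform: str = sys.platform, env=os.environ) -> bool:
    """Return False on Windows when HADOOP_HOME is missing or empty."""
    if platform.startswith("win") and not env.get("HADOOP_HOME"):
        return False
    return True


# Report whether the current environment passes the guard.
print(can_run_spark_file_io())
```

A test could then be decorated with `pytest.mark.skipif(not can_run_spark_file_io(), reason="HADOOP_HOME is not set on Windows")` so the other runners keep exercising it.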

cosmicBboy commented 3 weeks ago

Thanks for the contribution @filipeo2-mck !
