Closed: philgeorge999 closed this issue 2 years ago
Hey @philgeorge999! Thanks for raising this. While we have eyes on this functionality, we aren't able to prioritize work in this area at this time. We would welcome any community contribution towards this feature, and we'd be happy to offer guidance and review; any work in this area would be a great accelerator.
Hey @philgeorge999! Wanted to bump this again to say that we now support this functionality at an experimental level. There may be some inconsistencies or unexpected behavior at this point, but you should now be able to pass a row_condition, with great_expectations__experimental__ as the condition_parser. Formal documentation is still forthcoming, but a number of users in our Slack channel are now successfully implementing these row_conditions.
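For reference, a minimal example might look something like the snippet below. This is just a sketch: the column names and bounds are placeholders, and `validator` is assumed to already be a Validator backed by a Spark batch.

    # Sketch only: `validator` is assumed to be a Validator built from a Spark batch;
    # column names and bounds here are placeholders, not from a real project.
    validator.expect_column_values_to_be_between(
        column="amount",
        min_value=0,
        max_value=100,
        row_condition='col("status").notNull()',                 # evaluate only rows where "status" is present
        condition_parser="great_expectations__experimental__",   # the experimental parser mentioned above
    )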
As such, I'm closing out this issue for now; if you run into any issues with behavior & implementation, feel free to open another thread!
Hey, I'm having trouble with the row condition functionality. Consider this example:

    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column="x",
        min_value=1,
        max_value=5,
        condition_parser='great_expectations',
        row_condition='col("y").notNull()',
    )
When running this expectation on a Spark DataFrame, it throws the exception "could not resolve column 'x'".
I've double- and triple-checked, and the exception occurs even though both the 'x' and 'y' columns exist in the DataFrame.
I've followed the documentation's instructions exactly but couldn't resolve the issue.
I am applying around 400 data quality checks to a table with 30M rows and 250 columns, and around 25% of these checks apply only to a subset of rows. There is too much data to use pandas DataFrames. I have noticed that the PySpark DataFrame GE APIs do not take a row_condition argument, so I am currently having to pre-filter the DataFrame before each of these checks (sketched below), which makes the code really messy. It also means that the percentage results in the output are incorrect, since they need to be based on the full table, so a lot of custom processing is required to make use of GE.
Is it possible to have a form of row_condition implemented for the Spark APIs?
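For context, the pre-filtering workaround looks roughly like this. It is a sketch only: it assumes the legacy SparkDFDataset wrapper, an existing Spark DataFrame `df`, and the placeholder columns 'x' and 'y'.

    from pyspark.sql import functions as F
    from great_expectations.dataset import SparkDFDataset  # legacy PySpark wrapper (assumed here)

    # Pre-filter to the rows the check actually applies to, because the wrapper
    # itself has no row_condition argument.
    subset_df = df.filter(F.col("y").isNotNull())

    result = SparkDFDataset(subset_df).expect_column_values_to_be_between(
        "x", min_value=1, max_value=5
    )
    # The unexpected percentage in `result` is relative to subset_df, so it has
    # to be rescaled against df.count() to describe the full table.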