Closed: philgeorge999 closed this issue 2 years ago
Hey @philgeorge999! Thanks for raising this. While we have eyes on this functionality, we aren't able to prioritize work in this area at this time. We would welcome any community contribution towards this feature, and we'd be happy to offer guidance and review; any work in this area would be a great accelerator.
Hey @philgeorge999! Wanted to bump this again to say that we now support this functionality at an experimental level. There may be some inconsistencies or unexpected behavior at this point, but you should now be able to pass a row_condition, with great_expectations__experimental__ as the condition_parser. Formal documentation is still forthcoming, but a number of users in our Slack channel are now successfully implementing these row_conditions.
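For reference, a minimal example might look something like the snippet below. This is just a sketch: the column names and bounds are placeholders, and `validator` is assumed to already be a Validator backed by a Spark batch.

    # Sketch only: `validator` is assumed to be a Validator built from a Spark batch;
    # column names and bounds here are placeholders, not from a real project.
    validator.expect_column_values_to_be_between(
        column="amount",
        min_value=0,
        max_value=100,
        row_condition='col("status").notNull()',                 # evaluate only rows where "status" is present
        condition_parser="great_expectations__experimental__",   # the experimental parser mentioned above
    )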
As such, I'm closing out this issue for now; if you run into any issues with behavior & implementation, feel free to open another thread!
Hey, I'm having trouble with the row condition functionality. Consider this example:

    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column="x",
        min_value=1,
        max_value=5,
        condition_parser='great_expectations',
        row_condition='col("y").notNull()',
    )
When running this expectation on a Spark DataFrame, it throws the exception "could not resolve column 'x'".
I've double- and triple-checked, and the exception occurs even though both the 'x' and 'y' columns exist in the DataFrame.
I've followed the documentation's instructions exactly but couldn't resolve the issue.
I am applying around 400 data quality checks to a table with 30M rows and 250 columns, and around 25% of these checks apply only to a subset of rows. There is too much data to use pandas DataFrames. I have noticed that the PySpark DataFrame GE APIs do not take a row_condition argument, so I am currently having to pre-filter the DataFrame before each of these checks (sketched below), which makes the code really messy. It also means that the percentage results in the output are incorrect, since they need to be based on the full table, so a lot of custom processing is required to make use of GE.
Is it possible to have a form of row_condition implemented for the Spark APIs?
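For context, the pre-filtering workaround looks roughly like this. It is a sketch only: it assumes the legacy SparkDFDataset wrapper, an existing Spark DataFrame `df`, and the placeholder columns 'x' and 'y'.

    from pyspark.sql import functions as F
    from great_expectations.dataset import SparkDFDataset  # legacy PySpark wrapper (assumed here)

    # Pre-filter to the rows the check actually applies to, because the wrapper
    # itself has no row_condition argument.
    subset_df = df.filter(F.col("y").isNotNull())

    result = SparkDFDataset(subset_df).expect_column_values_to_be_between(
        "x", min_value=1, max_value=5
    )
    # The unexpected percentage in `result` is relative to subset_df, so it has
    # to be rescaled against df.count() to describe the full table.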