great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

unexpected_rows not being reported for MulticolumnMapExpectation or ColumnPairMapExpectation #6608

Closed pablofelt closed 2 weeks ago

pablofelt commented 1 year ago

**Describe the bug**
When I run a checkpoint with

"result_format": {
    "result_format": "COMPLETE",
    "include_unexpected_rows": True,
}

subclasses of MulticolumnMapExpectation and ColumnPairMapExpectation both return unexpected_rows=None instead of returning the list of rows that failed validation.

ColumnMapExpectation subclasses work fine (they return unexpected_rows=[...] as expected).

**To Reproduce**
Steps to reproduce the behavior:

  1. Create a MulticolumnMapExpectation or ColumnPairMapExpectation expectation, e.g., expect_column_pair_values_a_to_be_greater_than_b
{
  "meta": {
    "great_expectations_version": "0.15.26"
  },
  "expectation_suite_name": "simple-expectation-suite",
  "data_asset_type": null,
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
      "kwargs": {
        "column_A": "COPAYAMT",
        "column_B": "NETPAY",
        "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
        "ignore_row_if": "either_value_is_missing",
        "mostly": 1.0,
        "name": "e192b8f5-e577-4547-abf3-6dca95c1563e",
        "or_equal": false,
        "row_condition": null,
        "version": "1.0"
      }
    }
  ],
  "ge_cloud_id": null
}
  2. Run it in a checkpoint with include_unexpected_rows=true, e.g.,
    
    {
      "action_list": [
        {
          "name": "store_validation_result",
          "action": {
            "class_name": "StoreValidationResultAction"
          }
        },
        {
          "name": "store_evaluation_params",
          "action": {
            "class_name": "StoreEvaluationParametersAction"
          }
        },
        {
          "name": "update_data_docs",
          "action": {
            "class_name": "UpdateDataDocsAction",
            "site_names": []
          }
        }
      ],
      "batch_request": {},
      "class_name": "Checkpoint",
      "config_version": 1.0,
      "evaluation_parameters": {},
      "module_name": "great_expectations.checkpoint",
      "name": "checkpoint-None-66438f03-4ea6-42db-8c86-02cd012e6213",
      "profilers": [],
      "run_name_template": "%Y%m%d-%H%M%S-compiled-checkpoint-template",
      "runtime_configuration": {
        "result_format": {
          "result_format": "COMPLETE",
          "partial_unexpected_count": 20,
          "include_unexpected_rows": true
        }
      },
      "validations": []
    }


**Expected behavior**
Validation results should set `unexpected_rows` to a list of all the rows that failed validation, as documented here: https://docs.greatexpectations.io/docs/reference/expectations/result_format. Instead, we get `unexpected_rows: None`.

**Environment (please complete the following information):**
 - ubuntu:20.04 running in docker desktop on MacOS
 - Great Expectations Version: 0.15.26

**Additional context**
- When I compare the working ColumnMapExpectation with the non-working multi/pair variants, it appears that the `...unexpected_rows` metrics are computed in all cases, but the multi/pair variants just don't report them.

Here's where the (correctly functioning) ColumnMapExpectation._validate() passes `unexpected_rows` to its `_format_map_output()` call: https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/expectations/expectation.py#L2717

The corresponding line for the (not working) ColumnPairMapExpectation._validate() does not report unexpected_rows, even though a debugger shows that the `[metric.name].unexpected_rows` metric was computed and is available: https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/expectations/expectation.py#L2930

And neither does the MulticolumnMapExpectation._validate(): https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/expectations/expectation.py#L3147
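To make the suspected mechanism concrete, here is a toy model of the pattern described in the links above. It is not the GX source: `format_map_output`, `validate_forwarding`, and `validate_dropping` are hypothetical stand-ins, and the metrics dictionary is made-up sample data. It only illustrates how a metric can be computed and still surface as `None` when the caller does not forward it.

```python
from typing import Any, Dict, List, Optional


def format_map_output(
    unexpected_count: int,
    unexpected_rows: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical stand-in for a result formatter; NOT the GX function."""
    return {
        "unexpected_count": unexpected_count,
        "unexpected_rows": unexpected_rows,
    }


def validate_forwarding(metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Forwards the computed rows to the formatter."""
    return format_map_output(
        unexpected_count=metrics["unexpected_count"],
        unexpected_rows=metrics.get("unexpected_rows"),
    )


def validate_dropping(metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Omits the argument, so the formatter falls back to its default of None."""
    return format_map_output(unexpected_count=metrics["unexpected_count"])


# Made-up metrics: the rows are present in both calls below.
metrics = {
    "unexpected_count": 1,
    "unexpected_rows": [{"COPAYAMT": 5.0, "NETPAY": 20.0}],
}

print(validate_forwarding(metrics)["unexpected_rows"])  # the row list
print(validate_dropping(metrics)["unexpected_rows"])    # None
```

If this is indeed what happens, the fix would presumably amount to forwarding the already-computed metric in the pair and multicolumn paths, but that is a guess from reading the linked lines.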

I can work around this in our code, but it seemed like an easy fix on your side that you might want to know about. Thanks for maintaining this great library!
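For anyone else hitting this, one possible shape for such a workaround is to recompute the failing rows outside of GX. The sketch below is an assumption about how that could look, not the code actually used here: the helper name `unexpected_rows_a_greater_than_b` is hypothetical, rows are plain dicts to keep it dependency-free, and the sample values are invented. It mirrors the `or_equal` and `ignore_row_if="either_value_is_missing"` settings from the suite above.

```python
import math
from typing import Any, Dict, List


def _is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def unexpected_rows_a_greater_than_b(
    rows: List[Dict[str, Any]],
    column_a: str,
    column_b: str,
    or_equal: bool = False,
) -> List[Dict[str, Any]]:
    """Return the rows where A > B (or A >= B when or_equal=True) does not hold.

    Rows with a missing value in either column are skipped, mirroring
    ignore_row_if="either_value_is_missing".
    """
    failed = []
    for row in rows:
        a, b = row.get(column_a), row.get(column_b)
        if _is_missing(a) or _is_missing(b):
            continue
        passes = a >= b if or_equal else a > b
        if not passes:
            failed.append(row)
    return failed


# Invented sample data using the column names from the suite above.
rows = [
    {"COPAYAMT": 30.0, "NETPAY": 10.0},  # passes
    {"COPAYAMT": 5.0, "NETPAY": 20.0},   # fails
    {"COPAYAMT": None, "NETPAY": 20.0},  # ignored: missing value
    {"COPAYAMT": 7.0, "NETPAY": 7.0},    # fails unless or_equal=True
]

print(unexpected_rows_a_greater_than_b(rows, "COPAYAMT", "NETPAY"))
```

With a pandas DataFrame the same filter would be a boolean mask over the two columns; the dict version is used only so the sketch runs without extra dependencies.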

dctalbot commented 1 year ago

Hi, my findings are similar. Here is another way to reproduce the issue:

import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

# column map expectation
example1 = validator.expect_column_values_to_not_be_null(
    "congestion_surcharge",
    result_format={
        "result_format": "COMPLETE",
        "include_unexpected_rows": True,
    },
)

assert example1.result["unexpected_rows"] is not None

# column_pair_map_expectation
example2 = validator.expect_column_pair_values_a_to_be_greater_than_b(
    "congestion_surcharge",
    "total_amount",
    ignore_row_if="either_value_is_missing",
    mostly=1.0,
    or_equal=False,
    row_condition=None,
    result_format={
        "result_format": "COMPLETE",
        "include_unexpected_rows": True,
    },
)

assert example2.result["unexpected_rows"] is not None

# column_aggregate_expectation
example3 = validator.expect_column_mean_to_be_between(
    "vendor_id",
    min_value=0.0,
    max_value=1.0,
    result_format={
        "result_format": "COMPLETE",
        "include_unexpected_rows": True,
    },
)

assert example3.result["unexpected_rows"] is not None

checkpoint = gx.checkpoint.SimpleCheckpoint(
    name="checkpoint-None-66438f03-4ea6-42db-8c86-02cd012e6213",
    run_name_template="%Y%m%d-%H%M%S-compiled-checkpoint-template",
    data_context=context,
    validator=validator,
    runtime_configuration={
        "result_format": {
            "result_format": "COMPLETE",
            "include_unexpected_rows": True,
        }
    },
    validations=[],
)

checkpoint_result = checkpoint.run()
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]

for x in checkpoint_result.run_results[next(iter(checkpoint_result.run_results))]["validation_result"].results:
    assert x.result["unexpected_rows"] is not None, x.expectation_config["expectation_type"]

AFAIK, all of the assertions above should pass.
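Because the final loop stops at the first failing assertion, it can be handy to collect every affected expectation type in one pass. This is a minimal sketch under stated assumptions: the helper name `expectations_missing_unexpected_rows` is hypothetical, and it assumes each validation result has first been converted to a plain dict with `expectation_config` and `result` keys (for example through a `to_json_dict()`-style conversion). The sample entries are invented.

```python
from typing import Any, Dict, List


def expectations_missing_unexpected_rows(results: List[Dict[str, Any]]) -> List[str]:
    """List the expectation types whose result has no usable unexpected_rows.

    Uses .get() so that a result without the key is reported instead of
    raising KeyError.
    """
    missing = []
    for entry in results:
        if entry.get("result", {}).get("unexpected_rows") is None:
            missing.append(entry["expectation_config"]["expectation_type"])
    return missing


# Invented sample shaped like serialized validation results.
sample_results = [
    {
        "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"},
        "result": {"unexpected_rows": [{"congestion_surcharge": None}]},
    },
    {
        "expectation_config": {
            "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b"
        },
        "result": {"unexpected_rows": None},
    },
    {
        "expectation_config": {"expectation_type": "expect_column_mean_to_be_between"},
        "result": {"observed_value": 1.5},  # key absent entirely
    },
]

print(expectations_missing_unexpected_rows(sample_results))
```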

molliemarie commented 2 weeks ago

Hello @pablofelt. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click the “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗