great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.92k stars 1.54k forks source link

V3 API Validation Operator Doesn't Support More Than One Table #3168

Closed jdimatteo closed 3 years ago

jdimatteo commented 3 years ago

Describe the bug When I try to run a validation operator to generate data docs with a second table, errors are raised like great_expectations.exceptions.exceptions.ExecutionEngineError: Error: The column "station_id" in BatchData does not exist.

If I only validate a single table at a time, no such error occurs, and the error appears to only occur if there are multiple tables being validated. If this is user error, please point to an example in the documentation where multiple tables are validated creating data docs using the V3 api and/or please help me fix my example detailed below.

To Reproduce

  1. git clone --branch jdimatteo/reproduce_second_table_validation_operator_error git@github.com:jdimatteo/ge_tutorials.git && cd ge_tutorials/multiple_tables/
  2. python3.9 -m venv venv && source venv/bin/activate && pip install -r requirements.txt
  3. Copy the public datasetaustin_bikeshare tables bikeshare_stations and bikeshare_trips to a google cloud project you can run queries against, e.g.
    bq mk <GCP_PROJECT_NAME>:austin_bikeshare
    bq cp bigquery-public-data:austin_bikeshare.bikeshare_stations <GCP_PROJECT_NAME>:austin_bikeshare.bikeshare_stations
    bq cp bigquery-public-data:austin_bikeshare.bikeshare_trips <GCP_PROJECT_NAME>:austin_bikeshare.bikeshare_trips
  4. Update bigquery_project in the bigquery_python_example.py to be your <GCP_PROJECT_NAME>
  5. python bigquery_python_example.py

An exception is thrown as shown in the output of the below:

(venv) jdimatteo@dimo:~/dev/ge_tutorials_fork/multiple_tables$ time python bigquery_python_example.py
2021-08-03 09:02:54,387 INFO     Deleting any prior html...
2021-08-03 09:02:54,387 INFO     Creating context...
2021-08-03 09:02:54,515 INFO     Usage statistics is disabled; skipping initialization.
2021-08-03 09:02:54,525 INFO     Adding expectation suite...
2021-08-03 09:02:54,529 INFO     Adding validation operator...
2021-08-03 09:02:54,533 INFO     Getting validator...
2021-08-03 09:02:54,538 INFO     Running expect_table_columns_to_match_set
2021-08-03 09:02:54,548 WARNING  /home/jdimatteo/.local/lib/python3.9/site-packages/google/cloud/bigquery/client.py:504: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
  warnings.warn(

expect_table_columns_to_match_set passed? True
expect_table_columns_to_match_set observed value: ['station_id', 'name', 'status', 'address', 'alternate_name', 'city_asset_number', 'property_type', 'number_of_docks', 'power_type', 'footprint_length', 'footprint_width', 'notes', 'council_district', 'modified_date']
2021-08-03 09:02:54,994 INFO     Running not null expectation...
Calculating Metrics:  25%|█████████████████████████████████████████████████████████████████▌                                                                                                                                                                                                    | 2/8 [00:00<00:00,  8.08it/s]2021-08-03 09:02:55,259 INFO     SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `bikeshare_stations` 
WHERE true) AS `great_expectations_sub_selection`
2021-08-03 09:02:55,260 INFO     [dialect bigquery+bigquery does not support caching 0.00059s] {}
Calculating Metrics:  50%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                   | 4/8 [00:02<00:02,  1.62it/s]2021-08-03 09:02:57,189 INFO     SELECT `station_id` AS `unexpected_values` 
FROM (SELECT * 
FROM `bikeshare_stations` 
WHERE true) AS `great_expectations_sub_selection` 
WHERE `station_id` IS NULL
 LIMIT %(param_1:INT64)s
2021-08-03 09:02:57,189 INFO     [dialect bigquery+bigquery does not support caching 0.00052s] {'param_1': 20}
Calculating Metrics:  88%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                | 7/8 [00:03<00:00,  1.88it/s]2021-08-03 09:02:58,578 INFO     SELECT sum(CASE WHEN (`station_id` IS NULL) THEN %(param_1:INT64)s ELSE %(param_2:INT64)s END) AS `column_values_nonnull_unexpected_count` 
FROM (SELECT * 
FROM `bikeshare_stations` 
WHERE true) AS `great_expectations_sub_selection`
2021-08-03 09:02:58,578 INFO     [dialect bigquery+bigquery does not support caching 0.00031s] {'param_1': 1, 'param_2': 0}
Calculating Metrics: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:04<00:00,  1.75it/s]
Not null expectation passed? True
2021-08-03 09:02:59,585 INFO     Running not null expectation...
Calculating Metrics:  25%|█████████████████████████████████████████████████████████████████▌                                                                                                                                                                                                    | 2/8 [00:00<00:00,  9.73it/s]2021-08-03 09:02:59,802 INFO     SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `bikeshare_trips` 
WHERE true) AS `great_expectations_sub_selection`
2021-08-03 09:02:59,802 INFO     [dialect bigquery+bigquery does not support caching 0.00033s] {}
Calculating Metrics:  50%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                   | 4/8 [00:01<00:01,  3.25it/s]2021-08-03 09:03:00,713 INFO     SELECT `trip_id` AS `unexpected_values` 
FROM (SELECT * 
FROM `bikeshare_trips` 
WHERE true) AS `great_expectations_sub_selection` 
WHERE `trip_id` IS NULL
 LIMIT %(param_1:INT64)s
2021-08-03 09:03:00,714 INFO     [dialect bigquery+bigquery does not support caching 0.00067s] {'param_1': 20}
Calculating Metrics:  88%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                                | 7/8 [00:02<00:00,  2.44it/s]2021-08-03 09:03:02,183 INFO     SELECT sum(CASE WHEN (`trip_id` IS NULL) THEN %(param_1:INT64)s ELSE %(param_2:INT64)s END) AS `column_values_nonnull_unexpected_count` 
FROM (SELECT * 
FROM `bikeshare_trips` 
WHERE true) AS `great_expectations_sub_selection`
2021-08-03 09:03:02,183 INFO     [dialect bigquery+bigquery does not support caching 0.00059s] {'param_1': 1, 'param_2': 0}
Calculating Metrics: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00,  2.32it/s]
Not null expectation passed with second table? True
2021-08-03 09:03:03,050 INFO     Running operator so data docs created...
2021-08-03 09:03:03,058 INFO     Setting run_name to: 20210803T150303.058648Z
2021-08-03 09:03:03,059 INFO        2 expectation(s) included in expectation_suite.
Calculating Metrics:  25%|█████████████████████████████████████████████████████████████████▌                                                                                                                                                                                                    | 2/8 [00:00<00:00, 10.35it/s]2021-08-03 09:03:03,277 INFO     SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `bikeshare_trips` 
WHERE true) AS `great_expectations_sub_selection`
2021-08-03 09:03:03,277 INFO     [dialect bigquery+bigquery does not support caching 0.00019s] {}
Calculating Metrics:  50%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                   | 4/8 [00:01<00:01,  3.28it/s]Traceback (most recent call last):
  File "/home/jdimatteo/dev/ge_tutorials_fork/multiple_tables/bigquery_python_example.py", line 150, in <module>
    main()
  File "/home/jdimatteo/dev/ge_tutorials_fork/multiple_tables/bigquery_python_example.py", line 146, in main
    run_expectations(context)
  File "/home/jdimatteo/dev/ge_tutorials_fork/multiple_tables/bigquery_python_example.py", line 105, in run_expectations
    context.run_validation_operator(validation_operator_name=validation_operator_name, assets_to_validate=[first_validator, second_validator])
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/core/usage_statistics/usage_statistics.py", line 262, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/data_context/data_context.py", line 1356, in run_validation_operator
    return validation_operator.run(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/validation_operators/validation_operators.py", line 326, in run
    batch_validation_result = batch.validate(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 1205, in validate
    results = self.graph_validate(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 474, in graph_validate
    metrics = self.resolve_validation_graph(graph, metrics, runtime_configuration)
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 515, in resolve_validation_graph
    self._resolve_metrics(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 559, in _resolve_metrics
    return execution_engine.resolve_metrics(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py", line 281, in resolve_metrics
    resolved_metrics[metric_to_resolve.id] = metric_fn(
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/expectations/metrics/metric_provider.py", line 57, in inner_func
    return metric_fn(*args, **kwargs)
  File "/home/jdimatteo/.local/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric.py", line 383, in inner_func
    raise ge_exceptions.ExecutionEngineError(
great_expectations.exceptions.exceptions.ExecutionEngineError: Error: The column "station_id" in BatchData does not exist.
Calculating Metrics:  50%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                                                   | 4/8 [00:01<00:01,  3.17it/s]

real    0m11.441s
user    0m2.201s
sys 0m1.608s

Expected behavior No exception raised.

Environment (please complete the following information):

jdimatteo commented 3 years ago

Also, please advise if there is a more concise / more useful way for me to describe steps to reproduce problems like this.

jdimatteo commented 3 years ago

I confirmed checkpointing works with multiple tables, and I updated my example to do that here: https://github.com/jdimatteo/ge_tutorials/blob/d77b21f27f755c097de53f7c5c2ebef803109889/multiple_tables/bigquery_python_example.py . Example output from this updated script is included in the commit message: https://github.com/jdimatteo/ge_tutorials/commit/d77b21f27f755c097de53f7c5c2ebef803109889

talagluck commented 3 years ago

Hi @jdimatteo - thanks for submitting this issue, and thank you as well for the thorough and considered steps to reproduce.

After some initial research, I've concluded that this is, in fact, a bug that happened to be revealed in the validation_operator example - though it doesn't specifically have anything to do with validation_operators. In the example with validation_operators, you are instantiating two Validators. Since both of these Validators use the same ExecutionEngine, and since Validators are very closely linked with ExecutionEngines, when you instantiate the second Validator it calls execution_engine.load_batch_data() which wipes the active_batch from the first Validator. Thus the first Validator is left with an active_batch_id but no active_batch and thus no data immediately accessible.

There are a few potential solutions here. We will be working internally to resolve this issue over the next week or so, and I will post updates here as they come.

jdimatteo commented 3 years ago

@talagluck thanks for the update!

Given that there is a bug being investigated, I closed PR #3172 (where I naively documented that validation operator just doesn't work with the V3 API).

talagluck commented 3 years ago

Hi @jdimatteo ! I wanted to update you - we have a PR open to resolve this issue. It's currently breaking a bunch of tests, so I'll be working on resolving those and getting this merged in early next week. Thanks for your help and your patience!

talagluck commented 3 years ago

Hi @jdimatteo ! Just an update - this is a little more complex than we thought. We're continuing work on this, and will update you when this is ready for merge.

talagluck commented 3 years ago

This should be resolved by #3222 ! Please let us know if anything else arises as a result of this.