Closed jdimatteo closed 3 years ago
Also, please advise if there is a more concise / more useful way for me to describe steps to reproduce problems like this.
I confirmed checkpointing works with multiple tables, and I updated my example to do that here: https://github.com/jdimatteo/ge_tutorials/blob/d77b21f27f755c097de53f7c5c2ebef803109889/multiple_tables/bigquery_python_example.py . Example output from this updated script is included in the commit message: https://github.com/jdimatteo/ge_tutorials/commit/d77b21f27f755c097de53f7c5c2ebef803109889
Hi @jdimatteo - thanks for submitting this issue, and thank you as well for the thorough and considered steps to reproduce.
After some initial research, I've concluded that this is, in fact, a bug that happened to be revealed in the validation_operator example, though it doesn't specifically have anything to do with validation_operators. In the example with validation_operators, you are instantiating two Validators. Since both of these Validators use the same ExecutionEngine, and since Validators are very closely linked with ExecutionEngines, when you instantiate the second Validator it calls execution_engine.load_batch_data(), which wipes the active_batch from the first Validator. Thus the first Validator is left with an active_batch_id but no active_batch, and thus no data immediately accessible.
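To make the failure mode above concrete, here is a minimal toy model of it. This is not Great Expectations code: ToyExecutionEngine, ToyValidator, loaded_batches and column_exists are hypothetical names invented for illustration, and the "loading replaces the previous batch" behavior is an assumption drawn from the description above rather than from the library source.

```python
class ToyExecutionEngine:
    """Hypothetical stand-in for a shared execution engine."""

    def __init__(self):
        self.loaded_batches = {}  # batch_id -> {column name: values}

    def load_batch_data(self, batch_id, data):
        # Assumed behavior for this sketch: loading a batch replaces
        # whatever was loaded before instead of adding to it.
        self.loaded_batches = {batch_id: data}


class ToyValidator:
    """Hypothetical validator that remembers its batch id but reads data from the engine."""

    def __init__(self, engine, batch_id, data):
        self.engine = engine
        self.active_batch_id = batch_id
        engine.load_batch_data(batch_id, data)

    @property
    def active_batch(self):
        # Returns None if the engine no longer holds this validator's batch.
        return self.engine.loaded_batches.get(self.active_batch_id)

    def column_exists(self, column):
        batch = self.active_batch
        return batch is not None and column in batch


engine = ToyExecutionEngine()
stations = ToyValidator(engine, "bikeshare_stations", {"station_id": [1, 2]})
# Creating the second validator on the same engine drops the first batch.
trips = ToyValidator(engine, "bikeshare_trips", {"trip_id": [10, 11]})

print(stations.active_batch_id)                 # the id is still remembered
print(stations.active_batch)                    # but the data is gone
print(stations.column_exists("station_id"))
print(trips.column_exists("trip_id"))
```

In this sketch the first validator still has its active_batch_id, but any column lookup against it fails, which is the same shape as the "column does not exist" error reported in this issue.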
There are a few potential solutions here. We will be working internally to resolve this issue over the next week or so, and I will post updates here as they come.
@talagluck thanks for the update!
Given that there is a bug being investigated, I closed PR #3172 (where I naively documented that validation operator just doesn't work with the V3 API).
Hi @jdimatteo ! I wanted to update you - we have a PR open to resolve this issue. It's currently breaking a bunch of tests, so I'll be working on resolving those and getting this merged in early next week. Thanks for your help and your patience!
Hi @jdimatteo ! Just an update - this is a little more complex than we thought. We're continuing work on this, and will update you when this is ready for merge.
This should be resolved by #3222 ! Please let us know if anything else arises as a result of this.
Describe the bug
When I try to run a validation operator to generate data docs with a second table, errors like the following are raised:
great_expectations.exceptions.exceptions.ExecutionEngineError: Error: The column "station_id" in BatchData does not exist.
If I validate only a single table at a time, no such error occurs; the error appears to occur only when multiple tables are being validated. If this is user error, please point to an example in the documentation where multiple tables are validated creating data docs using the V3 API and/or please help me fix my example detailed below.
To Reproduce
git clone --branch jdimatteo/reproduce_second_table_validation_operator_error git@github.com:jdimatteo/ge_tutorials.git && cd ge_tutorials/multiple_tables/
python3.9 -m venv venv && source venv/bin/activate && pip install -r requirements.txt
Copy the austin_bikeshare tables bikeshare_stations and bikeshare_trips to a Google Cloud project you can run queries against.
Update bigquery_project in bigquery_python_example.py to be your <GCP_PROJECT_NAME>.
python bigquery_python_example.py
An exception is thrown, as shown in the output below:
Expected behavior
No exception raised.
Environment (please complete the following information):