great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.8k stars 1.51k forks source link

Validator throws unexpected `KeyError` when using `sampling` #6463

Closed KentonParton closed 2 weeks ago

KentonParton commented 1 year ago

Describe the bug When using a Validator with sampling present and an incorrect column_name is given to one of the sampling_methods, an unexpected KeyError is thrown instead of the SQLAlchemy exception psycopg2.errors.UndefinedColumn: column "p_sizee" does not exist.

To Reproduce Steps to reproduce the behaviour: Please replace connection_string, schema_name, table_name with valid values.

from great_expectations.core import ExpectationSuite
from great_expectations.core.batch import BatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, AnonymizedUsageStatisticsConfig
from great_expectations.data_context.types.base import InMemoryStoreBackendDefaults

datasource_name = "my_data_source"
connection_string = "postgresql+psycopg2://postgres:postgres@localhost:5432/postgres"
schema_name = "sample_data"
table_name = "part"

data_context_config = DataContextConfig(
    datasources={
        datasource_name: {
            "class_name": "Datasource",
            "module_name": "great_expectations.datasource",
            "execution_engine": {
                "class_name": "SqlAlchemyExecutionEngine",
                "connection_string": connection_string,
            },
            "data_connectors": {
                "default_configured_asset_data_connector_name": {
                    "class_name": "ConfiguredAssetSqlDataConnector",
                    "assets": {
                        table_name: {
                            "include_schema_name": True,
                            "schema_name": schema_name,
                            "table_name": table_name,
                        }
                    }
                },
            }
        }
    },
    store_backend_defaults=InMemoryStoreBackendDefaults(),
    concurrency={
        "enabled": True,
    },
    anonymous_usage_statistics=AnonymizedUsageStatisticsConfig(enabled=False)
)

context = BaseDataContext(project_config=data_context_config)

batch_request = BatchRequest(
    datasource_name=datasource_name,
    data_connector_name="default_configured_asset_data_connector_name",
    data_asset_name=f"{schema_name}.{table_name}",
    batch_spec_passthrough={
        "create_temp_table": False,
        "sampling_method": "sample_using_mod",
        "sampling_kwargs": {
            "column_name": "incorrect_column_name",
            "mod": "2",
            "value": "1",
        }
    },
)
suite: ExpectationSuite = context.create_expectation_suite("default", overwrite_existing=True)

validator = context.get_validator(
    batch_request=batch_request, expectation_suite=suite
)

head = validator.head()
print(head)

Expected behavior I would expect the SQLAlchemy exception psycopg2.errors.UndefinedColumn: column "p_sizee" does not exist to be raised instead of a KeyError.

Environment (please complete the following information):

austiezr commented 1 year ago

Hey @KentonParton! Thanks for reaching out. Are you able to share the complete stack trace for the key error? That will help us to better investigate how things are flowing through.

KentonParton commented 1 year ago

Hey @austiezr! Sure, here it is.

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]
Exceptions
{('table.head', 'batch_id=eb035c337dcddf2ab415c3323a06e1fa', '04166707abe073177c1dd922d3584468'): {'metric_configuration': {
  "metric_name": "table.head",
  "metric_domain_kwargs": {
    "batch_id": "eb035c337dcddf2ab415c3323a06e1fa"
  },
  "metric_domain_kwargs_id": "batch_id=eb035c337dcddf2ab415c3323a06e1fa",
  "metric_value_kwargs": {
    "n_rows": 5,
    "fetch_all": false
  },
  "metric_value_kwargs_id": "04166707abe073177c1dd922d3584468",
  "id": [
    "table.head",
    "batch_id=eb035c337dcddf2ab415c3323a06e1fa",
    "04166707abe073177c1dd922d3584468"
  ]
}, 'num_failures': 3, 'exception_info': {{'exception_traceback': 'Traceback (most recent call last):\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context\n    self.dialect.do_execute(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute\n    cursor.execute(statement, parameters)\npsycopg2.errors.UndefinedColumn: column "p_sizee" does not exist\nLINE 4: WHERE p_sizee % 2 = 1) AS great_expectations_sub_selection \n              ^\nHINT:  Perhaps you meant to reference the column "part.p_size".\n\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py", line 407, in resolve_metrics\n    resolved_metrics[metric_to_resolve.id] = metric_fn(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/expectations/metrics/metric_provider.py", line 34, in inner_func\n    return metric_fn(*args, **kwargs)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/expectations/metrics/table_metrics/table_head.py", line 139, in _sqlalchemy\n    df = pd.read_sql(sql, con=execution_engine.engine)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/pandas/io/sql.py", line 591, in read_sql\n    return pandas_sql.read_query(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/pandas/io/sql.py", line 1561, in read_query\n    result = self.execute(*args)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/pandas/io/sql.py", line 1406, in execute\n    return self.connectable.execution_options().execute(*args, **kwargs)\n  File "<string>", line 2, in execute\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/util/deprecations.py", line 401, in warned\n    return fn(*args, **kwargs)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3139, in execute\n    return connection.execute(statement, *multiparams, **params)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1289, in execute\n    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 463, in _execute_on_connection\n    return connection._execute_compiled(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1534, in _execute_compiled\n    ret = self._execute_context(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context\n    self._handle_dbapi_exception(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception\n    util.raise_(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 207, in raise_\n    raise exception\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context\n    self.dialect.do_execute(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 719, in do_execute\n    cursor.execute(statement, parameters)\nsqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedColumn) column "p_sizee" does not exist\nLINE 4: WHERE p_sizee % 2 = 1) AS great_expectations_sub_selection \n              ^\nHINT:  Perhaps you meant to reference the column "part.p_size".\n\n[SQL: SELECT * \nFROM (SELECT * \nFROM sample_data.part \nWHERE p_sizee %% 2 = 1) AS great_expectations_sub_selection \n LIMIT 5]\n(Background on this error at: https://sqlalche.me/e/14/f405)\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py", line 178, in resolve_validation_graph\n    self._execution_engine.resolve_metrics(\n  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py", line 411, in resolve_metrics\n    raise ge_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: (psycopg2.errors.UndefinedColumn) column "p_sizee" does not exist\nLINE 4: WHERE p_sizee % 2 = 1) AS great_expectations_sub_selection \n              ^\nHINT:  Perhaps you meant to reference the column "part.p_size".\n\n[SQL: SELECT * \nFROM (SELECT * \nFROM sample_data.part \nWHERE p_sizee %% 2 = 1) AS great_expectations_sub_selection \n LIMIT 5]\n(Background on this error at: https://sqlalche.me/e/14/f405)\n', 'exception_message': '(psycopg2.errors.UndefinedColumn) column "p_sizee" does not exist\nLINE 4: WHERE p_sizee % 2 = 1) AS great_expectations_sub_selection \n              ^\nHINT:  Perhaps you meant to reference the column "part.p_size".\n\n[SQL: SELECT * \nFROM (SELECT * \nFROM sample_data.part \nWHERE p_sizee %% 2 = 1) AS great_expectations_sub_selection \n LIMIT 5]\n(Background on this error at: https://sqlalche.me/e/14/f405)', 'raised_exception': True}}}}
occurred while resolving metrics.
INFO:     127.0.0.1:58214 - "POST /api/v1/datasets/sample HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/Users/me/code/data-quality/backend/app/core/sample.py", line 21, in get_dataset_sample
    sample = Runner(
  File "/Users/me/code/data-quality/backend/app/core/runner.py", line 95, in sample
    head = validator.head()
  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 348, in head
    return self._metrics_calculator.head(
  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/validator/metrics_calculator.py", line 90, in head
    data: Any = self.get_metric(
  File "/Users/me/code/data-quality/backend/.venv/lib/python3.9/site-packages/great_expectations/validator/metrics_calculator.py", line 120, in get_metric
    return self.get_metrics(
KeyError: 'table.head'
molliemarie commented 2 weeks ago

Hello @KentonParton. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗