great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.62k stars 1.49k forks source link

[BUG] Unable to use query instead of a table for some core expectations #8502

Open mantasmy opened 10 months ago

mantasmy commented 10 months ago

Describe the bug

TL;DR: Query usage instead of a table isn't fully working (for Trino). Table name with catalog and schema is used at filtering in information_schema.tables & information_schema.columns query when table_name (inside these tables) is without catalog and schema. After removing catalog and schema locally it works well, but not for all expectations - fails to format query.

Long Version: There is a special handling for Trino in utils.py in order to set the table_name from query. This regex sets table_name with schema and catalog and the problem occurs when we are using this table_name to build where condition for query.

.where(
    sa.and_(
        *(
            tables_table_query.c.table_name == table_name, # we are giving catalog.shema.table_name here, but expected only table_name
            tables_table_query.c.schema_name == schema_name, # only schema here
            # also in information_schema.tables catalog column exist so it can be used

        )
    )
)

This query is executed against information_schema tables, but table_name in information_schema.tables has no catalog and schema name.

I locally removed catalog and schema and it worked, but not for all expectations...

There are expectations like expect_compound_columns_to_be_unique where we have to set column_list as tuple or list and it fails to format query properly. Inside one of the subquery levels it is limited to select only columns that are in a column_list, but then parent level (of query) tries to fetch all columns that exist in a table... and this throws error about invalid columns.

I will try to visualise it better here:

SELECT
...
              column_list_column_2,
              column_that_exist_in_source_table_1,
              column_that_exist_in_source_table_2,
              ...
              column_that_exist_in_source_table_100
            FROM
              (
                SELECT
                  *
                FROM
                  (
                    SELECT
                      column_list_column_1, # column from column_list
                      column_list_column_2 # column from column_list
                    FROM
                      catalog.schema.table # our table that we are running tests on
                    ORDER BY
                      1
                    LIMIT
                      1
                  ) AS anon_2
                WHERE
                  NOT (
                    column_list_column_1 IS NULL # column from column_list
                    AND column_list_column_2 IS NULL # column from column_list
...

This is the connector I am using:

  "default_runtime_data_connector_name": {
      "batch_identifiers": ["default_identifier_name"],
      "module_name": "great_expectations.datasource.data_connector",
      "class_name": "RuntimeDataConnector",
  },

Environment: Great Expectations Version: 0.17.7 sqlalchemy: 1.4.48 trino: 0.326.0

HaebichanGX commented 10 months ago

hey @mantasmy thanks for reaching out! We'll put this in our backlog for review.

mantasmy commented 10 months ago

Hey @HaebichanGX, is there any ETA for this please? Thanks

mantasmy commented 8 months ago

Bump

athanan commented 6 months ago

Hi @mantasmy. I also found the issue with QueryAsset validation through Trino as well. Hope it can help you. So, the GE itself cannot resolve the name of the SQLAlchemy Selectable in Trino's metadata (information_schema.*). Here is my approach (it worked~)

  1. I enabled the GE can create a temporary table when using the Trino data source. (GE forces to disable this feature in v0.18.3)
  2. I changed how Trino creates a table to view in SqlAlchemyBatchData
  3. it will enable GE can resolve the Selectable name in Trino since there is a view stored in Trino's metadata Caution: this method requires you to drop the Trino temporary view. So, you need to implement it after GE can complete its validation