great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.97k stars 1.54k forks source link

[BUG] Exception 'type' during validation of ExpectColumnValuesToBeOfType and ExpectColumnValuesToBeInTypeList #10395

Closed jarcpro97 closed 1 month ago

jarcpro97 commented 1 month ago

Describe the bug Creating expectation for PostgreSQL and Athena I have the next Exception with ExpectColumnValuesToBeOfType and ExpectColumnValuesToBeInTypeList

{ "success": false, "expectation_config": { "type": "expect_column_values_to_be_oftype", "kwargs": { "column": "id", "type": "INTEGER", "batch_id": "postgresql_datasource" }, "meta": {} }, "result": {}, "meta": {}, "exception_info": { "exception_traceback": "Traceback (most recent call last):\n File \"c:\Users\Administrator\Documents\datalake-ingest\venv\Lib\site-packages\great_expectations\validator\validator.py\", line 648, in graph_validate\n result = expectation.metrics_validate(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"c:\Users\Administrator\Documents\datalake-ingest\venv\Lib\site-packages\great_expectations\expectations\expectation.py\", line 1093, in metrics_validate\n expectation_validation_result: Union[ExpectationValidationResult, dict] = self._validate(\n ^^^^^^^^^^^^^^^\n File \"c:\Users\Administrator\Documents\datalake-ingest\venv\Lib\site-packages\great_expectations\expectations\core\expect_column_values_to_be_of_type.py\", line 595, in _validate\n actual_column_type = [\n ^\n File \"c:\Users\Administrator\Documents\datalake-ingest\venv\Lib\site-packages\great_expectations\expectations\core\expect_column_values_to_be_of_type.py\", line 596, in \n type_dict[\"type\"]\n ~~~~~^^^^^^^^\nKeyError: 'type'\n", "exception_message": "'type'", "raised_exception": true } To Reproduce

origin_data_source_name = "postgresql_datasource" postgresql_source = context.data_sources.add_or_update_postgres(origin_data_source_name, connection_string=connection_string)

asset_name = "postgresql_query_asset" asset_query = f'SELECT * FROM {schema_name}."{table_name}"' query_data_asset = postgresql_source.add_query_asset(query=asset_query, name=asset_name)

full_table_batch_definition = query_data_asset.add_batch_definition_whole_table( name="FULL_TABLE_ORIGIN" ) full_table_batch = full_table_batch_definition.get_batch() full_table_batch.head()

works well until here, but when I validate this expectations

expectation = gx.expectations.ExpectColumnValuesToBeOfType(column="id", type_="INTEGER") validation_results = full_table_batch.validate(expectation)

I got the exception

Expected behavior Return success true or false

Environment (please complete the following information):

adeola-ak commented 1 month ago

hi there, do you have other expectations in this same file that are passing without error?

jarcpro97 commented 1 month ago

Hi, I have those other expectations and they are working well: gx.expectations.ExpectTableRowCountToEqual(), gx.expectations.ExpectColumnMinToBeBetween(), gx.expectations.ExpectColumnMaxToBeBetween(), gx.expectations.ExpectCompoundColumnsToBeUnique()

Is only with this column type expectations

jarcpro97 commented 1 month ago

Doing other tests we found that the error does not happen with add_table_asset, only with add_query_asset. Example: asset_name = "query_asset_target" asset_query = f'SELECT * FROM {prefix_athena}{table_name} ORDER BY {column_order_by} ASC' query_data_asset_target = athena_source.add_query_asset(query=asset_query, name=asset_name) batch_definition = query_data_asset_target.add_batch_definition_whole_table( name="FULL_TABLE_TARGET" ) batch = batch_definition.get_batch()

adeola-ak commented 1 month ago

Hi there,

Thank you for sharing this additional information.

The fact that the issue doesn't occur with add_table_asset but does with query-based assets suggests that the error you’re facing could be related to the limitations of Athena support in Great Expectations. Due to these limitations, it's reasonable to expect that column type detection or certain expectation configurations may not work as seamlessly. Additionally, Athena uses a slightly different SQL dialect, and some data types may not be recognized or mapped correctly within Great Expectations.

As a workaround, you might try explicitly casting types in your query to help force Athena to return type information in a way that Great Expectations can recognize. For example:

SELECT CAST(column_name AS expected_type) FROM ...

You can probably also try running the query in Athena and store the result in a temporary table and use add_table_asset to validate against this temporary table since you didn't find any errors with add_table_asset

Because Athena is not a fluent data source, I will be closing but monitoring this issue as I am going to share this gap with the team and circle back with updates

victorgrcp commented 2 weeks ago

Hi! I'm working with a similar approach but using a Databricks SQL Warehouse as data source and query-based assets and I get the same Exception. Just to raise the problem :)

I'm using GX version 1.1.2