great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.64k stars · 1.5k forks

[feature] substitution syntax is not working for table_name attribute of fluent_datasources (in Athena) #8333

Open hagai-arad opened 11 months ago

hagai-arad commented 11 months ago

Describe the bug

When substitution syntax (e.g. ${my_value}) is used in the table_name attribute of a data asset in great_expectations.yml, the value is not substituted from the corresponding environment variable or from uncommitted/config_variables.yml. Instead, the literal string is used as the table name when running queries against the data asset: select count(*) from ${table_name}.

To Reproduce

great_expectations.yml config:

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials

config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations,
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: true
  data_context_id: 89ae52ad-b736-43d8-a4b5-2a91d10041a5
fluent_datasources:
  my_datasource:
    type: sql
    assets:
      my_asset:
        type: table
        order_by: []
        batch_metadata: {}
        table_name: ${TABLE_PREFIX}_my_table
        schema_name:
    connection_string: awsathena+rest://@athena.${REGION}.amazonaws.com/{SCHEMA_NAME}?s3_staging_dir=s3://${S3_BUCKET_NAME}/{SCHEMA_NAME}/staging/
notebooks:
include_rendered_content:
  globally: false
  expectation_validation_result: false
  expectation_suite: false
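The substitution precedence described in the comments of the config above can be sketched in a few lines of stdlib-only Python (an illustrative stand-in, not GX's actual resolver): environment variables win over config_variables.yml entries, and a placeholder that resolves nowhere passes through as literal text, which is exactly the symptom reported for table_name.

```python
# Illustrative sketch of the substitution precedence described in the
# config comments above (assumed behavior, not GX's actual resolver):
# environment variables take precedence over config_variables.yml entries,
# and unresolvable placeholders pass through as literal text.
import os
import re

_PATTERN = re.compile(r"\$\{(\w+)\}|\$(\w+)")

def substitute(value: str, config_variables: dict) -> str:
    """Resolve ${VAR} / $VAR placeholders: env var first, then config file."""
    def repl(match: re.Match) -> str:
        key = match.group(1) or match.group(2)
        if key in os.environ:
            return os.environ[key]
        # Fall back to the config variables file, else keep the literal text
        return str(config_variables.get(key, match.group(0)))
    return _PATTERN.sub(repl, value)

os.environ["REGION"] = "us-east-1"
os.environ.pop("TABLE_PREFIX", None)  # make sure the env var is unset

# Environment variable wins over the config-file entry:
print(substitute("athena.${REGION}.amazonaws.com", {"REGION": "eu-west-1"}))
# -> athena.us-east-1.amazonaws.com

# With no env var and no config entry, the placeholder stays literal:
print(substitute("select count(*) from ${TABLE_PREFIX}_my_table", {}))
# -> select count(*) from ${TABLE_PREFIX}_my_table
```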

The executed code:

import great_expectations as gx

# GX_ROOT_DIR is the path to the directory containing great_expectations.yml
context = gx.get_context(context_root_dir=GX_ROOT_DIR)

result = context.run_checkpoint(checkpoint_name="my_checkpoint")

Full stack trace:

An SQL execution Exception occurred.  OperationalError: "(pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)".  
Traceback: "Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
    raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1112, in resolve_metric_bundle
    res = self.execute_query(sa_query_object).fetchall()
  File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1415, in execute_query
    result = connection.execute(query)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1412, in execute
    return meth(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 483, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1635, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1844, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1984, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
    raise OperationalError(query_execution.state_change_reason)
sqlalchemy.exc.OperationalError: (pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Expected behavior

The ${TABLE_PREFIX} placeholder should have been replaced with the value stored in uncommitted/config_variables.yml or with the value of the TABLE_PREFIX environment variable. Substitution works everywhere else in my great_expectations.yml file (e.g. in connection_string).

Environment (please complete the following information):

austiezr commented 11 months ago

Hey @hagai-arad ! Thanks for raising this. We'll investigate what may be happening here.

Kilo59 commented 11 months ago

This is working as intended. We don't support config substitution for arbitrary config fields in the Fluent Datasources; we only support it for fields like passwords/credentials, connection_string, etc.

I don't think this is well-documented.

This might be something we could add support for as part of a new feature request.

hagai-arad commented 11 months ago

@Kilo59 Thanks. Should I open a new feature request? Not sure how it works

Kilo59 commented 10 months ago

@hagai-arad yes, sorry for the slow response. Alternatively, you could edit this issue and its title to make clear that it's a feature request.

It should be simple to implement. If you're interested, I can guide you through implementing it yourself.

hagai-arad commented 10 months ago

@Kilo59 Hey, I would like to get your guidance. How can we proceed from here? Thanks!

Kilo59 commented 10 months ago

Sorry again @hagai-arad I missed the notification that you responded.

Our Datasource and DataAsset classes are pydantic models. You'll need to update the annotation on the TableAsset.table_name field to Union[ConfigStr, str]; this will cause pydantic to first try to coerce the value to a ConfigStr, and only if it cannot (i.e. there's no ${} syntax) fall back to reading it as a normal string.

https://docs.pydantic.dev/1.10/usage/types/#unions

https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L837

That should be enough by itself to get your desired behavior.
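The suggested coercion order can be sketched in stdlib-only Python. ConfigStr below is a simplified, hypothetical stand-in for GX's class, and pydantic v1's left-to-right union coercion is emulated by a plain try/except rather than a real model field:

```python
# Minimal stdlib-only sketch of the coercion a Union[ConfigStr, str]
# annotation would give under pydantic v1's left-to-right union handling.
# ConfigStr here is a hypothetical stand-in, not GX's real implementation.
import re

_SUBSTITUTION = re.compile(r"\$\{(\w+)\}|\$(\w+)")

class ConfigStr(str):
    """A str that must contain ${} substitution syntax and can resolve it."""

    def __new__(cls, value: str) -> "ConfigStr":
        if not _SUBSTITUTION.search(value):
            raise ValueError("no ${} substitution syntax present")
        return super().__new__(cls, value)

    def resolve(self, variables: dict) -> str:
        # Replace each ${VAR} / $VAR with its value from the mapping
        return _SUBSTITUTION.sub(
            lambda m: str(variables[m.group(1) or m.group(2)]), self
        )

def coerce_table_name(value: str) -> str:
    """Union[ConfigStr, str]-style coercion: try ConfigStr, fall back to str."""
    try:
        return ConfigStr(value)
    except ValueError:
        return value

# A value with substitution syntax is coerced to ConfigStr and can resolve:
name = coerce_table_name("${TABLE_PREFIX}_my_table")
print(type(name).__name__)                     # -> ConfigStr
print(name.resolve({"TABLE_PREFIX": "prod"}))  # -> prod_my_table

# A plain value stays an ordinary str:
print(type(coerce_table_name("plain_table")).__name__)  # -> str
```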

slocoro commented 4 months ago

Hi @Kilo59, Is this issue still up for grabs? If so I'd be happy to give it a go.

I have a similar use case where I want to substitute the project_id in a query with a value from the config variables file:

  bigquery_datasource:
    type: sql
    assets:
      customer:
        type: query
        order_by: []
        batch_metadata: {}
        query: select * from ${project_id}.dataset.customer

From what I can tell, I would need to make the same annotation change you suggested above, here: https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L805