great-expectations / great_expectations


EphemeralDataContext does not load fluent datasources correctly #9283

Closed leotrs closed 2 months ago

leotrs commented 9 months ago

Describe the bug

EphemeralDataContext does not load fluent datasources correctly (as part of project_config).

To Reproduce

Use this code to reproduce:

import great_expectations as gx

# Create two different contexts using THE SAME config
file_ctx = gx.data_context.FileDataContext.create(<path_to_great_expectations_yml_file>)
ephm_ctx = gx.data_context.EphemeralDataContext(project_config=file_ctx.config)

# Compare their datasources...
print(f"{file_ctx.datasources=}")
print(f"{ephm_ctx.datasources=}")

At the bottom of this note you can see the example great_expectations.yml configuration file I am using. The above code shows:

file_ctx.datasources={'my_bq_source': SQLDatasource(type='sql', name='my_bq_source', id=None, assets=[TableAsset(name='table_asset', type='table', id=None, order_by=[], batch_metadata={}, splitter=None, table_name='taxi_zone_geom', schema_name=None)], connection_string='bigquery://bigquery-public-data/new_york_taxi_trips', create_temp_table=False, kwargs={})}
ephm_ctx.datasources={}

As can be seen, the FileDataContext recognizes the fluent datasource as a datasource, whereas the EphemeralDataContext does not, even when its config is extracted from that of the FileDataContext!

Furthermore, I have been able to trace the problem to the fluent_config attribute:

print(f"{file_ctx.fluent_config=}")
print(f"{ephm_ctx.fluent_config=}")

Output:

file_ctx.fluent_config=GxConfig(fluent_datasources=[SQLDatasource(type='sql', name='my_bq_source', id=None, assets=[TableAsset(name='table_asset', type='table', id=None, order_by=[], batch_metadata={}, splitter=None, table_name='taxi_zone_geom', schema_name=None)], connection_string='bigquery://bigquery-public-data/new_york_taxi_trips', create_temp_table=False, kwargs={})])
ephm_ctx.fluent_config=GxConfig(fluent_datasources=[])

The reason for this discrepancy seems to be the _load_fluent_config method, which is called as part of the base class's constructor. FileDataContext overrides it, while EphemeralDataContext does not and simply falls back to the parent class's implementation, AbstractDataContext._load_fluent_config. However, that method is essentially unimplemented and simply returns an empty container.

That is to say, only FileDataContext attempts to ingest the fluent datasources, while any other class that does not override _load_fluent_config will miss them.
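The pattern described above can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins written for this note, not GX's actual implementation; they only mimic the structure I believe is at play: a loader hook called from the base constructor, a default that returns an empty container, and one subclass that overrides it.

```python
from dataclasses import dataclass, field


@dataclass
class SketchFluentConfig:
    """Hypothetical stand-in for GxConfig: holds only datasource names."""
    fluent_datasources: list = field(default_factory=list)


class SketchAbstractContext:
    """Hypothetical stand-in for AbstractDataContext."""

    def __init__(self, project_config: dict):
        self.config = project_config
        # The loader hook is invoked from the base class constructor.
        self.fluent_config = self._load_fluent_config()

    def _load_fluent_config(self) -> SketchFluentConfig:
        # Default hook: ignores project_config and returns an empty container.
        return SketchFluentConfig()


class SketchFileContext(SketchAbstractContext):
    """Overrides the hook, so fluent datasources are picked up."""

    def _load_fluent_config(self) -> SketchFluentConfig:
        names = list(self.config.get("fluent_datasources", {}))
        return SketchFluentConfig(fluent_datasources=names)


class SketchEphemeralContext(SketchAbstractContext):
    """Does not override the hook, so it inherits the empty default."""


# Both contexts receive the same (made-up) project config.
project_config = {"fluent_datasources": {"my_bq_source": {"type": "sql"}}}
file_ctx = SketchFileContext(project_config)
ephm_ctx = SketchEphemeralContext(project_config)

print(file_ctx.fluent_config.fluent_datasources)  # ['my_bq_source']
print(ephm_ctx.fluent_config.fluent_datasources)  # []
```

In this sketch the same config yields different results purely because of which class overrides the hook, which matches the symptom reported here.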

Expected behavior

I would expect the EphemeralDataContext to process the given configuration in exactly the same way as the FileDataContext, namely the fluent datasource should be recognized.

Environment (please complete the following information):

Additional context

Contents of the great_expectations.yml configuration file:

```
# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials
config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
# plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store_GCS:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleGCSStoreBackend
      project: bigquery-public-data
      bucket: some-bucket
      prefix: expectations

  validations_store_GCS:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleGCSStoreBackend
      project: bigquery-public-data
      bucket: some-bucket
      prefix: validations

  evaluation_parameter_store:
    class_name: EvaluationParameterStore

  checkpoint_store_GCS:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleGCSStoreBackend
      project: bigquery-public-data
      bucket: some-bucket
      prefix: checkpoints

expectations_store_name: expectations_store_GCS
validations_store_name: validations_store_GCS
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store_GCS

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. The are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  data_context_id: 00000000-0000-0000-0000-00000000e003
  enabled: true

fluent_datasources:
  my_bq_source:
    type: sql
    assets:
      table_asset:
        type: table
        order_by: []
        batch_metadata: {}
        table_name: taxi_zone_geom
        schema_name:
    connection_string: bigquery://bigquery-public-data/new_york_taxi_trips

notebooks:
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false
plugins_directory:
```

NOTE: The code snippets above will also output some warnings, related to the fact that the configuration file contains dummy names for GCS buckets. These are irrelevant to the present issue. I have tested this code with actual GCS buckets: the warnings go away, but the problem persists.

molliemarie commented 2 months ago

Hello @leotrs. With the launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗