great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

No possibility to specify relative path for PandasFileSystemDatasource in great_expectations.yml #8781

Open flepknor opened 10 months ago

flepknor commented 10 months ago

Describe the bug

When using a PandasFileSystemDatasource, the relative path specified in the great_expectations.yml configuration file seems to be taken relative to the current working directory instead of the context_root_dir specified in the call to gx.get_context.

To Reproduce

import great_expectations as gx
context = gx.get_context(context_root_dir="/tmp/pytest-of-flepknor/pytest-12/test_download_flow_good0/gx_test_context")
datasource = context.get_datasource("parquet_metadata")
datasource.test_connection()

leads to

/home/flepknor/.conda/envs/gx/bin/python /home/flepknor/repos/blabla/scripts/billo_example.py
Traceback (most recent call last):
  File "/home/flepknor/repos/blabla/scripts/billo_example.py", line 4, in <module>
    datasource.test_connection()
  File "/home/flepknor/.conda/envs/gx/lib/python3.10/site-packages/great_expectations/datasource/fluent/pandas_filesystem_datasource.py", line 55, in test_connection
    raise TestConnectionError(
great_expectations.datasource.fluent.interfaces.TestConnectionError: Path: /home/flepknor/repos/blabla/scripts/metadata does not exist.
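The path in the error above sits under the script's directory, which is what you would see if a relative path were resolved against the current working directory. The following is a minimal sketch of that general pathlib behavior using only the standard library; it is not GX code, and the helper names `resolve_against_cwd` and `demo` are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def resolve_against_cwd(relative: str) -> Path:
    """Resolve a relative path the way pathlib does by default: against the CWD."""
    return Path(relative).resolve()


def demo() -> tuple[Path, Path]:
    """Resolve the same relative path from two different working directories."""
    original_cwd = Path.cwd()
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        try:
            os.chdir(first)
            from_first = resolve_against_cwd("metadata")
            os.chdir(second)
            from_second = resolve_against_cwd("metadata")
        finally:
            # Restore the CWD before the temporary directories are cleaned up.
            os.chdir(original_cwd)
    return from_first, from_second


from_first, from_second = demo()
print(from_first)
print(from_second)
```

The two printed paths differ even though the relative input is identical, which matches the symptom: the result depends on where the script is launched from, not on where great_expectations.yml lives.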

Yaml config file contents:

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials

config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: ../data_docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

fluent_datasources:
  parquet_metadata:
    type: pandas_filesystem
    assets:
      u_metadata:
        type: parquet
        batching_regex: (?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<hour>\d{2})\.parquet
    base_directory: metadata/
notebooks:
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false

Expected behavior

I would expect the datasource to be configured to point to /tmp/pytest-of-flepknor/pytest-12/test_download_flow_good0/metadata/. This would be in line with the documentation section "Using relative paths as the base_directory of a Filesystem Data Source", which states:

"If you are using a Filesystem Data Context you can provide a path for base_directory that is relative to the folder containing your Data Context."
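To make the expected resolution concrete, here is a minimal sketch using only pathlib. It is not how GX implements this; `resolve_base_directory` and `anchor_dir` are hypothetical names, and the sketch assumes "the folder containing your Data Context" means the parent of context_root_dir, which is what yields the path expected above.

```python
from pathlib import Path


def resolve_base_directory(base_directory: str, anchor_dir: str) -> Path:
    """Hypothetical helper: anchor a relative base_directory at anchor_dir.

    Absolute paths are returned unchanged, so the current working
    directory never influences the result.
    """
    base = Path(base_directory)
    if base.is_absolute():
        return base
    return Path(anchor_dir) / base


# Assumption: the anchor is the folder that contains the Data Context.
context_root_dir = Path(
    "/tmp/pytest-of-flepknor/pytest-12/test_download_flow_good0/gx_test_context"
)
expected = resolve_base_directory("metadata/", str(context_root_dir.parent))
print(expected)
```

Until the behavior is fixed, a similar calculation could be used to produce an absolute base_directory before writing it into great_expectations.yml, at the cost of making the config less portable.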

r34ctor commented 10 months ago

Thanks for raising this @flepknor. We've captured this for review.