great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0
9.82k stars, 1.51k forks

GE - AWS MWAA Airflow 2.0.2 integration #8133

Closed karthigai-selvan closed 1 year ago

karthigai-selvan commented 1 year ago

Describe the bug We are using GE to validate a few reports and are trying to move those validations into an existing data pipeline. We are on AWS MWAA 2.0.2, so we cannot use GreatExpectationsOperator, which requires Airflow 2.1.0+. Instead we use PythonVirtualenvOperator, passing great-expectations and the other required Python modules as its requirements. We get an error when initializing the data context with the following call:

data_context: FileDataContext = get_context(context_root_dir=context_root_dir)

context_root_dir points to the great_expectations project directory under /usr/local/airflow. We get the error below:

  File "/tmp/venvubafuq2v/lib/python3.7/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 268, in _scaffold_directories
    with open(os.path.join(base_dir, ".gitignore"), "w") as f:  # noqa: PTH118
OSError: [Errno 30] Read-only file system: '/usr/local/airflow/great_expectations/.gitignore'

I verified that the great_expectations directory already contains a .gitignore file with uncommitted/ in it. How can we skip this step when the .gitignore file already exists in context_root_dir? Since the MWAA filesystem is mostly read-only, how can we integrate GE with it?


Expected behavior When initializing the great_expectations data context, if the .gitignore file is already present, GE should not try to write it again.
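To make the expectation concrete, here is a minimal sketch written as a standalone function, not as a patch to GE. The name ensure_gitignore and its default entry are hypothetical and this is not the library's actual code; it only illustrates skipping the write when the file already exists:

```python
import os


def ensure_gitignore(base_dir: str, entry: str = "uncommitted/") -> bool:
    """Create base_dir/.gitignore only if it is missing.

    Returns True if the file was written, False if it already existed.
    Hypothetical helper for illustration; not part of great_expectations.
    """
    path = os.path.join(base_dir, ".gitignore")
    if os.path.exists(path):
        # The file is already there, so the directory is never opened
        # for writing and a read-only mount does not raise.
        return False
    with open(path, "w") as f:
        f.write(entry + "\n")
    return True
```

With a check like this, an existing project on a read-only mount would pass through the step untouched, while a fresh project would still get its .gitignore.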


HaebichanGX commented 1 year ago

Acknowledged, and thank you for sharing this information with us and raising the issue! We’ve added this to our internal backlog to review this behavior.

karthigai-selvan commented 1 year ago

We have also tested the GE integration on AWS MWAA 2.5.1 using the Airflow provider's GreatExpectationsOperator and still hit the same issue: GE tries to open a few files in write mode, but the AWS MWAA filesystem is read-only.

ivanstillfront commented 1 year ago

We've observed this issue on MWAA when we upgraded GX from 0.15 to 0.17. Here is a copy of our trace:

Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations_provider/operators/great_expectations.py", line 557, in execute
    self.data_context = ge.data_context.DataContext(context_root_dir=self.data_context_root_dir)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/data_context.py", line 170, in DataContext
    context = BaseDataContext(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 187, in BaseDataContext
    return get_context(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/util.py", line 1917, in get_context
    file_context = _get_file_context(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/util.py", line 2047, in _get_file_context
    return FileDataContext(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 61, in __init__
    self._scaffold_project()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/file_data_context.py", line 93, in _scaffold_project
    self._scaffold(
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 197, in _scaffold
    cls._scaffold_directories(gx_dir)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/great_expectations/data_context/data_context/serializable_data_context.py", line 270, in _scaffold_directories
    with open(os.path.join(base_dir, ".gitignore"), "w") as f:  # noqa: PTH118
OSError: [Errno 30] Read-only file system: '/usr/local/airflow/dags/sfg/great_expectations/.gitignore'

Maybe the operator could create the context while omitting the scaffolding? For example:

context = DataContext(context_root_dir=foo, omit_dir_scaffolding=True)

Or maybe the scaffolding process could handle a read-only filesystem more gracefully?
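A rough sketch of what "more gracefully" could mean, assuming the write is optional when the project already exists. The function name write_gitignore_tolerant is hypothetical and this is not how GX currently behaves; it only shows catching the read-only error (errno 30, EROFS) instead of letting it abort context creation:

```python
import errno
import os
import warnings


def write_gitignore_tolerant(base_dir: str, entry: str = "uncommitted/") -> bool:
    """Try to write base_dir/.gitignore; warn instead of failing on a
    read-only or permission-denied location.

    Returns True if the file was written, False if the write was skipped.
    Hypothetical helper for illustration; not part of great_expectations.
    """
    path = os.path.join(base_dir, ".gitignore")
    try:
        with open(path, "w") as f:
            f.write(entry + "\n")
    except OSError as exc:
        # EROFS is the "[Errno 30] Read-only file system" from the trace.
        if exc.errno not in (errno.EROFS, errno.EACCES):
            raise
        warnings.warn(f"Skipping scaffolding write to {path}: {exc}")
        return False
    return True
```

Any other OSError is re-raised, so genuine problems such as a missing directory would still surface.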

This is blocking us from upgrading to a more recent version of GX.
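In the meantime, one possible workaround is to copy the project to a writable location and point the context there. This is only a sketch: the helper name stage_writable_project is made up, and it assumes the worker's temp directory is writable (the virtualenv in the first traceback lives under /tmp, which suggests it is):

```python
import shutil
import stat
import tempfile
from pathlib import Path


def stage_writable_project(read_only_root: str) -> str:
    """Copy a GX project directory into a fresh temp dir and return the
    path of the copy. Hypothetical helper, not part of great_expectations.
    """
    src = Path(read_only_root)
    if not src.is_dir():
        raise FileNotFoundError(f"No project directory at {src}")

    # mkdtemp creates a new, empty directory under the default temp dir.
    staging_parent = Path(tempfile.mkdtemp(prefix="gx_project_"))
    dst = staging_parent / src.name
    shutil.copytree(src, dst)

    # copytree preserves permission bits, so add owner-write back in case
    # the source files or directories were marked read-only.
    for path in [dst, *dst.rglob("*")]:
        path.chmod(path.stat().st_mode | stat.S_IWUSR)
    return str(dst)
```

The returned path could then be passed as context_root_dir, for example get_context(context_root_dir=stage_writable_project("/usr/local/airflow/great_expectations")). Anything GX writes into the copy stays in the temp directory, so stores that need to persist would have to point somewhere else.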