great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Databricks workflow errors on _scaffold attempts #8697

Open coreycradduck opened 10 months ago

coreycradduck commented 10 months ago

When instantiating a FileDataContext in my Databricks workflow, I receive an OSError raised during scaffolding. GX tries to write great_expectations/plugins/custom_data_docs/styles/data_docs_custom_styles.css, which already exists, and because the workflow runs from a specific Git commit rather than from the repo, the file cannot be modified. I had the same errors for custom_data_docs/renderers, views, etc. until I created .gitkeep files so that the data context would not try to set up the empty directories. If I run the .py file directly in Databricks from the repository, there is no issue, so the problem is most likely tied to GX trying to write into the commit-specific checkout.
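The .gitkeep step described above could be scripted roughly as follows and run once in a writable checkout before committing. This is only a sketch: the helper name `add_gitkeep_files` is hypothetical, and the subdirectory list covers just the three names mentioned in this thread (renderers, styles, views), so other directories may need the same treatment.

```python
from pathlib import Path


def add_gitkeep_files(plugins_dir, subdirs=("renderers", "styles", "views")):
    """Create custom_data_docs/<subdir>/.gitkeep under plugins_dir.

    Hypothetical helper: the subdirectory names are taken from this thread,
    not from an authoritative list of what GX scaffolds.
    """
    created = []
    for name in subdirs:
        directory = Path(plugins_dir) / "custom_data_docs" / name
        # Make sure the directory exists so Git has something to track.
        directory.mkdir(parents=True, exist_ok=True)
        keep_file = directory / ".gitkeep"
        keep_file.touch(exist_ok=True)
        created.append(keep_file)
    return created
```

For the layout in this issue, the call would be something like `add_gitkeep_files("great_expectations/plugins")`. It does not help with the data_docs_custom_styles.css copy itself, which is the error that remains.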

When I pin version 0.16.3, I don't have this issue, so I'm not sure what change caused it or whether there is an option to disable this automatic data docs scaffolding.

To Reproduce

config_version: 3.0
config_variables_file_path: config_variables.yml
plugins_directory: plugins/

datasources:
  pandas:
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine
    data_connectors:
      default:
        name: default
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - batch_id

stores:
  az_validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: great-expectations
      prefix: validations
      connection_string: ${GX_STORAGE_CONNECTION_STRING}

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore

  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

checkpoint_store_name: checkpoint_store
evaluation_parameter_store_name: evaluation_parameter_store
expectations_store_name: expectations_store
profiler_store_name: profiler_store
validations_store_name: az_validations_store

data_docs_sites:
  customer_data_docs:
    class_name: SiteBuilder
    show_how_to_buttons: false
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: \$web
      connection_string: ${GX_STORAGE_CONNECTION_STRING}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

notebooks:
anonymous_usage_statistics:
  data_context_id: 2bdfdfcd-6777-41e8-a97f-db4da50128c0
  enabled: false
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false

     28 logging.info("Loading Great Expectations context...")
     29 project_root_dir = Path.cwd().parent.as_posix()
---> 30 gx_context = gx.get_context(project_root_dir=project_root_dir)

... 

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/data_context/data_context/serializable_data_context.py:364, in SerializableDataContext._scaffold_custom_data_docs(cls, plugins_dir)
    357 styles_template = file_relative_path(
    358     __file__,
    359     "../../render/view/static/styles/data_docs_custom_styles_template.css",
    360 )
    361 styles_destination_path = os.path.join(  # noqa: PTH118
    362     plugins_dir, "custom_data_docs", "styles", "data_docs_custom_styles.css"
    363 )
--> 364 shutil.copyfile(styles_template, styles_destination_path)

File /usr/lib/python3.10/shutil.py:256, in copyfile(src, dst, follow_symlinks)
    254 with open(src, 'rb') as fsrc:
    255     try:
--> 256         with open(dst, 'wb') as fdst:
    257             # macOS
    258             if _HAS_FCOPYFILE:
    259                 try:

OSError: [Errno 22] Invalid argument: '/Workspace/Repos/.internal/856753602b_commits/6e9b4a83d6fc5188ceab2d4268132e3312922525/great_expectations/plugins/custom_data_docs/styles/data_docs_custom_styles.css'

Expected behavior

No error is raised.


austiezr commented 10 months ago

Hey @coreycradduck ! Thanks for reaching out. We've captured this for review. 🚀

EliLauwers commented 6 months ago

Hey @coreycradduck, by any chance did you find a solution? I have the exact same situation :'(

jchakravarthy commented 5 months ago

Any updates on this issue? We are facing it in an MWAA environment, where we cannot work around it because the dags folder is read-only.

lmcewen-helix commented 4 months ago

We are also experiencing the same issue in the MWAA environment after updating Great Expectations from 0.15.30 to 0.18.1. It appears there have been other fixes to ensure functionality in a read-only environment (https://github.com/great-expectations/great_expectations/pull/8362) - @alexsherstinsky ? This current issue is specific to data_docs_custom_styles.css. As in the original post, this appears to be the line of code causing the problem: https://github.com/great-expectations/great_expectations/blob/61901f68cc0b679dcc57d51182ac0b041d4c98d5/great_expectations/data_context/data_context/serializable_data_context.py#L342

Has anyone found a fix or are there any updates?
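For illustration, the failing `shutil.copyfile` call from the traceback could be guarded along the following lines so that an existing or unwritable destination is skipped instead of raising. This is a hypothetical helper (`copy_template_if_possible` is not a Great Expectations function) and not the library's actual or planned fix. It only sketches the kind of tolerance a read-only environment would need.

```python
import logging
import os
import shutil

logger = logging.getLogger(__name__)


def copy_template_if_possible(src, dst):
    """Copy src to dst unless dst already exists or cannot be written.

    Hypothetical sketch: returns True if the file was copied, False if the
    copy was skipped (destination present) or failed with an OSError.
    """
    if os.path.exists(dst):
        # Leave an existing file untouched instead of overwriting it.
        return False
    try:
        shutil.copyfile(src, dst)
    except OSError as exc:
        # Read-only or otherwise restricted filesystems end up here.
        logger.warning("Skipping scaffold of %s: %s", dst, exc)
        return False
    return True
```

Whether skipping the copy is acceptable depends on whether anything later relies on the stylesheet being freshly written, which this thread does not establish.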

binumon-bst commented 3 months ago

Here is a workaround that we identified. The issue is that ge_context_root_dir has strict permission settings set by the root user. To get around that, create a CustomGreatExpectation operator class on top of the existing GreatExpectationsOperator. Add a pre_execute method that moves the data from ge_context_root_dir to /tmp, and use that location as the new ge_context_root_dir when initialising the operator. I know this is not a perfect solution, just a workaround that we identified.
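The core of this workaround can be sketched without Airflow as a plain helper that copies the project directory to a writable temporary location. It copies rather than moves, since the source may be read-only. The function name `copy_context_to_writable_dir` is hypothetical, and wiring it into an operator's pre_execute method is left out because the operator's internals are not shown in this thread.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def copy_context_to_writable_dir(context_root_dir):
    """Copy a GX project directory into a fresh temp dir and return the new path.

    Hypothetical helper: assumes the source directory is readable and that the
    system temp directory (for example /tmp) is writable.
    """
    source = Path(context_root_dir)
    target = Path(tempfile.mkdtemp(prefix="gx_context_")) / source.name
    shutil.copytree(source, target)
    # copytree preserves permission bits, so add owner write permission to
    # every copied directory and file in case the source was read-only.
    for dirpath, _dirnames, filenames in os.walk(target):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            os.chmod(path, os.stat(path).st_mode | stat.S_IWUSR)
    return str(target)
```

The returned path would then be passed wherever the original directory was used, for example as ge_context_root_dir on the operator, or as the directory given to `gx.get_context(...)` in the original post. Anything GX writes to local filesystem stores lands in the temporary copy and is lost unless it is copied back or kept in a remote store.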

joaovbelchi commented 1 month ago

Hi! I've had the exact same problem: GX trying to access great_expectations/plugins/custom_data_docs/styles/data_docs_custom_styles.css with read/write permissions, in my Airflow application running on Kubernetes. I found no solution other than downgrading to the version you suggested (0.16.3). Hope we see a solution soon.