great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Unable to retrieve object from TupleFilesystemStoreBackend on Azure Databricks since yesterday #9713

Closed: damczyk closed this issue 1 month ago

damczyk commented 6 months ago

Describe the bug

Since yesterday, running the GX code context.get_checkpoint(onboarding_checkpoint_name).run(run_name=onboarding_checkpoint_name, batch_request=batch_request) leads to the error

InvalidKeyError: Unable to retrieve object from TupleFilesystemStoreBackend with the following Key: /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/control_tower_ccu_datasources_godistributiondates.json
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/data_context/store/tuple_store_backend.py:320, in TupleFilesystemStoreBackend._get(self, key)
    319 try:
--> 320     with open(filepath) as infile:
    321         contents: str = infile.read().rstrip("\n")

During handling of the above exception, another exception occurred:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/data_context/store/tuple_store_backend.py:323, in TupleFilesystemStoreBackend._get(self, key)
    321         contents: str = infile.read().rstrip("\n")
    322 except FileNotFoundError:
--> 323     raise InvalidKeyError(
    324         f"Unable to retrieve object from TupleFilesystemStoreBackend with the following Key: {filepath!s}"
    325     )
    327 return contents

Everything worked until yesterday at about 4 p.m. (CET); since then I get this error. GX now tries to read the JSON for the expectation suite from a sub-directory /del/ that does not exist: /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/control_tower_ccu_datasources_godistributiondates.json

The file control_tower_ccu_datasources_godistributiondates.json is stored in the directory /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations (without /del/), and it had been read from there on all the days before.
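For context on why a /del/ segment can appear at all: as far as I can tell, TupleFilesystemStoreBackend joins the components of a store key into a relative filepath under the store's base directory, so a suite name containing a slash (or a key tuple with an extra component) resolves into a subdirectory. A minimal sketch of that mapping; the helper below is illustrative only, not GX API:

import os

def key_to_filepath(base_directory: str, key: tuple, extension: str = ".json") -> str:
    # Illustrative only: each element of the key tuple becomes one path component.
    return os.path.join(base_directory, *key) + extension

base = "/dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations"
print(key_to_filepath(base, ("control_tower_ccu_datasources_godistributiondates",)))
# .../expectations/control_tower_ccu_datasources_godistributiondates.json
print(key_to_filepath(base, ("del", "control_tower_ccu_datasources_godistributiondates")))
# .../expectations/del/control_tower_ccu_datasources_godistributiondates.json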

To Reproduce

I can only give you my context.json because I'm on Databricks using an Ephemeral Data Context. I started my Databricks cluster with the library great_expectations[azure_secrets]==0.18.12 (and also reproduced the issue on great_expectations[azure_secrets]==0.18.10).

In a Databricks notebook I import great_expectations as gx alongside other libraries and then build a dataframe called df_Staging with my data. All GX code is executed in one cell of the notebook, like this:

# Create or load Great Expectations context
context = GX_Context(context_root_dir, context_connection_string).get_gx_context()

# Create batch request
batch_request = (context
    .sources
    .add_or_update_spark(name=data_source_name)
    .add_dataframe_asset(name=data_asset_name, dataframe=df_Staging)
    .build_batch_request()
)

# Run the onboarding checkpoint
control_tower_ccu_datasources_block_checkpoint_result = context.get_checkpoint(onboarding_checkpoint_name).run(
    run_name=onboarding_checkpoint_name,
    batch_request=batch_request,
)

# Check the validation result
if control_tower_ccu_datasources_block_checkpoint_result.success:
    print("The validation succeeded")

Class GX_Context with method get_gx_context() looks like this:

import great_expectations as gx

class GX_Context:
    def __init__(self, root_dir, connection_string, filepath_prefix=""):
        self.root_dir = root_dir
        self.connection_string = connection_string
        self.filepath_prefix = filepath_prefix

        if not isinstance(self.root_dir, str) or not self.root_dir:
            raise ValueError("root_dir must be a non-empty string")
        if not isinstance(self.connection_string, str) or not self.connection_string:
            raise ValueError("connection_string must be a non-empty string")

    def get_gx_context(self) -> gx.data_context.EphemeralDataContext:
        project_config = gx.data_context.types.base.DataContextConfig(
            store_backend_defaults=gx.data_context.types.base.FilesystemStoreBackendDefaults(
                root_directory=self.root_dir
            ),
            data_docs_sites={
                "az_site": {
                    "class_name": "SiteBuilder",
                    "store_backend": {
                        "class_name": "TupleAzureBlobStoreBackend",
                        "container": r"\$web",
                        # "filepath_prefix": self.filepath_prefix,
                        "connection_string": self.connection_string,
                    },
                    "site_index_builder": {
                        "class_name": "DefaultSiteIndexBuilder",
                        "show_cta_footer": True,
                    },
                }
            },
        )

        context = gx.get_context(project_config=project_config)

        return context
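For what it's worth, FilesystemStoreBackendDefaults points the expectations, checkpoints, and validations stores at subdirectories of root_directory, so checkpoint configs are themselves persisted as files under that root. A small sketch to dump where each store resolves, assuming GX 0.18's list_stores() method:

# Sketch: inspect the store configuration the context was built with.
context = GX_Context(context_root_dir, context_connection_string).get_gx_context()
for store_config in context.list_stores():
    print(store_config)  # each store's backend settings, incl. directories under root_directory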

Expected behavior

Executing the checkpoint via context.get_checkpoint(onboarding_checkpoint_name).run(run_name=onboarding_checkpoint_name, batch_request=batch_request) should read the expectation suite from /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/control_tower_ccu_datasources_godistributiondates.json, as it did on the days before, not from /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/control_tower_ccu_datasources_godistributiondates.json, and should deliver a result instead of an error.

Environment (please complete the following information):

Operating System: Azure Databricks cluster (DBFS mount), Python 3.10
Great Expectations Version: 0.18.12 (also reproduced on 0.18.10), installed as great_expectations[azure_secrets]


damczyk commented 6 months ago

Before this error appeared, I had used the Data Assistant to create the expectation suite stored in control_tower_ccu_datasources_godistributiondates.json.

During the profiling process the code exited with an error. Unfortunately I do not remember which error it was, but it seemed to be a resource problem.

Now it seems that somewhere GX still holds a reference to the file /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/control_tower_ccu_datasources_godistributiondates.json

When I put a file named control_tower_ccu_datasources_godistributiondates.json into /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/, at least the expectations in my other notebooks can be executed again. This works as a workaround.

So the problem is now reduced to the question of how to get rid of the stale reference to /dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX/expectations/del/control_tower_ccu_datasources_godistributiondates.json. Terminating and restarting the cluster, also with different GX libraries (from great_expectations[azure_secrets]==0.18.10 to great_expectations[azure_secrets]==0.18.12), has no effect.
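Since the context is ephemeral and restarting the cluster changes nothing, the reference is most likely persisted in one of the files under the GX root, for example in the stored checkpoint config, which records the expectation_suite_name it validates against. A plain-stdlib sketch to locate it; the "del/" search string is the suspicious prefix from the error, the rest is generic:

import pathlib

# Sketch: scan every YAML/JSON file under the GX root for the stray "del/"
# prefix to find which persisted config still points at the wrong suite path.
gx_root = pathlib.Path("/dbfs/mnt/sdl/control-tower-ccu/DataQuality/GX")
for path in gx_root.rglob("*"):
    if path.is_file() and path.suffix in (".yml", ".yaml", ".json"):
        if "del/" in path.read_text(errors="ignore"):
            print(path)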

Code for executing the Data Assistant to create the expectation suite was:

try:
    # Create or load great expectations context
    context = GX_Context(context_root_dir, context_connection_string).get_gx_context()

    # Create batch request
    batch_request = (context
        .sources
        .add_or_update_spark(name=data_source_name)
        .add_dataframe_asset(name=data_asset_name, dataframe=df_cleansing)
        .build_batch_request()
    )

    # Profiler
    # Run the default onboarding profiler on the batch request
    onboarding_data_assistant_result = (context
        .assistants
        .onboarding
        .run(
            batch_request=batch_request,
            exclude_column_names=[],
            estimation="exact"
        )
    )

    # Get the expectation suite from the onboarding result
    onboarding_suite = (onboarding_data_assistant_result
        .get_expectation_suite(
            expectation_suite_name=onboarding_suite_name
        )
    )

    # Persist expectation suite with the specified suite name from above
    context.add_or_update_expectation_suite(expectation_suite=onboarding_suite)

    # Create and persist checkpoint to reuse for multiple batches
    context.add_or_update_checkpoint(
        name=onboarding_checkpoint_name,
        batch_request=batch_request,
        expectation_suite_name=onboarding_suite_name,
    )

    # Run Onboarding checkpoint
    control_tower_ccu_datasources_block_checkpoint_result = context.get_checkpoint(onboarding_checkpoint_name).run(run_name=onboarding_checkpoint_name)

    # Check the validation result
    if control_tower_ccu_datasources_block_checkpoint_result.success:
        print("The validation succeeded")
    else:
        run_results = control_tower_ccu_datasources_block_checkpoint_result["run_results"]
        first_result = run_results[list(run_results.keys())[0]]
        dbutils.notebook.exit("The validation failed : " + first_result["actions_results"]["update_data_docs"]["az_site"])

except Exception as exception:
    handle_exception(exception, dbutils.notebook.entry_point.getDbutils().notebook().getContext())
    raise exception
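If the aborted Data Assistant run left the persisted checkpoint pointing at a mangled suite name (e.g. one carrying a del/ prefix), re-registering the checkpoint should overwrite the stored config. This reuses the add_or_update_checkpoint call from above; that the plain suite name is the correct one is my assumption:

# Sketch: overwrite the persisted checkpoint config with the correct suite name.
context.add_or_update_checkpoint(
    name=onboarding_checkpoint_name,
    expectation_suite_name="control_tower_ccu_datasources_godistributiondates",  # assumed correct name, without "del/"
    batch_request=batch_request,
)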

molliemarie commented 1 month ago

Hello @damczyk. With the launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click the “Full example code” tab to see a code example).
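For reference, the GX Core (1.0) flow looks roughly like this; a minimal sketch based on the 1.0 fluent API, with placeholder data and names throughout:

import great_expectations as gx
import pandas as pd

# Minimal GX Core (1.0) sketch with placeholder data.
context = gx.get_context()
df = pd.DataFrame({"passenger_count": [1, 2, 3]})

data_source = context.data_sources.add_pandas(name="pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="passenger_count", min_value=1, max_value=6
)
print(batch.validate(expectation))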

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗