great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Checkpoints do not work from within Databricks notebooks #2905

Closed anthonyburdi closed 3 years ago

anthonyburdi commented 3 years ago

Describe the bug
Configuring and running a checkpoint from within a Databricks notebook causes an error. The error appears to be caused by pickle serialization during a deepcopy step within the Checkpoint config.

Here is the relevant part of a traceback from a notebook that configures and then runs a checkpoint:

# checkpoint_result = checkpoint.run()

# This is currently non-functional. The relevant part of the traceback is copied below:

# ---------------------------------------------------------------------------
# TypeError                                 Traceback (most recent call last)
# <command-3790954559126540> in <module>
# ----> 1 checkpoint_result = checkpoint.run()

# /databricks/python/lib/python3.7/site-packages/great_expectations/checkpoint/checkpoint.py in run(self, template_name, run_name_template, expectation_suite_name, batch_request, action_list, evaluation_parameters, runtime_configuration, validations, profilers, run_id, run_name, run_time, result_format, **kwargs)
#     234         }
#     235         substituted_runtime_config: CheckpointConfig = self.get_substituted_config(
# --> 236             runtime_kwargs=runtime_kwargs
#     237         )
#     238         run_name_template: Optional[str] = substituted_runtime_config.run_name_template

# /databricks/python/lib/python3.7/site-packages/great_expectations/checkpoint/checkpoint.py in get_substituted_config(self, config, runtime_kwargs)
#     136 
#     137             if not template_name:
# --> 138                 substituted_config = copy.deepcopy(config)
#     139                 if any(runtime_kwargs.values()):
#     140                     substituted_config.update(runtime_kwargs=runtime_kwargs)

To Reproduce
Steps to reproduce the behavior:

  1. Set up a data context in code within a Databricks notebook, using S3 as the metadata store backend (expectations store, validations store, checkpoint store, data docs); a minimal setup sketch follows this list
  2. Load or create an expectation suite
  3. Create a checkpoint (or load a checkpoint from a store)
  4. Run the checkpoint with checkpoint.run() or context.run_checkpoint()
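
For step 1, a minimal sketch of an in-code data context backed by S3, assuming the 0.13.x API; the bucket name is hypothetical:

from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    S3StoreBackendDefaults,
)

# S3StoreBackendDefaults points the expectations, validations, checkpoint,
# and data docs stores at the same bucket
project_config = DataContextConfig(
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name="my-ge-bucket"),
)
context = BaseDataContext(project_config=project_config)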

Expected behavior
I expect the checkpoint to run with all configured validation actions.

Additional context
Validation is run using a Spark DataFrame in Databricks as the datasource, with a RuntimeDataConnector to connect to the DataFrame.
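
For reference, a sketch of the batch request this describes, assuming a datasource with a RuntimeDataConnector is already configured in the context (all names are illustrative):

from great_expectations.core.batch import RuntimeBatchRequest

# datasource_name and data_connector_name must match the datasource config;
# batch_identifiers keys must match the connector's configured identifiers
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="my_data_asset",  # arbitrary label for the in-memory data
    runtime_parameters={"batch_data": df},  # df is the Spark DataFrame
    batch_identifiers={"run_id": "validation_run"},
)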

wesleyfelipe commented 3 years ago

Hello,

I'm getting the same error running a checkpoint with a Spark DataFrame and a RuntimeDataConnector. I hit the error on Databricks (as @anthonyburdi did), but I also get the same result on my local machine (in a Jupyter notebook).

Here is the notebook in case it helps: great_expectations_poc.zip

jvetu commented 3 years ago

I believe the error persists. I installed 0.13.34, but I still get TypeError: cannot pickle '_thread.RLock' object when passing a Spark DataFrame as batch_data.

davidmaddox-saic commented 3 years ago

I concur with @jvetu. I am running code very similar to @wesleyfelipe's, but using a Spark DataFrame, and I get the same error. I'm running version 0.13.35.

Would that be considered a separate issue?

NathanFarmer commented 3 years ago

Hi @wesleyfelipe @jvetu @davidmaddox-saic! This issue has been addressed by PR #3502 and will be included in release 0.13.38.

alitsaberi commented 3 years ago

Hi @NathanFarmer, I'm using version 0.13.42 and still get this error when trying to run a checkpoint on a Spark DataFrame.

checkpoint_run_result = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    batch_request={
        "runtime_parameters": {
            "batch_data": df,
        }
    },
    run_name="Hello",
)

NathanFarmer commented 3 years ago

Hi @alit8, support for RuntimeBatchRequest in Checkpoints is under development in the current sprint. You should be able to work around this for now by passing the batch request into a Checkpoint object using validations (sketched below). Open PR #3680 addresses passing the RuntimeBatchRequest into a Checkpoint object instead of into context.run_checkpoint.
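
For anyone hitting this before the fix lands, a sketch of the suggested workaround, assuming the 0.13.x SimpleCheckpoint API; the suite and checkpoint names are illustrative:

from great_expectations.checkpoint import SimpleCheckpoint

# Pass the RuntimeBatchRequest through validations on the Checkpoint object
# itself, rather than through context.run_checkpoint()
checkpoint = SimpleCheckpoint(
    name="my_checkpoint",
    data_context=context,
    validations=[
        {
            "batch_request": batch_request,  # a RuntimeBatchRequest wrapping the DataFrame
            "expectation_suite_name": "my_suite",
        }
    ],
)
checkpoint_result = checkpoint.run()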