great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'. #10270

Closed SiddhantSadangi closed 2 months ago

SiddhantSadangi commented 2 months ago

Describe the bug: Cannot run a checkpoint on a validation suite when using a dataframe asset.

To Reproduce Code:

import great_expectations as gx
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

context = gx.get_context()

data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(name="expectations")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="fare_amount", min_value=0)
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="pickup_datetime")
)

validation_definition = context.validation_definitions.add(
    gx.core.validation_definition.ValidationDefinition(
        name="validation definition",
        data=batch_definition,
        suite=suite,
    )
)

checkpoint = context.checkpoints.add(
    gx.checkpoint.checkpoint.Checkpoint(
        name="checkpoint", validation_definitions=[validation_definition]
    )
)

checkpoint_result = checkpoint.run()
print(checkpoint_result.describe())

Stack trace:

Traceback (most recent call last):
  File "c:\Users\siddh\Code\Neptune\adhoc\test.py", line 45, in <module>
    checkpoint_result = checkpoint.run()
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\checkpoint\checkpoint.py", line 178, in run
    run_results = self._run_validation_definitions(
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\checkpoint\checkpoint.py", line 199, in _run_validation_definitions
    validation_result = validation_definition.run(
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\core\validation_definition.py", line 234, in run
    results = validator.validate_expectation_suite(self.suite, expectation_parameters)
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\validator\v1_validator.py", line 63, in validate_expectation_suite
    results = self._validate_expectation_configs(
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\validator\v1_validator.py", line 117, in _validate_expectation_configs
    processed_expectation_configs = self._wrapped_validator.process_expectations_for_validation(
  File "C:\Users\siddh\AppData\Local\Programs\Python\Python310\lib\functools.py", line 981, in __get__
    val = self.func(instance)
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\validator\v1_validator.py", line 106, in _wrapped_validator
    batch_request = self._batch_definition.build_batch_request(
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\core\batch_definition.py", line 56, in build_batch_request
    return self.data_asset.build_batch_request(
  File "C:\Users\siddh\python_envs\py310\lib\site-packages\great_expectations\datasource\fluent\pandas_datasource.py", line 407, in build_batch_request
    raise BuildBatchRequestError(message="options must contain exactly 1 key, 'dataframe'.")
great_expectations.exceptions.exceptions.BuildBatchRequestError: Bad input to build_batch_request: options must contain exactly 1 key, 'dataframe'.

Expected behavior: The checkpoint runs without any issues.

adeola-ak commented 2 months ago

Hi there, thank you for bringing this to our attention. I was able to replicate this error even after removing the checkpoint, so it seems something else is causing batch creation to fail. I will keep looking into this and have escalated it as well. I'll check back with you soon.

adeola-ak commented 2 months ago

Okay, it looks like we will have to work on making the error a little more helpful here.

I've made a few minimal edits to the provided file. The biggest one is adding this line: validation_results = validation_definition.run(batch_parameters=batch_parameters)

A validation_definition.run() call needs to be present, and it needs to know which batch to run against. You tell it by passing batch_parameters to the validation_definition.run method.
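To illustrate why the original script failed, here is a toy model of the call chain shown in the stack trace. It is only a sketch and not the Great Expectations source: the Toy* classes are hypothetical, and only the error message and the name BuildBatchRequestError are taken from the traceback. The idea is that a dataframe asset stores no data, so the dataframe has to be handed in at run time.

```python
# Toy model only: the Toy* classes are hypothetical and are NOT the
# Great Expectations implementation. They mimic the call chain in the traceback.
from typing import Any, Dict, List, Optional


class BuildBatchRequestError(ValueError):
    """Stand-in for the exception named in the stack trace."""


class ToyDataFrameAsset:
    """Holds no data itself, so the dataframe must arrive at run time."""

    def build_batch_request(self, options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        # Reject anything other than exactly one key named 'dataframe'.
        if not options or set(options) != {"dataframe"}:
            raise BuildBatchRequestError(
                "options must contain exactly 1 key, 'dataframe'."
            )
        return {"dataframe": options["dataframe"]}


class ToyValidationDefinition:
    def __init__(self, asset: ToyDataFrameAsset) -> None:
        self.asset = asset

    def run(self, batch_parameters: Optional[Dict[str, Any]] = None) -> int:
        # The batch parameters are forwarded to the asset unchanged.
        request = self.asset.build_batch_request(options=batch_parameters)
        return len(request["dataframe"])  # pretend "validation": count the rows


class ToyCheckpoint:
    def __init__(self, validation_definitions: List[ToyValidationDefinition]) -> None:
        self.validation_definitions = validation_definitions

    def run(self, batch_parameters: Optional[Dict[str, Any]] = None) -> List[int]:
        # Each validation definition receives the same batch parameters.
        return [vd.run(batch_parameters=batch_parameters) for vd in self.validation_definitions]


checkpoint = ToyCheckpoint([ToyValidationDefinition(ToyDataFrameAsset())])

try:
    checkpoint.run()  # no dataframe supplied, as in the original report
except BuildBatchRequestError as err:
    print(f"failed: {err}")

rows = [1, 2, 3]  # stands in for a real dataframe
print(checkpoint.run(batch_parameters={"dataframe": rows}))  # prints [3]
```

In this model the run without arguments fails with the same message as in the report, and the run with batch_parameters goes through.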

This should solve your issue:

import great_expectations as gx # type: ignore
import pandas as pd # type: ignore

df = pd.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

context = gx.get_context(mode="file")
print(context)

data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    "batch-def"
)

batch_definition = (
    context.data_sources.get("pandas")
    .get_asset("pd dataframe asset")
    .get_batch_definition("batch-def")
)

batch_parameters = {"dataframe": df}

batch = batch_definition.get_batch(batch_parameters=batch_parameters)

suite = gx.ExpectationSuite(name="expectation_suite-4")
suite = context.suites.add(suite)

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", min_value=1, max_value=6
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="fare_amount", min_value=0)
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="pickup_datetime")
)

definition_name = "validation_definition-4"
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=suite, name=definition_name
)

validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)

katharine-fuzesi commented 2 months ago

Please update the documentation here: https://docs.greatexpectations.io/docs/core/run_validations/run_a_validation_definition

adeola-ak commented 2 months ago

Updated ValidationDefinition API docs

chrishartono commented 2 months ago

Expected behavior Checkpoint run without any issues

CC @SiddhantSadangi

I ran into the same issue before and found a solution. I hope this helps, cheers!

Solution

# Pass batch_parameters when calling checkpoint.run(...).
# During execution, checkpoint.run(...) calls validation_definition.run(...)
# internally, as described above, so the dataframe has to be supplied here.
import pandas as pd

df: pd.DataFrame = ...
batch_parameters = {"dataframe": df}
checkpoint_result = checkpoint.run(batch_parameters=batch_parameters)

SiddhantSadangi commented 1 month ago

Hey @chrishartono, @adeola-ak, thanks! I'll check the workarounds and let you know if they work 🤗

SiddhantSadangi commented 1 week ago

Sorry for the delay here, but I was finally able to test it. Works for me now ✅