great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

[BUG] Missing 'dataframe' in add_dataframe_asset #10687

Open data-han opened 3 days ago

data-han commented 3 days ago

Describe the bug
According to the code docstring, dataframe should be one of the arguments of add_dataframe_asset(). However, this is not the case. I'm running GX 1.2.3 and I don't know where I should link my dataframe. Previously, in 0.18.x, it was available as part of the function.

[screenshot of the add_dataframe_asset() docstring]

Expected behavior
It should have a dataframe argument.

adeola-ak commented 20 hours ago

hi, we'll update the docstring accordingly. dataframe should not be an argument of add_dataframe_asset(). Your dataframe should be provided through Batch Parameters. Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.

The following example creates a dataframe by reading a .csv file and storing it in a Batch Parameter dictionary:

from pyspark.sql import SparkSession

csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
spark = SparkSession.builder.appName("Read CSV").getOrCreate()
dataframe = spark.read.csv(csv, header=True, inferSchema=True)

batch_parameters = {"dataframe": dataframe}
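If you are using pandas rather than Spark, the dictionary has the same shape. A minimal sketch with an in-memory frame (the column names are illustrative, not from the sample data):

```python
import pandas as pd

# A small in-memory frame stands in for data read from disk.
dataframe = pd.DataFrame(
    {"passenger_count": [1, 2, 1], "fare_amount": [7.5, 12.0, 5.25]}
)

# "dataframe" is the Batch Parameter key GX looks for at runtime.
batch_parameters = {"dataframe": dataframe}
```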

You then pass the Batch Parameter dictionary to a get_batch() or validate() method call:

# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)

More information on this here.