data-han opened this issue 3 days ago
Hi, we'll update the docstring accordingly. `dataframe` should not be part of the arguments of `add_dataframe_asset()`; your dataframe should be provided through Batch Parameters instead. Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.
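For context, in GX 1.x the Data Source, Data Asset, and Batch Definition are created without any dataframe argument. A minimal sketch (the data source, asset, and batch definition names here are illustrative):

```python
import great_expectations as gx

context = gx.get_context()

# Create a Spark Data Source and a dataframe asset -- note: no dataframe argument
data_source = context.data_sources.add_spark(name="my_spark_datasource")
data_asset = data_source.add_dataframe_asset(name="my_dataframe_asset")

# A "whole dataframe" Batch Definition; the actual dataframe is supplied at runtime
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")
```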
The following example creates a dataframe by reading a .csv file and stores it in a Batch Parameter dictionary:
```python
from pyspark.sql import SparkSession

# Read the sample .csv file into a Spark dataframe
csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
spark = SparkSession.builder.appName("Read CSV").getOrCreate()
dataframe = spark.read.csv(csv, header=True, inferSchema=True)

# Store the in-memory dataframe in a Batch Parameter dictionary
batch_parameters = {"dataframe": dataframe}
```
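If you are working with pandas rather than Spark, the same pattern applies (a minimal sketch, assuming a pandas-backed Data Source set up the same way):

```python
import pandas as pd

# Read the same sample data into a pandas dataframe
dataframe = pd.read_csv("./data/folder_with_data/yellow_tripdata_sample_2019-01.csv")

# The Batch Parameter dictionary is identical to the Spark case
batch_parameters = {"dataframe": dataframe}
```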
You ultimately pass the Batch Parameter dictionary to a `get_batch()` or `validate()` method call:
```python
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
```
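From there the Batch can be validated directly. A minimal sketch, assuming the GX 1.x Expectation classes and an illustrative column name:

```python
import great_expectations.expectations as gxe

# Validate the in-memory dataframe against a single Expectation
expectation = gxe.ExpectColumnValuesToNotBeNull(column="vendor_id")
validation_result = batch.validate(expectation)
print(validation_result.success)
```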
More information on this can be found here.
**Describe the bug**
According to the code docstring, `dataframe` should be part of the arguments of `add_dataframe_asset()`. However, this is not the case. I'm running GX 1.2.3 and I don't know where I should link my dataframe to. Previously, in 0.18.x, it was available as part of the function.

**To Reproduce**
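A minimal reproduction based on the description above (the 0.18.x-style call; the data source and asset names are illustrative):

```python
from pyspark.sql import SparkSession

import great_expectations as gx

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([(1, "a")], ["id", "value"])

context = gx.get_context()
data_source = context.data_sources.add_spark(name="my_spark_datasource")

# In 0.18.x a dataframe could be attached directly to the asset;
# in 1.2.3 this call fails because add_dataframe_asset() no longer
# accepts a dataframe argument
data_asset = data_source.add_dataframe_asset(name="my_asset", dataframe=dataframe)
```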
**Expected behavior**
It should have a `dataframe` argument.

**Environment (please complete the following information):**
**Additional context**