data-han opened this issue 3 days ago
Hi, we'll update the docstring accordingly. `dataframe` should not be part of the arguments of `add_dataframe_asset()`; your dataframe should be provided through Batch Parameters instead. Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.
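For context, in GX 1.x the Data Source, Data Asset, and Batch Definition are created without any dataframe argument. A minimal sketch (the data source, asset, and batch definition names here are illustrative):

```python
import great_expectations as gx

context = gx.get_context()

# Create a Spark Data Source and a dataframe asset -- note: no dataframe argument
data_source = context.data_sources.add_spark(name="my_spark_datasource")
data_asset = data_source.add_dataframe_asset(name="my_dataframe_asset")

# A "whole dataframe" Batch Definition; the actual dataframe is supplied at runtime
batch_definition = data_asset.add_batch_definition_whole_dataframe("my_batch_definition")
```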
The following example creates a dataframe by reading a .csv file and stores it in a Batch Parameter dictionary:
```python
from pyspark.sql import SparkSession

# Read the sample .csv file into a Spark dataframe
csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
spark = SparkSession.builder.appName("Read CSV").getOrCreate()
dataframe = spark.read.csv(csv, header=True, inferSchema=True)

# Store the in-memory dataframe in a Batch Parameter dictionary
batch_parameters = {"dataframe": dataframe}
```
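If you are working with pandas rather than Spark, the same pattern applies (a minimal sketch, assuming a pandas-backed Data Source set up the same way):

```python
import pandas as pd

# Read the same sample data into a pandas dataframe
dataframe = pd.read_csv("./data/folder_with_data/yellow_tripdata_sample_2019-01.csv")

# The Batch Parameter dictionary is identical to the Spark case
batch_parameters = {"dataframe": dataframe}
```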
You ultimately pass the Batch Parameter dictionary to a `get_batch()` or `validate()` method call:
```python
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
```
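From there the Batch can be validated directly. A minimal sketch, assuming the GX 1.x Expectation classes and an illustrative column name:

```python
import great_expectations.expectations as gxe

# Validate the in-memory dataframe against a single Expectation
expectation = gxe.ExpectColumnValuesToNotBeNull(column="vendor_id")
validation_result = batch.validate(expectation)
print(validation_result.success)
```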
More information on this can be found here.
**Describe the bug**
According to the code docstring, `dataframe` should be part of the arguments of `add_dataframe_asset()`. However, this is not the case. I'm running GX 1.2.3 and I don't know where I should link my dataframe to. Previously, in 0.18.x, it was available as part of the function.

**To Reproduce**
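A minimal reproduction based on the description above (the 0.18.x-style call; the data source and asset names are illustrative):

```python
from pyspark.sql import SparkSession

import great_expectations as gx

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([(1, "a")], ["id", "value"])

context = gx.get_context()
data_source = context.data_sources.add_spark(name="my_spark_datasource")

# In 0.18.x a dataframe could be attached directly to the asset;
# in 1.2.3 this call fails because add_dataframe_asset() no longer
# accepts a dataframe argument
data_asset = data_source.add_dataframe_asset(name="my_asset", dataframe=dataframe)
```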
**Expected behavior**
It should have a `dataframe` argument.

**Environment (please complete the following information):**
**Additional context**