Checks
[X] I have checked that this enhancement has not already been requested
How would you categorize this request? You can select multiple if not sure.
Summary stats
Enhancement Description
When Buckaroo is used for exploratory data analysis, it is best to think of the different steps as a pipeline.
It is very useful to be able to compare different steps of the pipeline, particularly their summary stats, and to show those comparisons to the user. Some of these steps don't have an explicit representation in Buckaroo state.
Generally the flow goes
raw_df -> cleaned_df -> filtered_df -> lowcode_transformed_df -> transformed_df -> summary_stats
The cleaned_df -> filtered_df -> lowcode_transformed_df steps are all jumbled together, but by accident or by user-interaction convention, the user flow generally goes
raw_df -> summary_stats
raw_df -> cleaned_df -> summary_stats
Then
raw_df -> cleaned_df -> filtered_df -> summary_stats
or
raw_df -> cleaned_df -> transformed_df -> summary_stats
Finally, since it requires the most user interaction
raw_df -> cleaned_df -> lowcode_transformed_df -> summary_stats
I would like to be able to show the following types of pinned_rows, if they exist:
"dtype" for all states; raw:dtype and cleaned:dtype are frequently different
null_count, similarly
histograms, which are very likely to change between the cleaned_df and filtered_df states
These should all be visible in the UI at once.
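A rough sketch of what per-stage pinned rows could look like. This is not Buckaroo code: plain dicts of lists stand in for DataFrames so it runs without any dependencies, and `Frame`, `dtype_of`, `null_count`, `STATS`, and `pinned_rows` are all hypothetical names. The only thing taken from the request is the `stage:stat` naming such as `raw:dtype` and `cleaned:dtype`.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a DataFrame: column name -> list of values.
Frame = Dict[str, List[Any]]


def dtype_of(values: List[Any]) -> str:
    """Name the single type shared by the non-null values, else 'object'."""
    kinds = {type(v).__name__ for v in values if v is not None}
    return kinds.pop() if len(kinds) == 1 else "object"


def null_count(values: List[Any]) -> int:
    """Count the missing (None) values in one column."""
    return sum(v is None for v in values)


# Hypothetical registry of the stats we want pinned for every stage.
STATS: Dict[str, Callable[[List[Any]], Any]] = {
    "dtype": dtype_of,
    "null_count": null_count,
}


def pinned_rows(stages: Dict[str, Frame]) -> Dict[str, Dict[str, Any]]:
    """Build one row per (stage, stat), keyed like 'raw:dtype'."""
    rows: Dict[str, Dict[str, Any]] = {}
    for stage_name, frame in stages.items():
        for stat_name, stat in STATS.items():
            rows[f"{stage_name}:{stat_name}"] = {
                col: stat(vals) for col, vals in frame.items()
            }
    return rows


raw = {"age": ["31", None, "45"], "city": ["Oslo", "Rome", None]}
# Assumed cleaning step for illustration: numeric strings parsed into ints.
cleaned = {"age": [31, None, 45], "city": ["Oslo", "Rome", None]}

rows = pinned_rows({"raw": raw, "cleaned": cleaned})
```

Here `rows["raw:dtype"]` and `rows["cleaned:dtype"]` differ for the `age` column, which is the kind of difference the UI would show side by side.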
Thinking of this in terms of pure functions
You can think of these as functions
cleaned_df is a function of raw_df and cleaning_method
filtered_df is a function of raw_df, cleaning_method, and filter_args
transformed_df is a function of raw_df, cleaning_method, filter_args, and transform_method
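Spelled out as actual pure functions, the three definitions above might look roughly like this. It is only a sketch under assumptions: frames are plain dicts of lists rather than real DataFrames, the shape of `filter_args` (a column plus a minimum value) is made up, and `drop_null_rows` and `add_double_score` are hypothetical example methods.

```python
from typing import Any, Callable, Dict, List

Frame = Dict[str, List[Any]]  # hypothetical stand-in for a DataFrame


def cleaned_df(raw_df: Frame, cleaning_method: Callable[[Frame], Frame]) -> Frame:
    return cleaning_method(raw_df)


def filtered_df(raw_df: Frame, cleaning_method: Callable[[Frame], Frame],
                filter_args: Dict[str, Any]) -> Frame:
    frame = cleaned_df(raw_df, cleaning_method)
    # Assumed filter_args shape: {"column": name, "min": lowest value to keep}.
    column = frame[filter_args["column"]]
    keep = [i for i, v in enumerate(column)
            if v is not None and v >= filter_args["min"]]
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}


def transformed_df(raw_df: Frame, cleaning_method: Callable[[Frame], Frame],
                   filter_args: Dict[str, Any],
                   transform_method: Callable[[Frame], Frame]) -> Frame:
    return transform_method(filtered_df(raw_df, cleaning_method, filter_args))


def drop_null_rows(frame: Frame) -> Frame:
    """Example cleaning method: drop every row that has a None in any column."""
    n_rows = len(next(iter(frame.values()), []))
    keep = [i for i in range(n_rows)
            if all(vals[i] is not None for vals in frame.values())]
    return {col: [vals[i] for i in keep] for col, vals in frame.items()}


def add_double_score(frame: Frame) -> Frame:
    """Example transform: add a derived column without mutating the input."""
    return {**frame, "score_x2": [v * 2 for v in frame["score"]]}


raw_df = {"age": [31, None, 45, 17], "score": [7, 9, None, 4]}
filter_args = {"column": "age", "min": 18}

cleaned = cleaned_df(raw_df, drop_null_rows)
filtered = filtered_df(raw_df, drop_null_rows, filter_args)
transformed = transformed_df(raw_df, drop_null_rows, filter_args, add_double_score)
```

Every stage is recomputed from `raw_df` plus its arguments and nothing is mutated, so any intermediate state can be produced on demand for a stats comparison.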
For configuration, we don't want to name pinned_rows by the full args of cleaning_method and filter_args.
Instead, we want to be able to add rows as
summary_stats[('cleaned', current)] or summary_stats[('cleaned', current), ('filtered', current)]
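One possible reading of that lookup syntax, as a sketch only: `current` is treated as a placeholder that gets swapped for whatever is selected for that stage at lookup time. The `SummaryStats` class, the `CURRENT` sentinel, and the `state` and `computed` dicts are all hypothetical and not part of Buckaroo.

```python
from typing import Any, Dict, Tuple

CURRENT = object()  # hypothetical sentinel meaning "whatever is selected now"


class SummaryStats:
    def __init__(self, state: Dict[str, Any], computed: Dict[Tuple, Any]):
        self.state = state        # stage name -> currently selected args
        self.computed = computed  # fully resolved key -> stats for that state

    def _resolve(self, key: Tuple) -> Tuple:
        # stats[('cleaned', CURRENT)] passes one pair; the two-stage form
        # passes a tuple of pairs. Normalize both to a tuple of pairs.
        pairs = (key,) if isinstance(key[0], str) else key
        return tuple(
            (stage, self.state[stage] if selector is CURRENT else selector)
            for stage, selector in pairs
        )

    def __getitem__(self, key: Tuple) -> Any:
        return self.computed[self._resolve(key)]


state = {"cleaned": "strip_whitespace", "filtered": "age >= 18"}
computed = {
    (("cleaned", "strip_whitespace"),): {"null_count": {"age": 1}},
    (("cleaned", "strip_whitespace"), ("filtered", "age >= 18")):
        {"null_count": {"age": 0}},
}
summary_stats = SummaryStats(state, computed)

cleaned_stats = summary_stats[("cleaned", CURRENT)]
filtered_stats = summary_stats[("cleaned", CURRENT), ("filtered", CURRENT)]
```

The configuration only ever mentions stage names, and the full arguments are filled in from state when the row is looked up.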
All of this dovetails quite nicely with a caching/memoization mechanism.
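To illustrate the caching point: if each stage is a pure function of hashable arguments, the standard library's `functools.lru_cache` already gives memoization for free. This is a sketch with assumptions: frames are tuples of row tuples so they can be hashed, cleaning methods are looked up by name in a hypothetical `CLEANERS` registry, and `min_first_col` is a made-up stand-in for filter_args.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], ...]
Frame = Tuple[Row, ...]  # hashable stand-in for a DataFrame

call_log: List[str] = []  # records which stages actually ran

# Hypothetical registry so cleaning methods can be named by a hashable string.
CLEANERS = {
    "drop_null_rows": lambda frame: tuple(r for r in frame if None not in r),
}


@lru_cache(maxsize=None)
def cleaned_df(raw_df: Frame, cleaning_method: str) -> Frame:
    call_log.append("cleaned")
    return CLEANERS[cleaning_method](raw_df)


@lru_cache(maxsize=None)
def filtered_df(raw_df: Frame, cleaning_method: str, min_first_col: int) -> Frame:
    call_log.append("filtered")
    cleaned = cleaned_df(raw_df, cleaning_method)
    return tuple(r for r in cleaned if r[0] >= min_first_col)


raw = ((31, 7), (None, 9), (45, None), (17, 4))

a = filtered_df(raw, "drop_null_rows", 18)  # runs filtering and cleaning
b = filtered_df(raw, "drop_null_rows", 30)  # new filter args, cleaning reused
c = filtered_df(raw, "drop_null_rows", 18)  # same args as `a`, nothing reruns
```

Changing only the filter re-runs only the filter step, and the cached cleaned state stays available for comparing summary stats. Real DataFrames are not hashable, so an actual implementation would need some other cache key.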
Pseudo Code Implementation
Uhm
Prior Art
?