paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.
https://paddymul.github.io/buckaroo/
BSD 3-Clause "New" or "Revised" License
caching/multiple step pinned columns #305

Open paddymul opened 1 month ago

How would you categorize this request. You can select multiple if not sure

Summary stats

Enhancement Description

When Buckaroo is used for exploratory data analysis, it is best to think of the different steps as a pipeline.

It is very useful to be able to compare different steps of the pipeline, particularly their summary stats, and to show those comparisons to the user. Some of these steps don't have an explicit representation in Buckaroo state.

Generally the flow goes raw_df -> cleaned_df -> filtered_df -> lowcode_transformed_df -> transformed_df -> summary_stats

cleaned_df -> filtered_df -> lowcode_transformed_df are all jumbled together, but by accident or by user interaction convention, the user flow generally goes

raw_df -> summary_stats

raw_df -> cleaned_df -> summary_stats

Then raw_df -> cleaned_df -> filtered_df -> summary_stats or raw_df -> cleaned_df -> transformed_df -> summary_stats

Finally, since it requires the most user interaction, raw_df -> cleaned_df -> lowcode_transformed_df -> summary_stats

I would like to be able to show the following types of pinned_rows, if they exist

"dtype" for all states: raw:dtype and cleaned:dtype are frequently different. The same goes for null_count.

histograms are very likely to change between cleaned and filtered_df states. These should all be visible in the UI at once.
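To make the raw:dtype / cleaned:dtype idea concrete, here is a minimal sketch that builds one pinned row per (state, stat) pair. It is not Buckaroo's API: the `pinned_rows` helper, the `state:stat` label format as a row index, and the sample data are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as strings, with one unparseable value.
raw_df = pd.DataFrame({"age": ["31", "45", "n/a", "27"]})

# Hypothetical cleaning step: coerce to numeric, turning bad values into NaN.
cleaned_df = raw_df.assign(age=pd.to_numeric(raw_df["age"], errors="coerce"))


def pinned_rows(states, stats):
    """Build one row per (state, stat) pair, labelled 'state:stat'.

    Hypothetical helper, not part of Buckaroo.
    """
    rows = {}
    for state_name, df in states.items():
        for stat_name, stat_func in stats.items():
            rows[f"{state_name}:{stat_name}"] = stat_func(df)
    # Each stat returns a Series indexed by column name, so transpose
    # to get one row per label and one column per dataframe column.
    return pd.DataFrame(rows).T


stats = {
    "dtype": lambda df: df.dtypes.astype(str),
    "null_count": lambda df: df.isna().sum(),
}

pinned = pinned_rows({"raw": raw_df, "cleaned": cleaned_df}, stats)
print(pinned)
```

Here the cleaned state reports a float dtype and one null, while the raw state reports a string-like dtype and no nulls, which is the kind of difference the pinned rows are meant to surface side by side.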

Thinking of this in terms of pure functions

You can think of these as functions

cleaned_df is a function of raw_df and cleaning_method

filtered_df is a function of raw_df, cleaning_method, and filter_args

transformed_df is a function of raw_df, cleaning_method, filter_args, and transform_method
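The function view above could be sketched roughly as follows. This is only an illustration under assumptions: `cleaning_method` and `transform_method` are taken to be plain callables, `filter_args` is taken to be a `DataFrame.query` string, and the sample cleaning and transform functions are hypothetical.

```python
import pandas as pd


def cleaned_df(raw_df, cleaning_method):
    # Assumption: cleaning_method is a callable that returns a new dataframe.
    return cleaning_method(raw_df)


def filtered_df(raw_df, cleaning_method, filter_args):
    # Assumption: filter_args is a pandas query string such as "age > 30".
    return cleaned_df(raw_df, cleaning_method).query(filter_args)


def transformed_df(raw_df, cleaning_method, filter_args, transform_method):
    # Assumption: transform_method is a callable that returns a new dataframe.
    return transform_method(filtered_df(raw_df, cleaning_method, filter_args))


# Hypothetical example inputs.
raw = pd.DataFrame({"age": ["31", "45", "n/a", "27"]})


def coerce_and_drop(df):
    """Hypothetical cleaning method: parse numbers, drop unparseable rows."""
    return df.assign(age=pd.to_numeric(df["age"], errors="coerce")).dropna()


def add_age_months(df):
    """Hypothetical transform: add a derived column without mutating the input."""
    return df.assign(age_months=df["age"] * 12)


result = transformed_df(raw, coerce_and_drop, "age > 30", add_age_months)
print(result)
```

Because each step returns a new dataframe and depends only on its arguments, the same arguments always give the same result, which is what makes the steps cacheable.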

For configuration, we don't want to name pinned_rows by the full args of cleaning_method and filter_args.

Instead, we want to be able to add rows as

summary_stats[('cleaned', current)] or summary_stats[('cleaned', current), ('filtered', current)]

All of this dovetails quite nicely with a caching/memoization mechanism.
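One possible shape for such a mechanism, sketched under assumptions: results are memoized by (step name, args key), and the cache remembers which args key is "current" for each step so a lookup like ('cleaned', current) does not need the full args. The `PipelineCache` class, its methods, and the string keys for cleaning methods are hypothetical and not part of Buckaroo.

```python
import pandas as pd


class PipelineCache:
    """Hypothetical cache: memoizes step results and tracks current args per step."""

    def __init__(self):
        self._results = {}  # (step_name, args_key) -> computed value
        self._current = {}  # step_name -> args_key most recently requested
        self.compute_count = 0

    def get_or_compute(self, step_name, args_key, compute):
        key = (step_name, args_key)
        if key not in self._results:
            # Only run the step when this exact combination is new.
            self._results[key] = compute()
            self.compute_count += 1
        self._current[step_name] = args_key
        return self._results[key]

    def current(self, step_name):
        # Resolve ('cleaned', current) without spelling out the full args.
        return self._results[(step_name, self._current[step_name])]


raw_df = pd.DataFrame({"age": [31.0, None, 45.0, 27.0]})
cache = PipelineCache()

# Hypothetical cleaning methods, identified by a hashable name.
cleaning_methods = {
    "drop_nulls": lambda df: df.dropna(),
    "fill_zero": lambda df: df.fillna(0),
}


def cleaned(method_name):
    return cache.get_or_compute(
        "cleaned", method_name, lambda: cleaning_methods[method_name](raw_df)
    )


cleaned("drop_nulls")
cleaned("fill_zero")
cleaned("drop_nulls")  # served from the cache, no recomputation

current_null_count = int(cache.current("cleaned")["age"].isna().sum())
print(cache.compute_count, current_null_count)
```

Switching back to a cleaning method the user has already tried reuses the stored dataframe, and summary stats for the current cleaned state can be looked up by step name alone.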

Pseudo Code Implementation

Uhm

Prior Art

?