paddymul / buckaroo

Buckaroo - the data wrangling assistant for pandas. Quickly explore dataframes, and run pandas commands via a GUI. Works inside the jupyter notebook.
https://buckaroo-data.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Feat: limit / specify number of rows to display #253

Open pep-sanwer opened 5 months ago

pep-sanwer commented 5 months ago

Checks

How would you categorize this request. You can select multiple if not sure

Display (is this related to visual display of a value)

Enhancement Description

As far as I understand, DataFrames are currently displayed in their entirety up to 10k rows; beyond that, they are sampled down to 10k rows before display.

This request is for an argument to DFViewer, or wherever makes the most sense, that limits the number of rows displayed to some n, where 1 < n < 10k.

While I understand that it's possible to call DFViewer(df.head(10)) to display only 10 rows, doing so also computes summary stats over only those 10 rows. I'm looking for behavior like the following:

DFViewer(df, max_rows=10)  # only displays 10 rows, show summary stats over entire / sampled df

If this is already possible, my apologies.

Appreciative of this great tool!

Pseudo Code Implementation

NA

Prior Art

NA

paddymul commented 5 months ago

Thanks for the interest. BuckarooWidget and PolarsBuckarooWidget have a facility for changing sampling behavior through inheritance. Sampling occurs before summary stats and before serialization. The Python side of serialization is very slow. In the following code I modified DFViewer to accept a widget_klass. I also wrote a BuckarooWidget subclass that uses a severely restrictive sampling_klass.

Try this code snippet out.

I will definitely modify the DFViewer function to accept a widget_klass in an upcoming release.

I could add an option for configuring sampling behavior, but for now I'd like to wait. You can write your own utility function to build a sampling_klass and assemble a DFViewer as you see fit. What do you think about the ergonomics of one way vs. the other?

import pandas as pd

from buckaroo.buckaroo_widget import RawDFViewerWidget, BuckarooWidget
from buckaroo.dataflow.widget_extension_utils import (configure_buckaroo)
from buckaroo.dataflow.dataflow_extras import Sampling

def DFViewer(df,
             column_config_overrides=None,
             extra_pinned_rows=None, pinned_rows=None,
             extra_analysis_klasses=None, analysis_klasses=None,
             widget_klass=BuckarooWidget):
    """
    Display a DataFrame with buckaroo styling and analysis, no extra UI pieces

    column_config_overrides allows targeted, specific overriding of styling

    extra_pinned_rows adds pinned_rows of summary stats
    pinned_rows replaces the default pinned rows

    extra_analysis_klasses adds an analysis_klass
    analysis_klasses replaces default analysis_klass
    """
    BuckarooKls = configure_buckaroo(
        widget_klass,
        extra_pinned_rows=extra_pinned_rows, pinned_rows=pinned_rows,
        extra_analysis_klasses=extra_analysis_klasses, analysis_klasses=analysis_klasses)

    bw = BuckarooKls(df, column_config_overrides=column_config_overrides)
    dfv_config = bw.df_display_args['dfviewer_special']['df_viewer_config']
    df_data = bw.df_data_dict['main']
    summary_stats_data = bw.df_data_dict['all_stats']
    return RawDFViewerWidget(
        df_data=df_data, df_viewer_config=dfv_config, summary_stats_data=summary_stats_data)

df = pd.DataFrame({'a':[10, 20, 339, 887], 'b': ['foo', 'bar', None, 'baz']})
#DFViewer(df)

class TwoSample(Sampling):
    pre_limit = 5
    max_columns = 1
    serialize_limit = 2

class TwoBuckaroo(BuckarooWidget):
    sampling_klass = TwoSample
DFViewer(df, widget_klass=TwoBuckaroo)
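
A utility function of the kind suggested above might look like the following minimal sketch. The helper names make_sampling_klass and make_widget_klass are hypothetical, not part of buckaroo, and StandInSampling and StandInWidget are placeholder classes used only so the sketch runs without buckaroo installed; their limit values are illustrative, not buckaroo's real defaults. The only things taken from the snippet above are the attribute names pre_limit, max_columns, serialize_limit, and sampling_klass.

```python
class StandInSampling:
    """Placeholder for buckaroo's Sampling class (values are illustrative only)."""
    pre_limit = 1_000_000
    max_columns = 250
    serialize_limit = 10_000


class StandInWidget:
    """Placeholder for BuckarooWidget; only the sampling_klass hook is modeled."""
    sampling_klass = StandInSampling


def make_sampling_klass(base, **limits):
    """Build a subclass of `base` that overrides only the given limits."""
    allowed = {"pre_limit", "max_columns", "serialize_limit"}
    unknown = set(limits) - allowed
    if unknown:
        raise TypeError(f"unknown sampling options: {sorted(unknown)}")
    return type("CustomSampling", (base,), dict(limits))


def make_widget_klass(widget_base, sampling_klass):
    """Build a widget subclass that points at the given sampling class."""
    return type("CustomWidget", (widget_base,), {"sampling_klass": sampling_klass})


# Override one limit; the others are inherited from the base class.
FiveRowSampling = make_sampling_klass(StandInSampling, serialize_limit=5)
FiveRowWidget = make_widget_klass(StandInWidget, FiveRowSampling)
```

With buckaroo installed, the same helpers could presumably be called with Sampling and BuckarooWidget as the base classes and the result passed as widget_klass to the modified DFViewer above. Since sampling happens before summary stats, lowering these limits may also change the stats.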

pep-sanwer commented 5 months ago

Appreciate the speedy response!

I tried the code snippet you shared, and while it looked promising, I wasn't able to produce the behavior I was looking for. Playing with pre_limit and serialize_limit did limit the number of displayed rows, but it also altered the sampling behavior. In my test case, I have a dataframe with 300 rows, and I'd like to see summary stats across the entire dataframe while showing only the top 5 and bottom 5 rows (by index), akin to the default pandas behavior.

Just to clarify, I love the current logic of the default dataframe view after import buckaroo. What I'm looking for is to keep that wonderful logic but display fewer rows, or a configurable number of rows, something akin to pandas's pd.options.display.max_rows.

Ex:

import pandas as pd

df = pd.DataFrame({"a": range(300), "b": ["c" * i for i in range(300)]})
df

shows image

import polars as pl

pl.from_dataframe(df)

shows image

import buckaroo

df

shows all 300 rows, with summary stats over all 300 rows. The desired behavior is to show only the top 5 and bottom 5 rows, with summary stats still computed over all 300 rows.
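
To make the desired split concrete, here is a minimal sketch in plain pandas, independent of buckaroo: the rows to display come from the head and tail, while the stats are computed over the full frame. The function name head_tail_preview is hypothetical, and df.describe() stands in for buckaroo's own summary stats.

```python
import pandas as pd


def head_tail_preview(df, n=5):
    """Return (preview, stats): first and last n rows, stats over every row."""
    # Stats are computed on the full frame, before any rows are dropped.
    stats = df.describe(include="all")
    if len(df) <= 2 * n:
        # Small frames are shown whole; head and tail would overlap.
        return df, stats
    preview = pd.concat([df.head(n), df.tail(n)])
    return preview, stats


df = pd.DataFrame({"a": range(300), "b": ["c" * i for i in range(300)]})
preview, stats = head_tail_preview(df, n=5)
print(preview.index.tolist())  # [0, 1, 2, 3, 4, 295, 296, 297, 298, 299]
```

The preview keeps the original index, so the gap between rows 4 and 295 is visible even without an ellipsis row.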

paddymul commented 5 months ago

Other than the ellipsis row this should do what you want. I'd need to think a bit about how to accommodate an ellipsis row. You could just do values, but really you want a row with different styling, which requires a separate release for frontend mods.

Screenshot 2024-03-12 at 10 40 04 AM

paddymul commented 5 months ago

As for customizing the default display behavior: I love that you want to do this. It's exactly how I want people to use Buckaroo: customize it with their own opinions and make it do what they want by default.

There are a couple of ways to get the behavior you want, all of which will require some dev work on my end.

  1. Customize the implementation of buckaroo.widget_utils.enable. It should accept tuples of (BuckarooKls, dataframeType). Then you could have a one-liner that calls enable with your own customized widget. That will work for pandas; it's harder for polars and geopandas, since I have done a bunch of work to keep those dependencies optional.
  2. Use some type of customization framework so you could have a .buckaroo config file.
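
As a rough illustration of the first option, the idea of registering (widget class, dataframe type) pairs can be sketched as a small type-dispatch registry. Everything here is hypothetical: WidgetRegistry, FakeFrame, and FakeWidget are stand-ins invented for this sketch, and this is not how buckaroo.widget_utils.enable is implemented.

```python
class WidgetRegistry:
    """Hypothetical registry mapping dataframe types to widget classes."""

    def __init__(self):
        self._pairs = []

    def enable(self, *pairs):
        """Register (widget_klass, dataframe_type) tuples; later ones take priority."""
        for widget_klass, df_type in pairs:
            self._pairs.insert(0, (df_type, widget_klass))

    def widget_for(self, obj):
        """Wrap obj in the first widget whose dataframe type matches, else None."""
        for df_type, widget_klass in self._pairs:
            if isinstance(obj, df_type):
                return widget_klass(obj)
        return None


class FakeFrame:
    """Stand-in for a dataframe type such as pandas.DataFrame."""


class FakeWidget:
    """Stand-in for a customized widget class."""

    def __init__(self, df):
        self.df = df


registry = WidgetRegistry()
registry.enable((FakeWidget, FakeFrame))
widget = registry.widget_for(FakeFrame())
```

Keeping the dataframe type as an explicit argument means a library like polars would only need to be imported by the caller who registers it, which fits the goal of keeping those dependencies optional.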

Why don't you work on some of the customizations available now, and we'll look at these options in future releases.

BTW, if you're up for it, I'd love to talk to you about how you're using Buckaroo. Contact me offline; my info is available in my GitHub profile.

pep-sanwer commented 5 months ago

Thank you so much! I'll definitely follow up with you on this!

paddymul commented 5 months ago

How has this solution been working for you?