pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Implement dplyr::glimpse() in pandas #51668

Open Holer90 opened 1 year ago

Holer90 commented 1 year ago

Feature Type

Problem Description

Pandas is missing a quick and easy way to get an overview of multi-column data. Fortunately, the R community has found a solution: dplyr::glimpse() (see the dplyr documentation).

Example:

>>> iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
>>> iris.glimpse()
DataFrame with 150 rows and 5 columns.
sepal_length  <float64>  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5 ...
sepal_width   <float64>  3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3 ...
petal_length  <float64>  1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1 ...
petal_width   <float64>  0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0 ...
species       <object>   'setosa', 'setosa', 'setosa', 'setosa', 'setosa', ' ...
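
For anyone who wants to try the idea before a decision is made, the basic output above can be approximated in a few lines of pandas. This is only a minimal sketch and not the code from the proposed implementation: the name `glimpse_sketch`, its `width` and `buf` parameters, and the trimming rule (cut the value list and append ` ...`) are assumptions modelled on the example output.

```python
import io
import sys

import pandas as pd


def glimpse_sketch(df, width=80, buf=None):
    """Write a one-line-per-column overview of df (hypothetical helper)."""
    buf = sys.stdout if buf is None else buf
    buf.write(f"DataFrame with {len(df)} rows and {len(df.columns)} columns.\n")
    if len(df.columns) == 0:
        return
    names = [str(c) for c in df.columns]
    dtypes = [f"<{t}>" for t in df.dtypes]
    name_w = max(len(n) for n in names)
    dtype_w = max(len(t) for t in dtypes)
    for i, (name, dtype) in enumerate(zip(names, dtypes)):
        prefix = f"{name:<{name_w}}  {dtype:<{dtype_w}}  "
        room = max(width - len(prefix), 0)
        # room + 1 values always overflow the line, so no more are needed
        head = df.iloc[:, i].head(room + 1).tolist()
        values = ", ".join(repr(v) for v in head)
        if len(values) > room:
            # Cut mid-value and mark the truncation, as in the example output
            values = values[: max(room - 4, 0)] + " ..."
        buf.write(prefix + values + "\n")


# Usage: capture the output in a buffer instead of printing it
demo = pd.DataFrame({"x": [1.5, 2.5, 3.5], "n": [1, 2, 3]})
out = io.StringIO()
glimpse_sketch(demo, buf=out)
print(out.getvalue())
```

Columns are accessed by position (`iloc`) so that duplicate column names do not break the loop.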

Feature Description

I have implemented the glimpse() function based on the info() function for both DataFrame and Series. I have also slightly extended the functionality to include the following options:

Parameters
----------
index : bool, optional
    Whether to print the column indices.
dtype : bool, optional
    Whether to print the dtypes of the columns.
isna : bool, optional
    Whether to print the null counts of the columns.
notna : bool, optional
    Whether to print the non-null counts of the columns.
nunique : bool, optional
    Whether to print the number of unique values.
unique_values : bool, optional
    Whether to print a glimpse of the unique values instead of the first values.
verbose : bool, optional
    Whether to print the headers and count descriptions. By default,
    this is False if only dtype is enabled and True otherwise.
emphasize : bool, optional
    Whether to emphasize the optional information columns. By
    default, it is enabled if verbose is False.
buf : writable buffer, defaults to sys.stdout
    Where to send the output. By default, the output is printed to
    sys.stdout. Pass a writable buffer if you need to further
    process the output.
width : int, optional
    The width at which the output is trimmed. By default, the width
    is determined by the pandas display.width option.   

An example of the extended functionality:

>>> iris.glimpse(unique_values=True, isna=True, notna=True, width=100)
DataFrame with 150 rows and 5 columns.
Column        Dtype    Null    Non-null      Unique values                                          
------        -----    ----    --------      -------------                                          
sepal_length  float64  0 null  150 non-null  5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.4, 4.8, 4.3, 5.8, 5 ...
sepal_width   float64  0 null  150 non-null  3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 2.9, 3.7, 4.0, 4 ...
petal_length  float64  0 null  150 non-null  1.4, 1.3, 1.5, 1.7, 1.6, 1.1, 1.2, 1.0, 1.9, 4.7, 4 ...
petal_width   float64  0 null  150 non-null  0.2, 0.4, 0.3, 0.1, 0.5, 0.6, 1.4, 1.5, 1.3, 1.6, 1 ...
species       object   0 null  150 non-null  'setosa', 'versicolor', 'virginica'                    
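
The per-column figures in the extended output (null count, non-null count, unique values) can all be computed with existing pandas methods. The following is a rough sketch of that bookkeeping only, without the aligned text layout; the helper name `column_summary`, the `max_unique` parameter, and the choice to drop missing values before listing unique values are assumptions, not part of the proposal.

```python
import pandas as pd


def column_summary(df, max_unique=3):
    """Collect per-column counts into a small DataFrame (hypothetical helper)."""
    rows = []
    for i, name in enumerate(df.columns):
        col = df.iloc[:, i]
        rows.append(
            {
                "column": name,
                "dtype": str(col.dtype),
                "null": int(col.isna().sum()),
                "non_null": int(col.notna().sum()),
                # nunique() ignores missing values by default
                "nunique": int(col.nunique()),
                # First few distinct non-missing values, in order of appearance
                "unique_values": col.dropna().drop_duplicates().head(max_unique).tolist(),
            }
        )
    return pd.DataFrame(rows)


demo = pd.DataFrame({"a": [1.0, None, 1.0, 2.0], "b": [1, 2, 2, 3]})
print(column_summary(demo))
```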

Alternative Solutions

The functionality could be implemented in a separate package and monkey-patched into pandas, but this solution would not make the function easily accessible to the vast majority of people using pandas.
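
To illustrate what the separate-package route might look like: rather than literal monkey-patching, a third-party package could use the accessor registration API that pandas provides for extensions. This is a sketch under assumptions; the class `GlimpseAccessor`, the `n` parameter, and the decision to return a string instead of printing are illustrative, and users would still have to import the package before `df.glimpse()` exists.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("glimpse")
class GlimpseAccessor:
    """Hypothetical accessor; defining __call__ makes df.glimpse() callable."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def __call__(self, n=5):
        df = self._obj
        lines = []
        for i, name in enumerate(df.columns):
            col = df.iloc[:, i]
            # Show the first n values of each column on a single line
            values = ", ".join(repr(v) for v in col.head(n).tolist())
            lines.append(f"{name}  <{col.dtype}>  {values}")
        return "\n".join(lines)


demo = pd.DataFrame({"x": [1, 2, 3]})
print(demo.glimpse(n=2))
```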

Additional Context

I will provide a pull request implementing this functionality shortly.

In siuba, which is a dplyr implementation in Python, there is an open issue requesting support for a glimpse function, which shows the desire for this functionality in the Python/pandas community.

Edit: The glimpse function is also implemented in polars, which also highlights the desire for this functionality.

Holer90 commented 1 year ago

take

phofl commented 1 year ago

Hi, thanks for your report. Please wait for consensus before submitting a PR.

Holer90 commented 1 year ago

> Hi, thanks for your report. Please wait for consensus before submitting a PR.

Will do.

For reference, the code is (mostly) available in pandas/io/formats/glimpse.py in my fork, in case it is useful while consensus is being reached.

pourmoayed commented 1 year ago

I think this would be a great feature for pandas. In R, it gives a helpful summary overview of data frames, and it makes sense to have a similar feature for pandas.

chriscardillo commented 1 year ago

Plus one.

Originally opened the issue in the siuba repo. Would be great to see this added here.

phofl commented 1 year ago

Just a general comment: It's not only about the feature, we have to be comfortable maintaining it as well (long-term speaking)

Holer90 commented 1 year ago

> Just a general comment: It's not only about the feature, we have to be comfortable maintaining it as well (long-term speaking)

Fully understand. Regarding this, it has been designed with an architecture that is 1-to-1 with the info() function, which should make it easier to both maintain and understand.

cheTesta commented 1 year ago

Isn't this the same as doing df.T, or in full df.transpose()?

Holer90 commented 1 year ago

> Isn't this the same as doing df.T.head().T # or df.transpose.head()?

Wouldn't that print only the first 5 columns? Also, wouldn't this print/show all the data?
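
A quick check of what the transpose round trip does, using made-up data (the frames `wide` and `mixed` are purely illustrative): `head()` on the transposed frame keeps the first 5 rows, which are the first 5 original columns, and every row of data is still shown.

```python
import pandas as pd

# 10 rows, 7 columns of toy data
wide = pd.DataFrame({f"c{i}": list(range(10)) for i in range(7)})

trimmed = wide.T.head().T
print(trimmed.shape)  # (10, 5): all rows kept, only the first 5 columns

# With mixed column types, the round trip also turns columns into object dtype
mixed = pd.DataFrame({"num": [1, 2], "txt": ["a", "b"]})
print(mixed.T.T.dtypes)
```

So the result is neither a one-line-per-column overview nor a faithful report of the original dtypes.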

Holer90 commented 1 year ago

@phofl, has any discussion happened regarding this feature?

JustinKurland commented 5 months ago

Late to the show here, @Holer90, but I have written a .glimpse function in the pytimetk package that does this, just as dplyr does. The issue with the polars implementation of .glimpse() is that if you convert your pandas.DataFrame into a polars.DataFrame, the dtypes are not like for like.