pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.88k stars 18.03k forks

ENH: Describe : add shortest, longest, avg/max/min length #59897

Open simonaubertbd opened 2 months ago

simonaubertbd commented 2 months ago

Feature Type

Problem Description

Hello,

As of now, describe is mainly oriented toward numerical analysis. It is less useful for columns that hold text (string values).
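For reference, a minimal sketch of what describe reports today for string columns. The sample data is illustrative only, and the variable names are assumptions made for this sketch:

```python
import pandas as pd

# Illustrative sample data (an assumption for this sketch)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "country": ["USA", "USA", "USA", "USA"],
})

# With no numeric columns present, describe() falls back to summarizing
# the string columns with count / unique / top / freq.
summary = df.describe()
print(summary)
```

None of these four rows says anything about how long the strings are, which is the gap this request is about.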

Feature Description

Adding five statistics dedicated to string analysis for each relevant column:
- avg length
- max length
- min length
- shortest: one of the strings with the minimum length
- longest: one of the strings with the maximum length

Alternative Solutions

Writing something like the following, but that means more work to do:

    import pandas as pd

    # Sample DataFrame for illustration
    data = {
        'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'city': ['New York', 'Los Angeles', 'San Francisco', 'Chicago'],
        'country': ['USA', 'USA', 'USA', 'USA'],
    }

    df = pd.DataFrame(data)

    # Function to get string statistics
    def string_column_statistics(df):
        stats = {}

        for col in df.select_dtypes(include='object').columns:
            # Length of each string in the column
            string_lengths = df[col].str.len()

            avg_length = string_lengths.mean()
            max_length = string_lengths.max()
            min_length = string_lengths.min()

            # One example string at the maximum and at the minimum length
            # (idxmax/idxmin return the label of the first match)
            max_length_string = df[col].loc[string_lengths.idxmax()]
            min_length_string = df[col].loc[string_lengths.idxmin()]

            stats[col] = {
                'average_length': avg_length,
                'max_length': max_length,
                'min_length': min_length,
                'example_max_length': max_length_string,
                'example_min_length': min_length_string,
            }

        # One column per input column, one row per statistic
        return pd.DataFrame(stats)

    # Call the function
    string_stats_df = string_column_statistics(df)
    print(string_stats_df)
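A shorter workaround, offered only as a sketch: convert every string column to its lengths with `Series.str.len()` and call the existing describe on the result. This gives the numeric length statistics but not the example shortest and longest strings. It assumes every column of the frame holds strings.

```python
import pandas as pd

# Same illustrative data as in the example above
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "San Francisco", "Chicago"],
    "country": ["USA", "USA", "USA", "USA"],
})

# Replace each string by its length, column by column
lengths = df.apply(lambda s: s.str.len())

# The regular numeric describe() now summarizes the string lengths
length_summary = lengths.describe()
print(length_summary)
```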

Additional Context

Best regards,

Simon

rhshadrach commented 1 month ago

Thanks for the request. I'm curious about the use cases of wanting to know the min/max/average length of strings. In the examples you give, I view these as labels for which the length of the strings is not particularly important (e.g. What's in a name?).

cc @WillAyd

simonaubertbd commented 1 month ago

@rhshadrach Yeah, the example wasn't really a use case; you're quite right about that.

Now let's look at a few use cases:
- Financial account numbers (I'll take French accounting; I don't know about other countries) must all have the same length, so the min and max lengths have to be equal.
- French department numbers can be either 2 or 3 characters long. I want to be sure that none has 1 character or more than 3.
- If I have a string field with 10 values, with min 0 and max 9, it is probably worth a look to see whether I can convert it to an integer.
- About names, as in my previous example: there are studies of name length distributions, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.
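As an illustration of the department-number case, here is a minimal sketch of inspecting string lengths by hand with current pandas. The sample codes and variable names are made up for this example:

```python
import pandas as pd

# Hypothetical department codes; "5" is deliberately too short
dept = pd.Series(["01", "75", "971", "2A", "5"], name="department")

# Length of each code, then how often each length occurs
code_lengths = dept.str.len()
length_counts = code_lengths.value_counts().sort_index()
print(length_counts)

# Codes whose length is neither 2 nor 3 characters
unexpected = dept[~code_lengths.isin([2, 3])]
print(unexpected.tolist())
```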

To add some personal context: I'm a long-time Alteryx user, and this is a feature of their data investigation tools. It is very common and very useful, and I was surprised that describe doesn't cover it. Plus, there is a very nice project, Amphi, that aims to be a visual data preparation/ETL tool built on Python, and I would like it to incorporate a data investigation tool. Having it all in describe would definitely help a lot.

Best regards and thanks for your prompt answer to my issue

Simon

rhshadrach commented 1 month ago

In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance. I do not think we should expand on the API of this function for the purposes of data validation.

- If I have a string field with 10 values, with min 0 and max 9, it is probably worth a look to see whether I can convert it to an integer.
- About names, as in my previous example: there are studies of name length distributions, like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441, and if my average length is really different (like 4 or 10), I may have some issues.

These seem quite uncommon uses to me.

I am negative on expanding the API here.

simonaubertbd commented 1 month ago

Hello @rhshadrach "In the first two of your examples, it seems to me you wish to validate data. I think describe is meant to output summary statistics, the idea being that a user can get a sense of what data the DataFrame contains at a glance"

Validating data would be another thing, like checking what happens if field X doesn't follow rule Y. Here, it's more in the spirit of: do I have surprises with this dataset, or is the data quality good? But it can also help for other purposes, like finding the max string length in order to pick the right type when sending the data to a database (a varchar(10) is not the same as a varchar(32)).
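To make the database case concrete, here is a rough sketch of deriving a column width from the longest string. The helper name `suggest_varchar_types` is hypothetical, and the sketch assumes every column holds non-null strings and that the database measures varchar width in characters rather than bytes:

```python
import pandas as pd

def suggest_varchar_types(frame: pd.DataFrame) -> dict:
    """Map each string column to a VARCHAR(n) sized to its longest value.

    Hypothetical helper for illustration; not part of pandas.
    """
    suggestions = {}
    for col in frame.columns:
        # Longest string in the column, as a plain Python int
        longest = int(frame[col].str.len().max())
        suggestions[col] = f"VARCHAR({longest})"
    return suggestions

# Illustrative data
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "Los Angeles", "San Francisco", "Chicago"],
})

varchar_types = suggest_varchar_types(df)
print(varchar_types)
```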

Moreover, the goal of the pandas describe function is:

Generate descriptive statistics.

And when I ask ChatGPT about it, here is the answer:

Why Generate Descriptive Statistics?

1. Understanding Data Distribution: Helps you understand the general shape and behavior of your data.
2. Detecting Outliers: Standard deviation, range, and IQR can help identify extreme values that may require special attention.
3. Summarizing Large Datasets: Allows you to condense complex datasets into understandable summaries, aiding in decision-making and analysis.
4. Data Cleaning: Helps detect potential issues like missing values, anomalies, or inconsistent data patterns.

So, the 4th point is not out of the scope, as you can see.

Best regards,

Simon