ENH: Universal visualization for Data-Science Dataset

walkerdustin commented 2 years ago

Universal visualisation for Data-Science Dataset

For the past few years, I have been learning data science and have worked through many datasets. I recently developed a repository with 25 examples for data exploration, understanding, and machine learning. We have noticed that there is no truly universal way of showing the content of a dataset. Currently, you can use

df.head()
df.info()
df.describe()

df.head() shows the first few rows. This gives some good insight into the dataset, but it only shows a slice of it, and it takes focus and effort to interpret correctly. df.info() shows the column name, Non-Null Count, and Dtype of each column. This is very useful metadata, but it gives no insight into the contents. Also, labelling strings as Dtype "object" is correct, but not very helpful. df.describe() gives very helpful information about continuous values, but by default it does not help with strings and categorical values.
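As a rough illustration of what each of the three calls covers, here is a minimal sketch on a small made-up DataFrame. The column names and values are placeholders chosen for this example, not the dataset discussed in this issue.

```python
import io

import pandas as pd

# Hypothetical toy data, used only to illustrate the three calls.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", None],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# First rows only: a slice of the data.
print(df.head(2))

# Column names, non-null counts and dtypes, captured as text.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)
```

Note that the string column does not appear in the describe() result at all, which is the gap described above.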

When I search for new datasets to learn Data-Science, I want to understand what the data is about and what I may be able to do with it with just one visualisation.

Additional Benefit

When exploring a new dataset, the first steps are always the same: I want to know which features I am working with and which data types they have. For categorical features, I want a list of the categories, df["feature"].unique(). For continuous values, I want to know the range (min, max) and maybe the mean.
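These first steps can be sketched as follows. The DataFrame is a hypothetical two-column example; only the pandas calls themselves (dtypes, unique, min, max, mean) are the point.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one continuous feature.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Which features are there, and which data types do they have?
print(df.dtypes)

# Categorical feature: list the categories in order of first appearance.
categories = df["Contract"].unique()
print(list(categories))

# Continuous feature: range and mean.
charges = df["MonthlyCharges"]
value_range = (charges.min(), charges.max())
mean_charge = charges.mean()
print(value_range, mean_charge)
```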

The Solution: A universal visualisation.

Wouldn't it be amazing if pandas could print a table, with just one function call, that presents all this information in a compact, easy-to-understand format? It would automatically detect categorical and continuous values and provide the most important information, so that you quickly understand what the data is about.

This table could become the default visualisation for efficiently describing a dataset. It can be printed directly in Markdown table format, so that it can be copied straight into the documentation.

API breaking implications

This would be a new function: pandas.DataFrame.feature_description()

Describe alternatives you've considered

The current alternative is that everyone uses a custom format to visualize their dataset.
You can use the functions listed above to get a feel for the dataset. For categorical values, you have to call df["feature"].unique() or df["feature"].value_counts() for every single feature.
Now your information is scattered all over your Python notebook, and you are constantly scrolling between the different cells.
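One way to reduce the scattering with the existing building blocks is to collect the per-feature value_counts() results in a single cell. The sketch below assumes that every non-numeric column is categorical, which is a simplification; the data is made up.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes", "No"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year", "One year"],
    "tenure": [1, 34, 2, 45, 8],
})

# Assumption for this sketch: every non-numeric column is categorical.
categorical_columns = df.select_dtypes(exclude="number").columns

# Gather the counts for all categorical features in one place.
counts = {col: df[col].value_counts() for col in categorical_columns}
for col, col_counts in counts.items():
    print(f"{col}: {col_counts.to_dict()}")
```

This still leaves numeric and categorical summaries in separate outputs, which is what the proposal wants to avoid.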

Sample Implementation

For our ML-Repository (mentioned above), I have implemented a function that partially fulfils these requirements. I would propose an output like this.
The contents of a feature are described with universal symbols and mathematical notation. The format should be universal and not depend on a specific language. Words like "example: " or "Values from -10 to +110.4" are not used, so that the table is interpretable for people regardless of the language they speak.

| Feature         | Data Type |
|-----------------|-----------|
| customerID      |  str     { "5789-LDFXO", ... }   |
| gender          |  str     {"Female", "Male"}   |
| SeniorCitizen   |  int64   |
| Partner         |  str     {"Yes", "No"}   |
| Dependents      |  str     {"No", "Yes"}   |
| tenure          |  int64   |
| PhoneService    |  str     {"No", "Yes"}   |
| MultipleLines   |  str     {"No phone service", "No", "Yes"}   |
| InternetService |  str     {"DSL", "Fiber optic", "No"}   |
| OnlineSecurity  |  str     {"No", "Yes", "No internet service"}   |
| OnlineBackup    |  str     {"Yes", "No", "No internet service"}   |
| DeviceProtection|  str     {"No", "Yes", "No internet service"}   |
| TechSupport     |  str     {"No", "Yes", "No internet service"}   |
| StreamingTV     |  str     {"No", "Yes", "No internet service"}   |
| StreamingMovies |  str     {"No", "Yes", "No internet service"}   |
| Contract        |  str     {"Month-to-month", "One year", "Two year"}   |
| PaperlessBilling|  str     {"Yes", "No"}   |
| PaymentMethod   |  str     {"Electronic check", "Mailed check", "Bank transfer (automatic)", "Credit card (automatic)"}   |
| MonthlyCharges  |  float64 [ 18.25; 118.75 ]   |
| TotalCharges    |  str     { "659.35", ... }   |
| Churn           |  str     {"No", "Yes"}   |

Implementation:

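import numpy as np

def feature_description(data):
    # Pad the first column to the width of the longest column name.
    longest_column_name = max(len(str(col)) for col in data.columns)
    print(f"| {'Feature'.ljust(longest_column_name)}| Data Type |")
    print(f"|{'-' * (longest_column_name + 1)}|-----------|")
    for col in data.columns:
        col_dropna = data[col].dropna()
        unique_values = col_dropna.unique()
        # A random non-null value decides how the column is described.
        # An all-null column has nothing to sample from.
        example = col_dropna.sample(1).values[0] if len(col_dropna) else None
        if example is None:
            description = str(data[col].dtype)
        elif isinstance(example, str):
            description = 'str'.ljust(8)
            if len(unique_values) < 10:
                # Few distinct strings: list all categories in set notation.
                description += '{' + ', '.join(f'"{name}"' for name in unique_values) + '}'
            else:
                # Many distinct strings: show one example only.
                description += '{ "' + example + '", ... }'
        elif isinstance(example, np.int32) and len(unique_values) < 10:
            # Only int32 is treated as categorical here; other integer
            # widths fall through to the generic branch below.
            description = 'int32 {' + ', '.join(f'{name}' for name in sorted(unique_values)) + '}'
        elif isinstance(example, np.float64):
            # Continuous values: show the closed interval [min; max].
            description = f"{'float64'.ljust(8)}[ {col_dropna.min()}; {col_dropna.max()} ]"
        else:
            # Anything else: the NumPy dtype, or the Python type if there is none.
            description = getattr(example, 'dtype', type(example))
        print("| " + str(col).ljust(longest_column_name) + f'|  {description}   |')

feature_description(df)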

Proposed Parameters

categorical_limit=10: maximum number of categories to be displayed in the categorical notation.
max_displayed_chars_in_string=30: maximum number of characters displayed in the example and the categorical notation before the string is shortened with ...
show_NAN_count=False: show the Non-Null Count, as in df.info()
markdown_format=True: display the table in Markdown format
extended=False: show more information, such as value_counts for categorical values and the standard deviation for continuous values
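To make the parameters concrete, here is a minimal sketch of how they might be wired together. The function name describe_features, the helper shorten, the three-column layout, and the tab-separated fallback for markdown_format=False are all assumptions made for this example; nothing here is part of pandas. The extended parameter is left out to keep the sketch short, and cell values containing "|" are not escaped.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def shorten(text, max_chars):
    """Cut text to max_chars characters and mark the cut with '...'."""
    return text if len(text) <= max_chars else text[:max_chars] + "..."


def describe_features(df, categorical_limit=10, max_displayed_chars_in_string=30,
                      show_NAN_count=False, markdown_format=True):
    """Return one summary row per column as a single string (hypothetical helper)."""
    header = ["Feature", "Data Type", "Values"]
    if show_NAN_count:
        header.append("Non-Null Count")
    rows = []
    for col in df.columns:
        values = df[col].dropna()
        uniques = values.unique()
        numeric = is_numeric_dtype(values)
        if values.empty:
            content = ""
        elif len(uniques) < categorical_limit:
            # Few distinct values: list them all in set notation.
            shown = [shorten(str(v), max_displayed_chars_in_string) for v in uniques]
            if not numeric:
                shown = [f'"{text}"' for text in shown]
            content = "{" + ", ".join(shown) + "}"
        elif numeric:
            # Many distinct numbers: show the closed interval [min; max].
            content = f"[ {values.min()}; {values.max()} ]"
        else:
            # Many distinct non-numeric values: show the first one as an example.
            first = shorten(str(values.iloc[0]), max_displayed_chars_in_string)
            content = '{ "' + first + '", ... }'
        row = [str(col), str(df[col].dtype), content]
        if show_NAN_count:
            row.append(str(len(values)))
        rows.append(row)
    if markdown_format:
        lines = ["| " + " | ".join(header) + " |",
                 "|" + "|".join(["---"] * len(header)) + "|"]
        lines += ["| " + " | ".join(row) + " |" for row in rows]
    else:
        # Assumed plain-text fallback: tab-separated columns.
        lines = ["\t".join(header)] + ["\t".join(row) for row in rows]
    return "\n".join(lines)


# Hypothetical toy data; categorical_limit is lowered so both notations appear.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "MonthlyCharges": [18.25, 118.75, 50.0],
})
table = describe_features(df, categorical_limit=3, show_NAN_count=True)
print(table)
```

Returning a string instead of printing directly is a design choice of this sketch: it lets the caller print the table, write it to a README, or test it.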

mroeschke commented 2 years ago

Thanks for the request, but we've gotten similar requests in the past, and generally they have been deemed out of scope. Since analysis can be a very domain- or practice-specific process, pandas just provides more fundamental building blocks like head or describe, so that anyone can build more tailored functions and solutions. I would be -1 on having pandas maintain this functionality.

jreback commented 2 years ago

I believe pandas-profiling does almost exactly this

so -1 as well

mroeschke commented 2 years ago

Since pandas-profiling already fills this need, closing as out of scope for pandas.

walkerdustin commented 2 years ago

Thank you for the hint to pandas-profiling. What I proposed is a bit different, as I want a more compact description. But I totally understand your point. It was a pleasure open sourcing with you.

Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue? Thanks.

mroeschke commented 2 years ago

> Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue?

Yes, very clear description of the request!
