walkerdustin closed 2 years ago
Thanks for the request, but we've gotten similar requests in the past and generally they have been deemed out of scope. Since analysis can be a very domain- or practice-specific process, pandas just provides more fundamental building blocks like head or describe for anyone to build more tailored functions and solutions. I would be -1 on having pandas maintain this functionality.
I believe pandas-profiling does almost exactly this, so -1 as well.
Since pandas-profiling already fills this need, closing as out of scope for pandas.
Thank you for the hint to pandas-profiling.
What I proposed is a bit different, as I want a more compact description.
But I totally understand your point.
It was a pleasure open sourcing with you.
Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue? Thanks.
Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue?
Yes, very clear description of the request!
Universal visualisation for Data-Science Dataset
Over the past years, I have been learning Data-Science. I have gone through loads of datasets. I have recently developed a Repository with 25 examples for data exploration, understanding, and machine learning. We have noticed that there is not really a universal way of showing the content of a dataset. Currently, you can use:
df.head()
df.info()
df.describe()
df.head(), showing the first few rows, gives some good insight into the dataset, but it only shows a slice of it, and it takes some brain juice and good focus to interpret correctly.
df.info() shows you the column name, Non-Null Count and Dtype. This is very useful metadata, but it doesn't give you any insight into the contents. Also, labelling strings as Dtype "object" is correct, but not very helpful.
df.describe() gives very helpful information about continuous values, but does not help with strings and categorical values.
When I search for new datasets to learn Data-Science, I want to understand what the data is about and what I may be able to do with it with just one visualisation.
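To make the comparison of the three calls concrete, here is a minimal sketch on a small made-up DataFrame (the column names and values are assumptions for illustration only). It also shows that describe() accepts include="all", which adds count, unique, top and freq for string columns, although the result is still not a compact per-feature overview.

```python
import io

import pandas as pd

# Small made-up dataset with one numeric and two string columns.
df = pd.DataFrame(
    {
        "age": [22, 35, 58, 41],
        "city": ["Berlin", "Paris", "Berlin", "Rome"],
        "smoker": ["no", "yes", "no", "no"],
    }
)

# First rows only: a slice of the data, not a summary.
print(df.head())

# info() prints to stdout by default; buf= captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# By default, describe() summarises only the numeric columns of a mixed frame.
numeric_summary = df.describe()
print(numeric_summary)

# include="all" adds count/unique/top/freq rows for the string columns.
full_summary = df.describe(include="all")
print(full_summary)
```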
Additional Benefit
When exploring a new dataset, the first steps are always the same: I want to know what features I am working with and what datatypes they have. For categorical features, I want a list of the categories, df["feature"].unique(). For continuous values, I want to know the range (min, max) and maybe the mean.
The Solution: A universal visualisation.
Wouldn't it be amazing if pandas could print a table with just one function call that describes all this information in a compact, easy-to-understand format? It would automatically detect categorical and continuous values and provide the most important information to quickly understand what the data is about.
This table could become the default visualisation to efficiently describe a dataset. It can be printed out directly in markdown table format, so that it can be copied directly into the documentation.
API breaking implications
This would be a new function:
pandas.DataFrame.feature_description()
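As a rough illustration of what such a call might produce, here is a minimal sketch written as a free-standing helper rather than a DataFrame method. Only the name feature_description comes from the proposal; the max_categories parameter, the to_markdown_table helper, the output columns and the sample data are all assumptions, and this is not the implementation from the repository mentioned in this issue.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def feature_description(df: pd.DataFrame, max_categories: int = 5) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and a compact summary."""
    rows = []
    for name in df.columns:
        col = df[name]
        if is_numeric_dtype(col) and not is_bool_dtype(col):
            # Continuous feature: closed interval [min, max] plus the mean.
            low, high, mean = float(col.min()), float(col.max()), float(col.mean())
            summary = f"[{low:g}, {high:g}], μ = {mean:g}"
        else:
            # Categorical feature: distinct values in set notation, truncated.
            values = [str(v) for v in col.dropna().unique()]
            shown = ", ".join(values[:max_categories])
            if len(values) > max_categories:
                shown += ", …"
            summary = "{" + shown + "}"
        rows.append(
            {
                "feature": name,
                "dtype": str(col.dtype),
                "missing": int(col.isna().sum()),
                "summary": summary,
            }
        )
    return pd.DataFrame(rows)


def to_markdown_table(table: pd.DataFrame) -> str:
    """Render a small DataFrame as a pipe table without extra dependencies."""
    header = "| " + " | ".join(table.columns) + " |"
    divider = "|" + "|".join(" --- " for _ in table.columns) + "|"
    body = [
        "| " + " | ".join(str(v) for v in row) + " |"
        for row in table.itertuples(index=False)
    ]
    return "\n".join([header, divider, *body])


# Made-up sample data: one continuous and one categorical feature.
df = pd.DataFrame(
    {
        "temperature": [-10.0, 21.5, 110.4],
        "colour": ["red", "green", None],
    }
)
description = feature_description(df)
print(to_markdown_table(description))
```

The numeric/non-numeric split here is a simple dtype check; integer-coded categories would be reported as a range, so a real implementation would likely need a cardinality heuristic or an explicit override.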
Describe alternatives you've considered
The current alternative is that everyone uses a custom format to visualize their dataset.
You can use the functions listed above to get a feel for the dataset. For categorical values, you have to call df["feature"].unique() or df["feature"].value_counts() for every single feature.
Now your information is scattered all over your Python notebook, and you are constantly scrolling around between the different cells.
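For comparison, the current workaround can at least be gathered into one cell by looping over the non-numeric columns. This is only a sketch on made-up data (the column names are assumptions), and the output is still one block per feature rather than a single table.

```python
import pandas as pd

# Made-up sample data with two categorical features and one numeric one.
df = pd.DataFrame(
    {
        "city": ["Berlin", "Paris", "Berlin", "Rome"],
        "smoker": ["no", "yes", "no", "no"],
        "age": [22, 35, 58, 41],
    }
)

# Select the non-numeric columns and count each one's categories in one place.
categorical = df.select_dtypes(exclude="number")
counts = {name: categorical[name].value_counts() for name in categorical.columns}

for name, series in counts.items():
    print(f"--- {name} ---")
    print(series)
```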
Sample Implementation
For our ML-Repository (mentioned above) I have implemented a function, that partially fulfils these requirements. I would propose an output like this.
The contents of a feature are described with universal symbols and mathematical notation. The format should be universal and not depend on a specific language. Words like "example: " or "Values from -10 to +110.4" are not used, so that the table is generally interpretable for people speaking all kinds of languages.
Implementation:
Proposed Parameters