pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unique value count in df.describe() #10270

Open orlp opened 1 year ago

orlp commented 1 year ago

Problem description

I think the number of unique values per column is (often, though not always) a very useful property when quickly inspecting a dataset, so it would be a good addition to df.describe().

One downside is that this is potentially quite a bit more expensive, so perhaps for very large dataframes it could give a HyperLogLog sketch of the unique value count rather than the exact value. On the other hand, we already give the 25%, 50% and 75% quantiles, so doing a full sort and then extracting the quantiles and unique count in a linear scan is probably fine anyway.
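
As a rough sketch of the two options with today's expressions (assuming a Polars version that has Expr.approx_n_unique, which is HyperLogLog-based; this is only an illustration, not the proposed describe() wiring):

import polars as pl

df = pl.DataFrame({
    "x": [1, 2, 2, 3, 3, 3],
    "y": ["a", "b", "b", "c", "c", "c"],
})

# Exact distinct counts; one expression per column, evaluated in parallel.
exact = df.select(pl.all().n_unique())

# Approximate distinct counts via a HyperLogLog sketch; cheaper on very large frames.
approx = df.select(pl.all().approx_n_unique())

print(exact)   # x = 3, y = 3
print(approx)  # close to (here equal to) the exact counts, but not guaranteed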

mcrumiller commented 1 year ago

I could see a unique_count row going right after the null_count row perhaps, but I think for any pseudo-continuous dtype like floats or datetimes, this doesn't make too much sense, and it might be confusing for users to see a null in there. Also, is it super useful to have in a df.describe()?

orlp commented 1 year ago

but I think for any pseudo-continuous dtype like floats or datetimes, this doesn't make too much sense, and it might be confusing for users to see a null in there

I don't think we'd put a null there, just a unique count that is likely very close to the total number of rows for those data types.

Also, is it super useful to have in a df.describe()?

I think "how many unique values does it have" is definitely one of the first tidbits of information you'd like to know about a column in almost any dataset. Even for things you'd expect to be continuous it can be a useful data point to see that it indeed has many different values, and a good anomaly detection/eyebrow raising moment when it has a very low amount of distinct values.

alexander-beedie commented 1 year ago

and a good anomaly-detection/eyebrow-raising moment when it has a very small number of distinct values.

For some reason that reminds me of a time I was parsing an internally generated ~500MB CSV (back in the days when I primarily wrote C++) and discovered it was blowing up because the originating system had stuck several million NULL chars in the middle of it; when you find yourself breaking out a hex editor to stare into the depths of a CSV you know things have gone sideways! 🙄

If this can be done efficiently it sounds like a good addition 👍

stevenlis commented 1 year ago

Maybe limit it to just int, str, and cat (or perhaps just str and cat?), and default to unique_count=False? I would suggest adding skewness and kurtosis as well 😂; both already have an Expr method.
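
For reference, a minimal sketch of what those expressions give today, assuming a recent Polars with the selectors module (cs.numeric() just restricts the computation to numeric columns; the column names here are made up):

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "price": [1.0, 2.0, 2.5, 10.0],
    "volume": [100, 80, 90, 5000],
    "symbol": ["a", "b", "b", "c"],
})

# Skewness and kurtosis already exist as expressions on numeric columns.
stats = df.select(
    cs.numeric().skew().name.suffix("_skew"),
    cs.numeric().kurtosis().name.suffix("_kurtosis"),
)
print(stats)  # one row: price_skew, volume_skew, price_kurtosis, volume_kurtosis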

cjackal commented 1 year ago

It might be good to prepare some curated sets of statistics and expose them via something like verbosity=0,1,2,...? Then less-used or costly stats can go to verbosity>0, and users can choose whether to run them interactively and incrementally.
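
Purely to illustrate the idea, a user-side sketch of what the tiers could feel like; note that verbosity is not a real df.describe() parameter and describe_verbose is just an invented helper (everything is stringified so extra rows of any dtype can be appended to describe()'s output):

import polars as pl

def describe_verbose(df: pl.DataFrame, verbosity: int = 0) -> pl.DataFrame:
    # verbosity=0: exactly what describe() already computes, cast to strings.
    out = df.describe().with_columns(pl.exclude("statistic").cast(pl.String))
    if verbosity >= 1:
        # Cheaper extras first: the distinct count per column.
        extra = df.select(pl.all().n_unique().cast(pl.String)).insert_column(
            0, pl.Series("statistic", ["unique_count"])
        )
        out = pl.concat([out, extra])
    # verbosity >= 2 could append the costlier stats (skew, kurtosis, ...) the same way.
    return out

Something like describe_verbose(df, verbosity=1) would then simply add a unique_count row below the usual output.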

orlp commented 1 year ago

I don't think the incremental part would work, sadly, as we simply return the result as a dataframe.

cgebbe commented 11 months ago

Even if it's not in describe(), what's the most efficient way to get this?

{c: df[c].n_unique() for c in sorted(df.columns)}

works, but I doubt this is the most efficient solution as it doesn't parallelize across columns.

mcrumiller commented 11 months ago

@cgebbe the problem is that each column probably has a different number of unique values, so you can't return them as a single frame. You could try listifying:

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, 4],  # 4 unique elements
    "b": [1, 2, 3, 3],  # 3 unique elements
})

# get a dictionary of single-row Series, each holding that column's list of unique values
d = df.select(pl.all().implode().list.unique()).to_dict()
d = {k: v.to_list()[0] for k, v in d.items()}
print(d)
{'a': [1, 2, 3, 4], 'b': [1, 2, 3]}

Edit: I just realized you wanted n_unique. This is much easier:

{k: v[0] for k, v in df.select(pl.all().n_unique()).to_dict().items()}
{'a': 4, 'b': 3}

mud2monarch commented 1 month ago

Would love to see this enhancement added. I think it most intuitively goes after count, but after null_count is about the same to me.

I think it would be a mistake to only provide this for certain datatypes! Here's a sample use case:

I am concatenating three time series of daily asset prices, one per exchange, from a REST API. I believe I am requesting, cleaning, then concatenating 3 × 365 rows, but my final data is showing 1077 records 🤦. I want to know whether the API is providing bad data or my code is bad (likely 🤣).

My schema is as follows:

┌─────┬─────────────┬──────────┐
│ dt  ┆ daily_price ┆ exchange │
│ --  ┆ ----------- ┆ -------- │
│ Date┆ f64         ┆ str      │
╞═════╪═════════════╪══════════╡
│ ... ┆ ...         ┆ ...      │
└─────┴─────────────┴──────────┘

df.describe() could show:

┌───────────────┬───────────┬─────────────┬───────────┐
│ statistic     ┆ dt        ┆ daily_price ┆ exchange  │
│ str           ┆ str       ┆ str         ┆ str       │
╞═══════════════╪═══════════╪═════════════╪═══════════╡
│ count         ┆ "1077"    ┆ 1077.0      ┆ "1077"    │
│ uniques_count ┆ "359"     ┆ 1077.0      ┆ "3"       │
│ null_count    ┆ "0"       ┆ 0.0         ┆ "0"       │
│ ...           ┆ ...       ┆ ...         ┆ ...       │
└───────────────┴───────────┴─────────────┴───────────┘

And based on this information, it seems my API is providing 359 days of data per exchange rather than 365, for some reason.
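
(Until describe() supports this, a one-off check along these lines works today; the column names are taken from the schema above, and pl.len() assumes a fairly recent Polars:)

import polars as pl

# Assuming `df` is the concatenated frame with the schema shown above.
summary = df.select(
    pl.len().alias("count"),
    pl.col("dt").n_unique().alias("unique_dates"),
    pl.col("exchange").n_unique().alias("unique_exchanges"),
)
print(summary)  # count=1077, unique_dates=359, unique_exchanges=3  ->  3 * 359 = 1077

# Or per exchange, to see whether every feed is short:
print(df.group_by("exchange").agg(pl.col("dt").n_unique().alias("unique_dates")))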