Open · orlp opened 1 year ago
I could see a `unique_count` row going right after the `null_count` row perhaps, but I think for any pseudo-continuous dtype like floats or datetimes, this doesn't make too much sense, and it might be confusing for users to see a null in there. Also, is it super useful to have in a `df.describe()`?
> but I think for any pseudo-continuous dtype like floats or datetimes, this doesn't make too much sense, and it might be confusing for users to see a null in there

I don't think we'd put a null there, just a unique count that is likely very close to the total number of rows for those data types.
> Also, is it super useful to have in a `df.describe()`?

I think "how many unique values does it have" is definitely one of the first tidbits of information you'd like to know about a column in almost any dataset. Even for things you'd expect to be continuous it can be a useful data point to see that it indeed has many different values, and a good anomaly detection/eyebrow raising moment when it has a very low number of distinct values.
> and a good anomaly detection/eyebrow raising moment when it has a very low number of distinct values.

For some reason that reminds me of a time I was parsing an internally generated ~500MB CSV (back in the days when I primarily wrote C++) and discovered it was blowing up because the originating system had stuck several million NULL chars in the middle of it; when you find yourself breaking out a hex editor to stare into the depths of a CSV, you know things have gone sideways!
If this can be done efficiently, it sounds like a good addition.
Maybe limit it to just int, str, and cat, or just str and cat? And default to `unique_count=False`? I would suggest adding skewness and kurtosis as well; both already have an `Expr` method.
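For reference, a minimal sketch of how those could be computed with the existing expressions (the column name `x` and the sample values here are made up):

```python
import polars as pl

df = pl.DataFrame({"x": [1.0, 2.0, 2.0, 3.0, 10.0]})

# Expr.skew() and Expr.kurtosis() already exist, so they could be computed
# alongside the other describe() statistics in a single select
print(
    df.select(
        pl.col("x").skew().alias("skew"),
        pl.col("x").kurtosis().alias("kurtosis"),
    )
)
```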
It might be good to prepare some curated sets of statistics and reveal them as `verbosity=0,1,2,...` or similar? Then less-used or costly stats can go under `verbosity>0`, and the user can choose whether to run them interactively and incrementally.
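A rough user-side sketch of that idea; the `describe_v` name and the tier assignments are purely hypothetical, and `insert_column` assumes a fairly recent Polars version:

```python
import polars as pl

# hypothetical mapping from verbosity level to the statistics it unlocks
STAT_TIERS = {
    0: [("count", pl.all().count()), ("null_count", pl.all().null_count())],
    1: [("n_unique", pl.all().n_unique())],
}

def describe_v(df: pl.DataFrame, verbosity: int = 0) -> pl.DataFrame:
    rows = []
    for level, stats in sorted(STAT_TIERS.items()):
        if level > verbosity:
            break
        for name, expr in stats:
            # one single-row frame per statistic, labelled in a leading column
            row = df.select(expr.cast(pl.UInt32))
            rows.append(row.insert_column(0, pl.Series("statistic", [name])))
    return pl.concat(rows)

df = pl.DataFrame({"a": [1, 2, 2, None], "b": ["x", "x", "y", "z"]})
print(describe_v(df, verbosity=1))
```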
I don't think incrementally would work, sadly, as we simply return our result as a dataframe.
Even if it's not in `describe()`, what's the most efficient way to get this?

{c: df[c].n_unique() for c in sorted(df.columns)}

works, but I doubt this is the most efficient solution as it doesn't parallelize across columns.
@cgebbe the problem is that each column probably has different unique counts, so you can't return it as a single frame. You could try listifying:
import polars as pl

df = pl.DataFrame({
    'a': [1, 2, 3, 4],  # 4 unique elements
    'b': [1, 2, 3, 3],  # 3 unique elements
})

# get a dictionary of single-item series, each holding that column's unique values
d = df.select(pl.all().implode().list.unique()).to_dict()
d = {k: v.to_list()[0] for k, v in d.items()}
print(d)
# {'a': [1, 2, 3, 4], 'b': [1, 2, 3]}
Edit: I just realized you wanted `n_unique`. This is much easier:

{k: v[0] for k, v in df.select(pl.all().n_unique()).to_dict().items()}
# {'a': 4, 'b': 3}
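For what it's worth, assuming a Polars version that has `DataFrame.row(..., named=True)`, the same dict can be read off without the comprehension; the per-column `n_unique` computations inside one `select` can also be run in parallel across columns by the engine:

```python
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 3]})

# one select, one result row; named=True returns that row as a plain dict
print(df.select(pl.all().n_unique()).row(0, named=True))
# {'a': 4, 'b': 3}
```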
Would love to see this enhancement added. I think it most intuitively goes after `count`, but after `null_count` is about the same to me.
I think it would be a mistake to only provide this for certain datatypes! Here's a sample use-case:
I am concatenating three time series of daily asset prices, per exchange, from a REST API. I believe I am requesting, cleaning, then concatenating 3 × 365 rows, but my final data is showing 1077 records 🤦. I want to know if the API is providing bad data or if my code is bad (likely 🤣).
My schema is as follows:
┌──────┬─────────────┬──────────┐
│ dt   ┆ daily_price ┆ exchange │
│ ---  ┆ ---         ┆ ---      │
│ Date ┆ f64         ┆ str      │
╞══════╪═════════════╪══════════╡
│ ...  ┆ ...         ┆ ...      │
└──────┴─────────────┴──────────┘
`df.describe()` could show:
┌───────────────┬────────┬─────────────┬──────────┐
│ statistic     ┆ dt     ┆ daily_price ┆ exchange │
│ str           ┆ str    ┆ str         ┆ str      │
╞═══════════════╪════════╪═════════════╪══════════╡
│ count         ┆ "1077" ┆ 1077.0      ┆ "1077"   │
│ uniques_count ┆ "359"  ┆ 1077.0      ┆ "3"      │
│ null_count    ┆ "0"    ┆ 0.0         ┆ "0"      │
│ ...           ┆ ...    ┆ ...         ┆ ...      │
└───────────────┴────────┴─────────────┴──────────┘
And based on this information, it seems like my API is providing me with 359 days of data rather than 365 days of data, for some reason.
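To make that concrete, here is a small sketch of how the unique count surfaces the problem; the column names follow the schema above, the toy data is made up, and `group_by`/`len` assume a recent Polars version:

```python
import polars as pl
from datetime import date

# toy stand-in for the concatenated per-exchange frames described above
df = pl.DataFrame({
    "dt": [date(2024, 1, 1), date(2024, 1, 1), date(2024, 1, 2)],
    "daily_price": [100.0, 101.0, 102.0],
    "exchange": ["A", "B", "A"],
})

# "1077 rows but only 359 unique dates" would be visible at a glance here
print(df.select(pl.col("dt").n_unique()))

# and the dates that don't appear once per exchange can be listed directly
n_exchanges = df["exchange"].n_unique()
print(df.group_by("dt").len().filter(pl.col("len") != n_exchanges))
```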
Problem description
I think the number of unique values per column is (often, not always) a very useful property when quickly inspecting a dataset, so I think it would be a good addition to `df.describe()`.

One downside is that this is potentially quite a bit more expensive, so perhaps for very large dataframes it would give a HyperLogLog sketch of the unique value count rather than the exact value. On the other hand, we also already give 25%, 50% and 75% quantiles, so doing a full sort and then extracting the quantiles and unique count in a linear scan is probably fine anyway.
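On the sketching point: an approximate distinct count is already exposed as `Expr.approx_n_unique` (HyperLogLog-based, as far as I know), so a cheaper variant for very large frames could look roughly like this:

```python
import polars as pl

df = pl.DataFrame({"a": list(range(100_000)), "b": [i % 7 for i in range(100_000)]})

# the exact count hashes every value; the approximate version keeps only a small
# bounded-memory sketch, which is the trade-off that matters for huge frames
print(df.select(pl.all().n_unique()))
print(df.select(pl.all().approx_n_unique()))
```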