Closed: sorhawell closed this issue 1 year ago.
This NEP is truly a great discussion!!!
He proposes to let NumPy adhere to the Kleene logic system by default. He describes the background and the change of behavior, and outlines an implementation.
Sidenotes: R mostly adheres to Kleene logic; polars does for boolean Series, but not for aggregation functions such as sum and mean. I believe Kleene logic is the exact ruleset behind my rule of thumb of the poisoning hierarchy.
Kleene-Priest logic in a nutshell
"In Kleene logic, the knowledge of whether any particular unknown state secretly represents true or false at any moment in time is not available. However, certain logical operations can yield an unambiguous result, even if they involve an unknown operand. For example, because true OR true equals true, and true OR false also equals true, one can infer that true OR unknown equals true, as well."
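These truth tables can be sketched in plain Python, using None for the unknown state (the representation and the function names `kleene_or`/`kleene_and`/`kleene_not` are my own, not any library API):

```python
# Kleene's strong three-valued logic, with None standing for "unknown".

def kleene_or(a, b):
    # True wins no matter what the other operand is.
    if a is True or b is True:
        return True
    # Otherwise any unknown operand makes the result unknown.
    if a is None or b is None:
        return None
    return False

def kleene_and(a, b):
    # False wins no matter what the other operand is.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def kleene_not(a):
    return None if a is None else not a

# true OR unknown is true, exactly as in the quote above:
assert kleene_or(True, None) is True
assert kleene_and(False, None) is False
assert kleene_or(False, None) is None
```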
Kleene logic for polars would imply that sum should return (->) the following values:
pl.Series([1.0,None]).sum() -> pl.Series([None]).cast(pl.Float64) #returns 1.0 today
pl.Series([1.0,None]).drop_nulls().sum() -> pl.Series([1.0]) #this should be used to ignore missings
pl.Series([1.0,None]).sum(skipna=True) -> pl.Series([1.0]) # or functions should have a skipna argument
# Polars NaN behavior for sum is equivalent to Kleene logic.
pl.Series([1.0,float("NaN")]).sum() -> pl.Series([float("NaN")]) #does that today
pl.Series([1.0,float("NaN"),None]).sum() -> pl.Series([float("NaN")]) #does that today
# A spicier implication of Kleene logic is that "it is unknown if some unknown is equal to some other unknown".
pl.select(pl.lit(None) == pl.lit(None)) -> pl.select(pl.lit(None))
#in polars (paraphrased) Null == Null -> False
#in R NA == NA -> NA #Kleene yes
#in python None == None -> True #Kleene no
I imagine uprooting the (Null == Null -> False) rule could have deep implications for polars. I don't think it is very important to follow Kleen logic here.
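The comparison rule can be sketched the same way; `kleene_eq` is my illustrative name, again with None as the unknown:

```python
def kleene_eq(a, b):
    # In Kleene logic, comparing anything with an unknown is unknown.
    if a is None or b is None:
        return None
    return a == b

assert kleene_eq(None, None) is None   # Kleene: unknown
assert (None == None) is True          # plain Python: True, not Kleene
assert kleene_eq(1, 1) is True
```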
[Aftermath of Mark's "missing data" NEP](https://www.mail-archive.com/numpy-discussion@scipy.org/msg32268.html) The mailing list was very positive about the NEP. However, the likely show-stoppers were:
Sidenote: "Since polars adopted the bit-mask encoding, has made it really performant, and does not yet guarantee backwards compatibility, it seems these show-stoppers should not apply here."
https://www.residentmar.io/2016/06/12/null-and-missing-data-python.html
R has had bit-pattern missing-value encoding from very early on, but this is not the norm. Python does not support it, and neither does standard NumPy. The most consistent way to support missingness is with an external bitmask. Pandas internally uses NaN for missing values. NumPy tried to introduce an external bitmask, but it was slower and not widely adopted.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/NA.html
The norm is that NA poisons, however not always when NaN is involved: "Numerical computations using NA will normally result in NA: a possible exception is where NaN is also involved." NaN behavior is not guaranteed to be consistent across platforms and CPU architectures (see why in What Every Computer Scientist Should Know About Floating-Point Arithmetic).
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html (I did not read it all.)
There are two types of NaN: the quiet (qNaN) and the signalling (sNaN). The signalling sNaN is intended to bubble up as a computation error; overflow errors have sometimes used this one. The qNaN captures no error, just that the result happens mathematically to be NaN. Downstream operations can propagate a qNaN. It seems there have been decades of disagreement on how much NaNs should carry over/poison :) The encodings of qNaN and sNaN were standardized in IEEE 754 (1985), but may vary across platforms.
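Python's float exposes some of this behavior; a small stdlib-only check (the bit-pattern inspection assumes the usual IEEE 754 binary64 layout):

```python
import math
import struct

nan = float("nan")

# NaN poisons arithmetic but compares unequal even to itself (IEEE 754).
assert math.isnan(nan + 1.0)
assert nan != nan

# Inspect the 64-bit pattern: a NaN has an all-ones exponent and a
# non-zero mantissa; the qNaN/sNaN distinction lives in the top
# mantissa bit (quiet if set, on most common platforms).
bits = struct.unpack("<Q", struct.pack("<d", nan))[0]
assert (bits >> 52) & 0x7FF == 0x7FF   # exponent all ones
assert bits & ((1 << 52) - 1) != 0     # non-zero mantissa -> NaN
```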
Polars emulates SQL+Spark functions such as `sum` and `mean` by dropping all `Null` values. Polars does not emulate SQL+Spark for `count()`, because `count()` does not drop any `Null` values. This leads to the following contradiction, where "mean" is not the same as "sum"/"count":
pl.DataFrame(pl.Series([1.0, None])).select([
    pl.all().mean().alias("mean()"),
    (pl.all().sum() / pl.all().len()).alias("sum()/len()"),
    (pl.all().sum() / pl.all().count()).alias("sum()/count()"),
])
shape: (1, 3)
┌────────┬─────────────┬───────────────┐
│ mean() ┆ sum()/len() ┆ sum()/count() │
│ ---    ┆ ---         ┆ ---           │
│ f64    ┆ f64         ┆ f64           │
╞════════╪═════════════╪═══════════════╡
│ 1.0    ┆ 0.5         ┆ 0.5           │
└────────┴─────────────┴───────────────┘
SQL implementations avoid this contradiction by defining "count" as the number of non-Null values. However, this does make the language a bit quirky when you want the length of a [column](https://stackoverflow.com/questions/1271810/counting-null-and-non-null-values-in-a-single-query).
## Suggestion:
Polars aggregation functions should be equivalent to the Spark and SQL variants.
Polars should let .count() exclude `Nulls` and let .len() include `Nulls`,
or alternatively add a defaulted skip-null argument.
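The suggested split can be sketched in plain Python over a list with None as Null (the helper names `sql_count`, `sql_sum`, `sql_mean` are mine, not polars API):

```python
col = [1.0, None]

def sql_count(xs):
    # SQL COUNT(col): number of non-Null values.
    return sum(1 for x in xs if x is not None)

def sql_sum(xs):
    vals = [x for x in xs if x is not None]
    return sum(vals) if vals else None

def sql_mean(xs):
    n = sql_count(xs)
    return sql_sum(xs) / n if n else None

# With count() excluding Nulls, mean == sum/count holds again:
assert sql_mean(col) == sql_sum(col) / sql_count(col) == 1.0
# while len() still reports the full column length, Nulls included:
assert len(col) == 2
```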
> polars mostly emulates SQL+Spark
That's something I understand! :)
I fully agree on this one and will see if we can add a `len` expression and make `count` ignore the missing values.
TL;DR + Conclusion: Polars relies on SQL relational algebra for optimization. SQL, and therefore also Spark and Polars, has some logical contradictions with `Null`, and smart people have argued about it for the last 4 decades. It is not obvious that any easy resolution is available. It is practical that Polars behaves quite similarly to Spark and SQL, so I suggest no major changes.
According to [Grant], the creators of SQL introduced the tri-valued truths True, False, Unknown as an afterthought in the 1970s. The boolean truth tables match Kleene logic (see the mention above), and both SQL and polars implement this for boolean operators.
In Kleene logic, `Null` means missing data for now, which could be resolved in the future. However, this is not the interpretation applied in SQL. There, `Null` means inapplicable, as in not there. That explains why the SQL mean function just drops the `Null`s: they never really were anything. This interpretation is not the same as Kleene logic. In Kleene logic, missing values should not be omitted from computations: a missing value which leads to ambiguity must propagate (poison), and if there is no ambiguity, the missing value disappears. 'No ambiguity' means that no matter what concrete value a missing could resolve to in the future, the result is unchanged. So 2 + Null -> Null, median(1,1,1,1,Null) -> 1, True Or Null -> True.
There have been many debates [Grant, Rubinson, Tessaris] about whether this leaves SQL logically sound. Probably not completely, but practically close with the right mindset. There are good suggestions for how to deal with SQL/Polars no matter which interpretation of missingness you apply and how to think of it [Rubinson].
With a few exceptions in sorting, it appears to me that R implements Kleene logic, and the above NEP by Mark Wiebe was a suggestion in the same spirit.
I have not yet found any Kleene-consistent relational calculus/implementation that has optimizations. Maybe that is because it is more complicated to resolve.
[Grant: https://dl.acm.org/doi/abs/10.1145/1462571.1462575] [Rubinson: https://journals.sagepub.com/doi/10.1177/0894439314521988] [Tessaris et al.: https://arxiv.org/pdf/2202.10898.pdf]
Problem description
Motivation and aim:
drop_nulls(), drop_nans(), drop_infinite(), keep_finite(),
or otherwise any poisoning function such as Expr.mean(),
could have defaulted arguments to modify the behavior.

Methods and Experiments:
Poisoning
To test the current poisoning, I have created 11 columns as different mixes of (fin) finite, (inf) infinite, (nan) NaN, and (nul) Null. Every column is named by its content, and the prefix of the name is the expected winner of the poisoning battle! The values battle in 13 different functions, hence 13 result columns.
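For the float part of the hierarchy there is a plain-IEEE baseline requiring no library: NaN beats Inf, Inf beats finite, and opposing infinities cancel into NaN (Null has no counterpart in bare floats):

```python
import math

inf, nan = float("inf"), float("nan")

assert 1.0 + inf == inf            # Inf beats finite
assert math.isnan(1.0 + nan)       # NaN beats finite
assert math.isnan(inf + nan)       # NaN beats Inf
assert math.isnan(inf + (-inf))    # opposing infinities yield NaN
assert math.isnan(sum([1.0, inf, -inf]))
```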
Sorting
To test the current sorting, I have created a column of (Null, NaN, -inf, inf, 1.0) and its reverse. The two columns are sorted by 4 functions, hence 8 result columns.
Code
The following code was executed in polars '0.14.23'
Results + Discussion:
Poisoning:
There is no general trend of the `Null > NaN > Inf > finite` hierarchy (PH). If there were, every column should have the same value in all rows. I will now discuss pros and cons of deviations from PH and of inconsistency across functions.

sum(), mean(), std(), var():
pro: an expression like (Inf + -Inf) should yield NaN, and no missing value (Null) could change that result. So the deviation from PH is largely logically consistent.
con: "nul_fin" is surprising; I am not sure how missing + value can become a value.
product():
cumsum()
cumprod
min() max() arg_min().take() arg_max.take()
cummin() cummax()
Sorting:
sort() take(arg_sort())
pro: Has a consistent ranking where Null is less than -Inf and NaN is more than Inf. Reversed input gives the same output.
con: if the argument reverse=True, then Null becomes a value above NaN, except if nulls_last=True (this behavior is not shown in this test). Sort thereby hints that Null is a non-sortable value.
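The consistent ranking from the pro above can be sketched as a sort key in plain Python (`null_nan_key` is my name, None stands for Null):

```python
import math

def null_nan_key(x):
    # Total order: Null < -Inf <= finite <= Inf < NaN
    if x is None:
        return (0, 0.0)
    if math.isnan(x):
        return (2, 0.0)
    return (1, x)

data = [None, float("nan"), float("-inf"), float("inf"), 1.0]
ordered = sorted(data, key=null_nan_key)
assert ordered[0] is None                 # Null first
assert ordered[1] == float("-inf")        # then -Inf
assert math.isnan(ordered[-1])            # NaN last
# Reversed input gives the same ranking:
assert sorted(data[::-1], key=null_nan_key)[0] is None
```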
top_k()
pro: -
con: NaN and Null are swapped with respect to sort() and arg_sort(). Cannot change whether NaNs or Nulls should be first or last.
take(rank("max")-1)
pro: -
con: Different results for reversed input. Strange results.
Conclusion
There could be some individually good reasons why PH, the poisoning hierarchy, is not a perfect rule. However, I personally fail to justify all deviations. Related functions can have quite different behavior: arg_min does not necessarily point to the min value, which I find confusing. I would define min() as the value arg_min() points to, and I would define arg_min() to point to the first value if the column was sorted ascending; likewise for max(), arg_max() and sorting descending. top_k() and rank() seem to have their own implementations with their own sorting/ranking. top_k() is not invariant with respect to mutation of the input column! The overall behavior landscape today is quite diverse. If the conclusion would be "keep calm, it is all according to math", then some behavior tables like this could be useful in the docs and user guides.
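The definitions proposed in the conclusion can be sketched in plain Python (`sort_key`, `arg_min`, `consistent_min` are my illustrative names), reusing a total order that puts Null first:

```python
import math

def sort_key(x):
    # Assumed total order: Null < -Inf <= finite <= Inf < NaN
    if x is None:
        return (0, 0.0)
    if math.isnan(x):
        return (2, 0.0)
    return (1, x)

def arg_min(xs):
    # arg_min() points to the first value of the ascending sort.
    return min(range(len(xs)), key=lambda i: sort_key(xs[i]))

def consistent_min(xs):
    # min() is, by definition, the value arg_min() points to.
    return xs[arg_min(xs)]

data = [1.0, float("nan"), None, float("-inf")]
assert arg_min(data) == 2            # the Null sorts first under this order
assert consistent_min(data) is None  # so min() and arg_min() agree on it
```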