pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Ruleset and consistency for poisoning and sorting behavior of float values #5359

Closed sorhawell closed 1 year ago

sorhawell commented 1 year ago

Problem description

Motivation and aim:

Methods and Experiments:

Poisoning

To test the current poisoning behavior, I have created 11 columns as different mixes of (fin) finite, (inf) infinite, (nan) NaN, and (nul) Null values. Every column is named by its content, and the prefix name is the expected winner of the poisoning battle. The values battle in 13 different functions, hence 13 result columns.

Sorting

To test current sorting, I have created a column of (Null, NaN, -inf, inf, 1.0) and its reverse. The two columns will be sorted by 4 functions, hence 8 result columns.

Code

The following code was executed with polars 0.14.23:

import polars as pl 
from functools import reduce

# Test columns for poisoning
# Define columns as different mixes of (fin)finite, (inf)infinite, (nan)NaN, (nul)Null
# As a rule of thumb, the prefix value should win the poison battle!
# There are reasonable exceptions for some functions.
tst_cols = pl.DataFrame({
  "fin" : [1.0]*7,
  "inf" : [float("-inf")] *4 + [float("inf")] *3,
  "nan" : [float("nan")] *7 ,
  "nul" : [None] *7 ,
  "inf_fin" : [float("-inf")] + [1.0]*5 + [float("inf")],
  "nan_fin" : [float("nan")] + [1.0]*5 + [float("nan")],
  "nul_fin" : [None] + [1.0]*5 + [None],
  "nan_inf_fin" : [float("nan")] +[float("-inf")] + [1.0]*3 + [float("inf")] + [float("nan")],
  "nul_nan_fin" : [None] +[float("nan")] + [1.0]*3 + [float("nan")] + [None],
  "nul_inf_fin" : [None] +[float("-inf")] + [1.0]*3 + [float("inf")] + [None],
  "nul_nan_inf_fin" : [None] +[float("nan")] +[float("-inf")] +
    [1.0]*1 + [float("inf")] +[float("nan")] + [None],
})

# Test columns for sorting
tst_cols3 = pl.DataFrame({
  "floats" : [None] +[float("nan")] +[float("-inf")]+ [float("inf")] + [1.0],
}).with_columns([
  pl.all().reverse().alias("REV") # add reversed column; sorting should produce the same result
])

# Test poisoning behaviour on the most relevant functions. rolling_* functions could also be included
test_rank_poison = [  
  #scalar:  poison
  tst_cols.select([pl.lit("sum"),pl.all().sum(),]),
  tst_cols.select([pl.lit("mean"),pl.all().mean(),]),
  tst_cols.select([pl.lit("product"),pl.all().product(),]),
  tst_cols.select([pl.lit("std"),pl.all().std(),]),
  tst_cols.select([pl.lit("var"),pl.all().var(),]),

  #cumulative: poison
  tst_cols.select([pl.lit("cumsum"),pl.all().cumsum().tail(1),]),
  tst_cols.select([pl.lit("cumprod"),pl.all().cummax().tail(1),]),

  #scalar: ranking + poison
  tst_cols.select([pl.lit("min"),pl.all().min(),]),
  tst_cols.select([pl.lit("max"),pl.all().max(),]),
  tst_cols.select([pl.lit("arg_min_take"),pl.all().take(pl.all().arg_min()),]),
  tst_cols.select([pl.lit("arg_max_take"),pl.all().take(pl.all().arg_max()),]),

  #cumulative: ranking + poison
  tst_cols.select([pl.lit("cummin"),pl.all().cummin().tail(1),]),
  tst_cols.select([pl.lit("cummax"),pl.all().cummax().tail(1),]),

]

# Test sorting behaviour on relevant functions
test_sort = [  
  #sorting functions
  tst_cols3.select([pl.all().sort().suffix("_sort"),]),
  tst_cols3.select([pl.all().take(pl.all().arg_sort()).suffix("_take_arg_sort"),]),
  #tst_cols3.select([pl.all().sort_by(pl.all()).tail(4).suffix("_sort_by"),]),
  tst_cols3.select([pl.all().top_k(5,reverse=True).suffix("_top_k_rev"),]),
  #only works when all values are unique
  tst_cols3.select([pl.all().take(pl.all().rank("max")-1).suffix("_take_rank"),]), 

]

#support for stacking
def vstack_f(df1,df2) :
  return df1.vstack(df2)
def hstack_f(df1,df2) :
  return df1.hstack(df2)

#tweak pretty print
pl.Config.set_tbl_cols(tst_cols.width+1).set_tbl_rows(len(test_rank_poison)).set_fmt_str_lengths(15)

#print results
print(reduce(vstack_f,test_rank_poison))
pl.Config.set_fmt_str_lengths(50)
print(reduce(hstack_f,test_sort).with_row_count())

Results + Discussion:

Poisoning:

[image: table of poisoning results per function and column]

There is no general adherence to the Null > NaN > Inf > finite hierarchy (PH). If there were, every column would have the same value in all rows. I will now discuss pros and cons of deviations from PH and of inconsistencies across functions.

sum(), mean(), std(), var():

Conclusion

There could be individually good reasons why PH, the poisoning hierarchy, is not a perfect rule. However, I personally fail to justify all deviations. Related functions can have quite different behavior: arg_min does not necessarily point to the min value, which I find confusing. I would define min() as the value arg_min() points to, and I would define arg_min() to point to the first value if the column were sorted ascending; likewise for max() and arg_max() with a descending sort. top_k() and rank() seem to have their own implementations with their own sorting/ranking. top_k() is not invariant with respect to the ordering of the input column! The overall behavior landscape today is quite diverse. If the conclusion is "keep calm, it is all according to math", then some behavior tables like this could be useful in the docs and user guide.
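
To make the arg_min/min mismatch concrete, here is a minimal probe (a sketch, assuming the 0.14-era expression API used above; it prints rather than asserts, since behavior differs across versions):

import polars as pl

# Probe (illustrative): do min() and the value arg_min() points to agree
# when NaN is present?
df = pl.DataFrame({"x": [float("nan"), 1.0, float("inf")]})
print(df.select([
    pl.col("x").min().alias("min"),
    pl.col("x").take(pl.col("x").arg_min()).alias("take_arg_min"),
]))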

sorhawell commented 1 year ago

Literature search:

The NumPy NEP from 2011 proposes to implement missing values.

Mark Wiebe's NEP aims to unify bit-pattern and bit-masking encodings for missing values in NumPy and to make NumPy more logically sound.

This NEP is truly a great discussion!!!

He proposes to let NumPy adhere to Kleene's logic system by default. It describes the background and the change of behavior, and outlines an implementation.

Sidenotes: R adheres mostly to Kleene logic; polars does for boolean Series, but not for aggregation functions such as sum and mean. I believe Kleene logic is the exact ruleset behind my rule of thumb of the poisoning hierarchy.

Kleene and Priest logics in a nutshell

"In Kleen logic, the knowledge of whether any particular unknown state secretly represents true or false at any moment in time is not available. However, certain logical operations can yield an unambiguous result, even if they involve an unknown operand. For example, because true OR true equals true, and true OR false also equals true, one can infer that true OR unknown equals true, as well."

[image: Kleene logic truth tables]
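
As the sidenote above mentions, polars already applies Kleene logic to boolean Series. A minimal sketch of that behavior (printed rather than asserted, as it may vary by version):

import polars as pl

# Kleene OR with a missing operand:
# True | unknown is True (unambiguous), False | unknown stays unknown.
a = pl.Series("a", [True, False, None])
b = pl.Series("b", [None, None, None], dtype=pl.Boolean)
print(a | b)  # expected under Kleene logic: [true, null, null]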

Kleene logic for polars would imply that sum should return (->) the following values:

pl.Series([1.0,None]).sum() -> pl.Series([None]).cast(pl.Float64) #returns 1.0 today
pl.Series([1.0,None]).drop_nulls().sum() -> pl.Series([1.0]) #this should be used to ignore missings
pl.Series([1.0,None]).sum(skipna=True) -> pl.Series([1.0]) # or functions should have a skipna argument

# Polars NaN behavior for sum is equivalent with Kleene logic.
pl.Series([1.0,float("nan")]).sum() -> pl.Series([float("nan")]) #does that today
pl.Series([1.0,float("nan"),None]).sum() -> pl.Series([float("nan")]) #does that today

# A more spicy implication of Kleene logic is that "it is unknown whether some unknown is equal to some other unknown".
pl.select(pl.lit(None) == pl.lit(None)) -> pl.select(pl.lit(None))
#in polars (paraphrased) Null == Null -> False
#in R NA == NA -> NA #Kleene: yes
#in python None == None -> True #not Kleene

I imagine uprooting the (Null == Null -> False) rule could have deep implications for polars. I don't think it is very important to follow Kleene logic here.

[Aftermath of Mark's "missing data" NEP](https://www.mail-archive.com/numpy-discussion@scipy.org/msg32268.html): the mailing list was very positive about the NEP. However, the likely show-stoppers were:

Sidenote: "Since polars adopted the bit-mask encoding and have made really performant and do not yet gurantee backwards compatability it seems this showstoppers should not apply here"

other readings

a discussion of the history of NA encoding in Python, NumPy, pandas and R, with citations from the pandas and NumPy core teams:

https://www.residentmar.io/2016/06/12/null-and-missing-data-python.html

synopsis:

R has had bit-pattern missing-value encoding from very early on, but this is not the norm. Python does not support this, and neither does standard NumPy. The most consistent way to support missingness is with an external bit-mask. Pandas internally uses NaN for missing values. NumPy tried to introduce an external bit-mask, but it was slower and not widely adopted.

R docs on NA (polars Null) and NaN

https://stat.ethz.ch/R-manual/R-devel/library/base/html/NA.html

synopsis:

The norm is that NA poisons, however not always when NaN is also there: "Numerical computations using NA will normally result in NA: a possible exception is where NaN is also involved." NaN behavior is not guaranteed to be consistent across platforms and CPU architectures (see why in "What Every Computer Scientist Should Know About Floating-Point Arithmetic").

"What Every Computer Scientist Should Know about Floating-Point Arithmetic"

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html (I did not read it all.)

synopsis:

There are two types of NaNs: the quiet (qNaN) and the signalling (sNaN). The signalling sNaN is intended to bubble up as a computation error; overflow errors have sometimes used this one. The qNaN captures no error, just that the result happens to mathematically be a NaN. Downstream operations simply propagate a qNaN. It seems there have been decades of disagreement on how much NaNs should carry over/poison :) The encodings of qNaN and sNaN have been standardized in IEEE 754, but may vary across platforms.
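
A quick illustration of quiet-NaN propagation in plain Python (Python arithmetic produces quiet NaNs):

import math

x = float("nan")      # a quiet NaN; no exception is raised
print(x + 1.0)        # nan: the qNaN propagates ("poisons") downstream results
print(x == x)         # False: NaN compares unequal even to itself (IEEE 754)
print(math.isnan(x))  # True: the reliable way to test for NaN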

sorhawell commented 1 year ago

polars mostly emulates SQL+Spark

Polars does emulate SQL+Spark functions such as sum and mean by dropping all Null values. Polars does not emulate SQL+Spark for count(), because it does not drop any Null values.

This leads to the following contradiction, where "mean" is not the same as "sum"/"count":

pl.DataFrame(pl.Series([1.0, None])).select([
    pl.all().mean().alias("mean()"),
    (pl.all().sum() / pl.all().count()).alias("sum()/count()"),
    (pl.all().sum() / pl.all().len()).alias("sum()/len()"),
])

shape: (1, 3)
┌────────┬───────────────┬─────────────┐
│ mean() ┆ sum()/count() ┆ sum()/len() │
│ ---    ┆ ---           ┆ ---         │
│ f64    ┆ f64           ┆ f64         │
╞════════╪═══════════════╪═════════════╡
│ 1.0    ┆ 0.5           ┆ 0.5         │
└────────┴───────────────┴─────────────┘
SQL implementations avoid this contradiction by defining "count" as the number of non-Null values. However, this does make the language a bit quirky when getting the length of a [column](https://stackoverflow.com/questions/1271810/counting-null-and-non-null-values-in-a-single-query).

## Suggestion:
Polars aggregation functions should be equivalent with their Spark and SQL variants.
Polars should let `.count()` exclude `Nulls` and let `.len()` include `Nulls`.

Alternatively, a defaulted skipnull argument.
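
Until then, a workaround sketch (assuming the 0.14-era behavior above, where count() includes Nulls): build the SQL-style non-null count explicitly, so the ratio matches mean().

import polars as pl

df = pl.DataFrame({"x": [1.0, None]})
print(df.select([
    pl.col("x").mean().alias("mean()"),
    # summing is_not_null() booleans gives the number of non-Null values
    (pl.col("x").sum() / pl.col("x").is_not_null().sum()).alias("sum()/non_null"),
]))
# mean() and sum()/non_null should both be 1.0
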
ritchie46 commented 1 year ago

> ## Suggestion:
> Polars aggregation functions should be equivalent with their Spark and SQL variants.
> Polars should let `.count()` exclude `Nulls` and let `.len()` include `Nulls`.
>
> Alternatively, a defaulted skipnull argument.

That's something I understand! :)

I fully agree on this one and will see if we can add a len expression and make count ignore the missing values.

sorhawell commented 1 year ago

Polars, SQL and the contradictions in their logical systems.

TLDR + Conclusion: Polars relies on SQL relational algebra for optimization. SQL, and therefore also Spark and Polars, has some logical contradictions with Null, and smart people have argued about them for the last 4 decades. It is not obvious that any easy resolution is available. It is practical that Polars behaves quite similarly to Spark and SQL, so I suggest no major changes.

According to [Grant], the creators of SQL introduced the tri-valued truths True, False, Unknown as an afterthought in the 1970s. The boolean truth tables match Kleene logic (see the mention above), and both SQL and polars implement this for boolean operators. In Kleene logic, the Null means missing data for now, which could be resolved in the future. However, this is not the interpretation applied in SQL. Here the Null means inapplicable, as in not there. That explains why the SQL mean function just drops the Nulls, as they never really were anything. This interpretation is not the same as Kleene logic. In Kleene logic, missing values should not be omitted from computations: a missing value which leads to ambiguity must propagate (poison), and if there is no ambiguity, the missing value disappears. 'No ambiguity' means that no matter what concrete value a missing could resolve to in the future, the result is unchanged. So 2 + Null -> Null, median(1,1,1,1,Null) -> 1, True OR Null -> True.
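
A tiny sketch of that ambiguity rule (illustrative pseudologic in Python, not a polars API): an unknown propagates only when different resolutions of it would change the result.

# Kleene-style OR: poison only when the result is ambiguous.
def kleene_or(a, b):
    # True OR anything is True, no matter what the unknown resolves to
    if a is True or b is True:
        return True
    # otherwise an unknown operand leaves the result ambiguous
    if a is None or b is None:
        return None
    return a or b

print(kleene_or(True, None))   # True  (unambiguous)
print(kleene_or(False, None))  # None  (could resolve either way)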

There have been many debates [Grant, Rubinson, Tessaris] about whether this leaves SQL logically sound. Probably not completely, but practically close with the right mindset. There are good suggestions of how to deal with SQL/Polars no matter the interpretation of missingness, and of how to think about it [Rubinson].

With few exceptions in sorting, it appears to me that R implements Kleene logic, and the above NEP by Mark Wiebe was a suggestion in the same spirit.

I have not yet found any Kleene-consistent relational calculus/implementation that has optimizations. Maybe that is because it is more complicated to resolve.

[Grant] https://dl.acm.org/doi/abs/10.1145/1462571.1462575
[Rubinson] https://journals.sagepub.com/doi/10.1177/0894439314521988
[Tessaris et al.] https://arxiv.org/pdf/2202.10898.pdf