Support three-valued logic

pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust

https://docs.pola.rs

Other

29.34k stars 1.86k forks source link

Support three-valued logic #7595

Closed CameronBieganek closed 1 year ago

CameronBieganek commented 1 year ago

Polars version checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Issue description

Polars does not implement three-valued logic.

Here is a link to a SQLFiddle example showing the expected behavior for the Polars example below: http://sqlfiddle.com/#!17/a30e47/1/0

Reproducible example

In [4]: import polars as pl

In [5]: df = pl.DataFrame({"a": [42], "b": [None]})

In [6]: df.with_columns(
   ...:     a_b_equal = pl.col("a") == pl.col("b"),
   ...:     a_b_not_equal = pl.col("a") != pl.col("b")
   ...: )
Out[6]:
shape: (1, 4)
┌─────┬──────┬───────────┬───────────────┐
│ a   ┆ b    ┆ a_b_equal ┆ a_b_not_equal │
│ --- ┆ ---  ┆ ---       ┆ ---           │
│ i64 ┆ f64  ┆ bool      ┆ bool          │
╞═════╪══════╪═══════════╪═══════════════╡
│ 42  ┆ null ┆ false     ┆ true          │
└─────┴──────┴───────────┴───────────────┘

Expected behavior

In [4]: import polars as pl

In [5]: df = pl.DataFrame({"a": [42], "b": [None]})

In [6]: df.with_columns(
   ...:     a_b_equal = pl.col("a") == pl.col("b"),
   ...:     a_b_not_equal = pl.col("a") != pl.col("b")
   ...: )
Out[6]:
shape: (1, 4)
┌─────┬──────┬───────────┬───────────────┐
│ a   ┆ b    ┆ a_b_equal ┆ a_b_not_equal │
│ --- ┆ ---  ┆ ---       ┆ ---           │
│ i64 ┆ f64  ┆ bool      ┆ bool          │
╞═════╪══════╪═══════════╪═══════════════╡
│ 42  ┆ null ┆ null      ┆ null          │
└─────┴──────┴───────────┴───────────────┘

Installed versions

``` In [7]: pl.show_versions() ---Version info--- Polars: 0.16.13 Index type: UInt32 Platform: Windows-10-10.0.19044-SP0 Python: 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)] ---Optional dependencies--- numpy: 1.21.5 pandas: 1.4.4 pyarrow: connectorx: deltalake: fsspec: 2022.11.0 matplotlib: 3.5.3 xlsx2csv: xlsxwriter: 3.0.3```

alexander-beedie commented 1 year ago

The only major domain in which that would be expected is that of SQL/databases - and the behaviour of NULL values is one of the most frequently mistaken things about SQL, as the concept of UNKNOWN is further overloaded with both MISSING and NONE. In other words three-value logic is in some senses closer to five-value logic, except you can't properly distinguish between those last three values[^1]. As someone who has worked with a lot of databases, this has long been a pet peeve of mine 🤣

Truthfully I don't see this happening as it would be extremely surprising behaviour given that neither Python, Rust, nor JS users would expect it, and polars isn't actually a database.

[^1]: A NONE value is definitely not equal to 42. A MISSING value is definitely not equal to 42. An UNKNOWN value might (or might not) be equal to 42. All three of these possibilities are bundled under just "NULL".

CameronBieganek commented 1 year ago

First, let's clarify the terminology.

Notation:
- 2VL = "two-valued logic".
- 3VL = "three-valued logic".
NULL is a marker that means the absence of a value.
- It's challenging to avoid the phrase "null value", but NULL is not a value, it is a marker.
- The phrase "absence of a value" can generally be identified with the word "missing". Let's not get into a debate about kinds of missing---let's just acknowledge two kinds of missing, "Missing But Applicable" and "Missing But Inapplicable".
UNKNOWN is the third boolean value in 3VL.
- UNKNOWN is a boolean value. It is not a marker.
- UNKNOWN is not the same as NULL.
  - Unfortunately, SQL represents UNKNOWN with a boolean NULL marker. 🤦‍♂️
NONE: I don't know what you have in mind for the word NONE, but I don't think it needs to enter into this discussion.

I disagree with your footnote that says "A MISSING value is definitely not equal to 42". The expression <missing> == 42 should return UNKNOWN. That is approximately what happens in SQL, where NULL = 42 evaluates to NULL.

I agree that the best representation of missing data is a problem that still has not been solved by any language, including SQL. I also agree that the SQL implementation of 3VL is flawed. As noted above, it is unfortunate that SQL represents the boolean UNKNOWN by a boolean NULL marker. However, I believe that in practice SQL's 3VL is useful for representing missing data, as long as you are careful about handling NULLs in your queries. In other words, it's better than only having 2VL and not having support for NULL at all.

The point of view that you are espousing is that users should create their own representation of missing data that suits the domain they are working in. That is a valid point of view, but if that's the case, then Polars should not support null markers. If you have null markers, then you should support 3VL. It's one or the other---either have 2VL with no null markers, or have 3VL with null markers. You can't have it both ways. If you have null markers, you should have 3VL.

The only major domain in which that would be expected is that of SQL/databases

That's not true. Pandas does support 3VL, thought it is still marked as experimental:

>>> df = pd.DataFrame({"a": [42, 100], "b": [None, 2]}, dtype="Int64"); df
     a     b
0   42  <NA>
1  100     2
>>> df.a == df.b
0     <NA>
1    False
dtype: boolean
>>> df.a != df.b
0    <NA>
1    True
dtype: boolean

Furthermore, R and Julia natively support 3VL:

> 42 == NA
[1] NA
> 42 != NA
[1] NA

Julia:

julia> 42 == missing
missing

julia> 42 != missing
missing

If SQL, R, Julia, and Pandas all support 3VL, doesn't it seem reasonable for Polars to support 3VL? One of the things I like about Polars is how the semantics are so similar to SQL semantics. However, not supporting 3VL is a radical divergence from SQL. It's a footgun for sure.

CameronBieganek commented 1 year ago

Polars has the opportunity to improve upon the 3VL in SQL by providing a Boolean type that is truly three-valued, with values true, false, and unknown.

CameronBieganek commented 1 year ago

Here's another example:

In [25]: df = pl.DataFrame(
    ...:     {"a": [42], "b": [None], "c": [None]},
    ...:     schema = {"a": pl.Int64, "b": pl.Int64, "c": pl.Boolean}
    ...: )

In [26]: df.with_columns(
    ...:     add = pl.col("a") + pl.col("b"),
    ...:     mult = pl.col("a") * pl.col("b"),
    ...:     gt = pl.col("a") > pl.col("b"),
    ...:     lt = pl.col("a") < pl.col("b"),
    ...:     _not = ~pl.col("c")
    ...: )
Out[26]: 
shape: (1, 8)
┌─────┬──────┬──────┬──────┬──────┬───────┬───────┬──────┐
│ a   ┆ b    ┆ c    ┆ add  ┆ mult ┆ gt    ┆ lt    ┆ _not │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---  │
│ i64 ┆ i64  ┆ bool ┆ i64  ┆ i64  ┆ bool  ┆ bool  ┆ bool │
╞═════╪══════╪══════╪══════╪══════╪═══════╪═══════╪══════╡
│ 42  ┆ null ┆ null ┆ null ┆ null ┆ false ┆ false ┆ null │
└─────┴──────┴──────┴──────┴──────┴───────┴───────┴──────┘

So null propagates through arithmetic operations, as it should. But 42 > null returns false. That makes no sense to me. The right-hand side of > is missing, so it's impossible to decide if 42 is greater than the right-hand side. It should return UNKNOWN (or at least null).

But to make things even more inconsistent, ~ does propagate null. What's up with that?

borchero commented 1 year ago

To add to the confusion, comparison with scalar values does propagate null for all but == and !=, see:

>>> df = pl.DataFrame({"col": [1, 2, None, 3]})
>>> df > 2
shape: (4, 1)
┌───────┐
│ col   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ false │
│ null  │
│ true  │
└───────┘
>>> df == 2
shape: (4, 1)
┌───────┐
│ col   │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
│ false │
└───────┘

CameronBieganek commented 1 year ago

Oh, I see what you're saying. Whether null propagates depends on which side of the inequality it is on. Here's a somewhat more explicit example:

In [12]: df = pl.DataFrame({"42": [42], "null": [None]})

In [13]: df.with_columns(
    ...:     (pl.col("42") > pl.col("null")).alias("42 > null"),
    ...:     (pl.col("null") < pl.col("42")).alias("null < 42")
    ...: )
Out[13]:
shape: (1, 4)
┌─────┬──────┬───────────┬───────────┐
│ 42  ┆ null ┆ 42 > null ┆ null < 42 │
│ --- ┆ ---  ┆ ---       ┆ ---       │
│ i64 ┆ f64  ┆ bool      ┆ bool      │
╞═════╪══════╪═══════════╪═══════════╡
│ 42  ┆ null ┆ false     ┆ null      │
└─────┴──────┴───────────┴───────────┘

Yikes.

alexander-beedie commented 1 year ago

So... there is a large overlap between null-poisoning behaviour (which we have made efforts to standardise before, across various functions) and 3VL, though they are not exactly the same thing - and as it looks like you have found some further internal inconsistencies I guess we're not done yet!

Rather than get tied-up in the more philosophical "what does null even mean" (eg: unknown, missing, or an 'actual' none value such as that found in Python) right at the start, let's focus on the inconsistencies first as that'll move us all closer to a common/desired end-state - fixing internal inconsistencies is an unambiguous win for everyone ;)
For clarity, do you think you could raise each as separate issues if you notice more than one? Will make it easier to track, rather than aggregating all of them into the comments of this issue; let's start with the comparison inconsistency above (42 > null vs null < 42), as I can't think of a great reason why switching lhs/rhs would result in different logic being applied here 🤔

At the very least we should be internally consistent, and we should also better-document the expected behaviour/rules relating to treatment and propagation of nulls. I suspect we'll finish in a place where everybody is mostly happy (nobody is ever completely happy ;)

Also, have a look over some of the previous threads on the topic of null-poisoning (referencing Kleene logic, various papers covering relational algebras with NULLs, etc). @sorhawell put together a great overview here, for example:

https://github.com/pola-rs/polars/issues/5359

(I remember there was a second issue that I can't find where someone had tracked null-poisoning/propagation behaviour across a large number of relevant functions and normalisation efforts were applied there; if I can track it down, I'll add here).

avimallu commented 1 year ago

Just because there's been no follow up on this for a few days - @alexander-beedie, do you want separate issues for each issue with null propogation in Polars that's inconsistent?

alexander-beedie commented 1 year ago

Just because there's been no follow up on this for a few days - @alexander-beedie, do you want separate issues for each issue with null propogation in Polars that's inconsistent?

@avimallu: That would be great, yes - can refer back to this issue too, for context. (Had family in town recently, so have been catching up on things!)

slonik-az commented 1 year ago

It would be great for polars to have consistent 3-value logic. Unfortunately, it has "3.5-value logic" with true, false, null, nan where nan playing role of a spoiler. For floats 'nan' occupies the same niche as null but is not quite the same forcing a user to check for both.

alexander-beedie commented 1 year ago

For floats 'nan' occupies the same niche as null but is not quite the same forcing a user to check for both.

@slonik-az: NaN has a specific and useful meaning that is quite distinct from NULL though - it is a basic part of the floating point specification. You can't (and shouldn't) avoid it. I suspect it is pandas' longstanding use of NaN to mean NULL (except when it doesn't) that it inherited from numpy that has soured people on it in the dataframe space... ;)

CameronBieganek commented 1 year ago

Taking another look at this, I see that Polars does support three-valued logic, because it has the correct truth tables for the logical boolean operators:

Not

In [13]: (
    ...:     pl.DataFrame(dict(
    ...:         A = [True, False, None]
    ...:     ))
    ...:     .with_columns(
    ...:         (~pl.col("A")).alias("~A")
    ...:     )
    ...: )
Out[13]:
shape: (3, 2)
┌───────┬───────┐
│ A     ┆ ~A    │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ true  ┆ false │
│ false ┆ true  │
│ null  ┆ null  │
└───────┴───────┘

And

In [14]: (
    ...:     pl.DataFrame(dict(
    ...:         A = [True, True, True, False, False, False, None, None, None],
    ...:         B = [True, False, None, True, False, None, True, False, None]
    ...:     ))
    ...:     .with_columns(
    ...:         (pl.col("A") & pl.col("B")).alias("A & B")
    ...:     )
    ...: )
Out[14]:
shape: (9, 3)
┌───────┬───────┬───────┐
│ A     ┆ B     ┆ A & B │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
│ true  ┆ false ┆ false │
│ true  ┆ null  ┆ null  │
│ false ┆ true  ┆ false │
│ false ┆ false ┆ false │
│ false ┆ null  ┆ false │
│ null  ┆ true  ┆ null  │
│ null  ┆ false ┆ false │
│ null  ┆ null  ┆ null  │
└───────┴───────┴───────┘

Or

In [15]: (
    ...:     pl.DataFrame(dict(
    ...:         A = [True, True, True, False, False, False, None, None, None],
    ...:         B = [True, False, None, True, False, None, True, False, None]
    ...:     ))
    ...:     .with_columns(
    ...:         (pl.col("A") | pl.col("B")).alias("A | B")
    ...:     )
    ...: )
Out[15]:
shape: (9, 3)
┌───────┬───────┬───────┐
│ A     ┆ B     ┆ A | B │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
│ true  ┆ false ┆ true  │
│ true  ┆ null  ┆ true  │
│ false ┆ true  ┆ true  │
│ false ┆ false ┆ false │
│ false ┆ null  ┆ null  │
│ null  ┆ true  ┆ true  │
│ null  ┆ false ┆ null  │
│ null  ┆ null  ┆ null  │
└───────┴───────┴───────┘

Conclusion

All that's left is to properly propagate null through all numerical comparison operators. null should propagate through all numerical comparisons, because null is not a number---it is a marker for a missing value. Currently, null does not propagate through == and != (tracked in #8182).

CameronBieganek commented 1 year ago

Closing in favor of the issues that track specific issues with null propagation, e.g. #8182, #8183.