Closed CameronBieganek closed 1 year ago
The only major domain in which that would be expected is that of SQL/databases - and the behaviour of NULL values is one of the most frequently mistaken things about SQL, as the concept of UNKNOWN is further overloaded with both MISSING and NONE. In other words three-value logic is in some senses closer to five-value logic, except you can't properly distinguish between those last three values[^1]. As someone who has worked with a lot of databases, this has long been a pet peeve of mine 🤣
Truthfully I don't see this happening as it would be extremely surprising behaviour given that neither Python, Rust, nor JS users would expect it, and polars isn't actually a database.
[^1]: A NONE value is definitely not equal to 42. A MISSING value is definitely not equal to 42. An UNKNOWN value might (or might not) be equal to 42. All three of these possibilities are bundled under just "NULL".
First, let's clarify the terminology.
NULL
is a marker that means the absence of a value.
NULL
is not a value, it is a marker.UNKNOWN
is the third boolean value in 3VL.
UNKNOWN
is a boolean value. It is not a marker.UNKNOWN
is not the same as NULL
.
UNKNOWN
with a boolean NULL
marker. 🤦♂️ NONE
: I don't know what you have in mind for the word NONE
, but I don't think it needs to enter into this discussion.I disagree with your footnote that says "A MISSING value is definitely not equal to 42". The expression <missing> == 42
should return UNKNOWN
. That is approximately what happens in SQL, where NULL = 42
evaluates to NULL
.
I agree that the best representation of missing data is a problem that still has not been solved by any language, including SQL. I also agree that the SQL implementation of 3VL is flawed. As noted above, it is unfortunate that SQL represents the boolean UNKNOWN
by a boolean NULL
marker. However, I believe that in practice SQL's 3VL is useful for representing missing data, as long as you are careful about handling NULL
s in your queries. In other words, it's better than only having 2VL and not having support for NULL
at all.
The point of view that you are espousing is that users should create their own representation of missing data that suits the domain they are working in. That is a valid point of view, but if that's the case, then Polars should not support null
markers. If you have null
markers, then you should support 3VL. It's one or the other---either have 2VL with no null
markers, or have 3VL with null
markers. You can't have it both ways. If you have null
markers, you should have 3VL.
The only major domain in which that would be expected is that of SQL/databases
That's not true. Pandas does support 3VL, thought it is still marked as experimental:
>>> df = pd.DataFrame({"a": [42, 100], "b": [None, 2]}, dtype="Int64"); df
a b
0 42 <NA>
1 100 2
>>> df.a == df.b
0 <NA>
1 False
dtype: boolean
>>> df.a != df.b
0 <NA>
1 True
dtype: boolean
Furthermore, R and Julia natively support 3VL:
R:
> 42 == NA
[1] NA
> 42 != NA
[1] NA
Julia:
julia> 42 == missing
missing
julia> 42 != missing
missing
If SQL, R, Julia, and Pandas all support 3VL, doesn't it seem reasonable for Polars to support 3VL? One of the things I like about Polars is how the semantics are so similar to SQL semantics. However, not supporting 3VL is a radical divergence from SQL. It's a footgun for sure.
Polars has the opportunity to improve upon the 3VL in SQL by providing a Boolean
type that is truly three-valued, with values true
, false
, and unknown
.
Here's another example:
In [25]: df = pl.DataFrame(
...: {"a": [42], "b": [None], "c": [None]},
...: schema = {"a": pl.Int64, "b": pl.Int64, "c": pl.Boolean}
...: )
In [26]: df.with_columns(
...: add = pl.col("a") + pl.col("b"),
...: mult = pl.col("a") * pl.col("b"),
...: gt = pl.col("a") > pl.col("b"),
...: lt = pl.col("a") < pl.col("b"),
...: _not = ~pl.col("c")
...: )
Out[26]:
shape: (1, 8)
┌─────┬──────┬──────┬──────┬──────┬───────┬───────┬──────┐
│ a ┆ b ┆ c ┆ add ┆ mult ┆ gt ┆ lt ┆ _not │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ bool ┆ i64 ┆ i64 ┆ bool ┆ bool ┆ bool │
╞═════╪══════╪══════╪══════╪══════╪═══════╪═══════╪══════╡
│ 42 ┆ null ┆ null ┆ null ┆ null ┆ false ┆ false ┆ null │
└─────┴──────┴──────┴──────┴──────┴───────┴───────┴──────┘
So null
propagates through arithmetic operations, as it should. But 42 > null
returns false
. That makes no sense to me. The right-hand side of >
is missing, so it's impossible to decide if 42 is greater than the right-hand side. It should return UNKNOWN
(or at least null
).
But to make things even more inconsistent, ~
does propagate null
. What's up with that?
To add to the confusion, comparison with scalar values does propagate null
for all but ==
and !=
, see:
>>> df = pl.DataFrame({"col": [1, 2, None, 3]})
>>> df > 2
shape: (4, 1)
┌───────┐
│ col │
│ --- │
│ bool │
╞═══════╡
│ false │
│ false │
│ null │
│ true │
└───────┘
>>> df == 2
shape: (4, 1)
┌───────┐
│ col │
│ --- │
│ bool │
╞═══════╡
│ false │
│ true │
│ false │
│ false │
└───────┘
Oh, I see what you're saying. Whether null
propagates depends on which side of the inequality it is on. Here's a somewhat more explicit example:
In [12]: df = pl.DataFrame({"42": [42], "null": [None]})
In [13]: df.with_columns(
...: (pl.col("42") > pl.col("null")).alias("42 > null"),
...: (pl.col("null") < pl.col("42")).alias("null < 42")
...: )
Out[13]:
shape: (1, 4)
┌─────┬──────┬───────────┬───────────┐
│ 42 ┆ null ┆ 42 > null ┆ null < 42 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ bool │
╞═════╪══════╪═══════════╪═══════════╡
│ 42 ┆ null ┆ false ┆ null │
└─────┴──────┴───────────┴───────────┘
Yikes.
So... there is a large overlap between null-poisoning behaviour (which we have made efforts to standardise before, across various functions) and 3VL, though they are not exactly the same thing - and as it looks like you have found some further internal inconsistencies I guess we're not done yet!
Rather than get tied-up in the more philosophical "what does null even mean" (eg: unknown, missing, or an 'actual' none value such as that found in Python) right at the start, let's focus on the inconsistencies first as that'll move us all closer to a common/desired end-state - fixing internal inconsistencies is an unambiguous win for everyone ;)
For clarity, do you think you could raise each as separate issues if you notice more than one? Will make it easier to track, rather than aggregating all of them into the comments of this issue; let's start with the comparison inconsistency above (42 > null
vs null < 42
), as I can't think of a great reason why switching lhs
/rhs
would result in different logic being applied here 🤔
At the very least we should be internally consistent, and we should also better-document the expected behaviour/rules relating to treatment and propagation of nulls. I suspect we'll finish in a place where everybody is mostly happy (nobody is ever completely happy ;)
Also, have a look over some of the previous threads on the topic of null-poisoning (referencing Kleene logic, various papers covering relational algebras with NULLs, etc). @sorhawell put together a great overview here, for example:
(I remember there was a second issue that I can't find where someone had tracked null-poisoning/propagation behaviour across a large number of relevant functions and normalisation efforts were applied there; if I can track it down, I'll add here).
Just because there's been no follow up on this for a few days - @alexander-beedie, do you want separate issues for each issue with null
propogation in Polars that's inconsistent?
Just because there's been no follow up on this for a few days - @alexander-beedie, do you want separate issues for each issue with
null
propogation in Polars that's inconsistent?
@avimallu: That would be great, yes - can refer back to this issue too, for context. (Had family in town recently, so have been catching up on things!)
It would be great for polars to have consistent 3-value logic. Unfortunately, it has "3.5-value logic" with true, false, null, nan
where nan
playing role of a spoiler. For floats 'nan' occupies the same niche as null
but is not quite the same forcing a user to check for both.
For floats 'nan' occupies the same niche as
null
but is not quite the same forcing a user to check for both.
@slonik-az: NaN
has a specific and useful meaning that is quite distinct from NULL
though - it is a basic part of the floating point specification. You can't (and shouldn't) avoid it. I suspect it is pandas' longstanding use of NaN
to mean NULL
(except when it doesn't) that it inherited from numpy that has soured people on it in the dataframe space... ;)
Taking another look at this, I see that Polars does support three-valued logic, because it has the correct truth tables for the logical boolean operators:
In [13]: (
...: pl.DataFrame(dict(
...: A = [True, False, None]
...: ))
...: .with_columns(
...: (~pl.col("A")).alias("~A")
...: )
...: )
Out[13]:
shape: (3, 2)
┌───────┬───────┐
│ A ┆ ~A │
│ --- ┆ --- │
│ bool ┆ bool │
╞═══════╪═══════╡
│ true ┆ false │
│ false ┆ true │
│ null ┆ null │
└───────┴───────┘
In [14]: (
...: pl.DataFrame(dict(
...: A = [True, True, True, False, False, False, None, None, None],
...: B = [True, False, None, True, False, None, True, False, None]
...: ))
...: .with_columns(
...: (pl.col("A") & pl.col("B")).alias("A & B")
...: )
...: )
Out[14]:
shape: (9, 3)
┌───────┬───────┬───────┐
│ A ┆ B ┆ A & B │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
│ true ┆ false ┆ false │
│ true ┆ null ┆ null │
│ false ┆ true ┆ false │
│ false ┆ false ┆ false │
│ false ┆ null ┆ false │
│ null ┆ true ┆ null │
│ null ┆ false ┆ false │
│ null ┆ null ┆ null │
└───────┴───────┴───────┘
In [15]: (
...: pl.DataFrame(dict(
...: A = [True, True, True, False, False, False, None, None, None],
...: B = [True, False, None, True, False, None, True, False, None]
...: ))
...: .with_columns(
...: (pl.col("A") | pl.col("B")).alias("A | B")
...: )
...: )
Out[15]:
shape: (9, 3)
┌───────┬───────┬───────┐
│ A ┆ B ┆ A | B │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
│ true ┆ false ┆ true │
│ true ┆ null ┆ true │
│ false ┆ true ┆ true │
│ false ┆ false ┆ false │
│ false ┆ null ┆ null │
│ null ┆ true ┆ true │
│ null ┆ false ┆ null │
│ null ┆ null ┆ null │
└───────┴───────┴───────┘
All that's left is to properly propagate null
through all numerical comparison operators. null
should propagate through all numerical comparisons, because null
is not a number---it is a marker for a missing value. Currently, null
does not propagate through ==
and !=
(tracked in #8182).
Closing in favor of the issues that track specific issues with null
propagation, e.g. #8182, #8183.
Polars version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Issue description
Polars does not implement three-valued logic.
Here is a link to a SQLFiddle example showing the expected behavior for the Polars example below: http://sqlfiddle.com/#!17/a30e47/1/0
Reproducible example
Expected behavior
Installed versions