pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Configure handling of null values in aggregations #10016

Open stinodego opened 1 year ago

stinodego commented 1 year ago

Problem description

This is an often-requested feature.

There is some inconsistency between aggregations in how null values are handled, and the handling is currently not configurable.

Examples:

Vertical aggregation

import polars as pl

df = pl.DataFrame({"a": [1, 2, None], "b": [3, None, None]})

# Nulls are ignored. What if I want to raise on null values?
df["a"].sum()  # 3

# One null means the product is null. What if I want to ignore null values?
df["a"].product()  # None - EDIT: Fixed, now returns 2

Horizontal aggregation

# A single null results in null. What if I want null to be treated as 0? - EDIT: Nulls are now ignored
df.with_columns(pl.sum_horizontal("a", "b"))
shape: (3, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ sum  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 3    ┆ 4    │
│ 2    ┆ null ┆ 2    │
│ null ┆ null ┆ 0    │
└──────┴──────┴──────┘

# Nulls are ignored, all nulls results in null. What if I want to propagate null values?
df.with_columns(pl.max_horizontal("a", "b"))
shape: (3, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ max  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 3    ┆ 3    │
│ 2    ┆ null ┆ 2    │
│ null ┆ null ┆ null │
└──────┴──────┴──────┘

We should add some configuration options to both horizontal and vertical aggregations.

Please use this thread to discuss possible configuration options.

ritchie46 commented 1 year ago

Can you elaborate a bit on what you mean?

Null values are sometimes handled differently by design. A sum of an empty set is 0, while a var of an empty set doesn't exist and returns None.

The empty set analogy was discussed earlier and that is what we are following now. If we deviate from that somewhere, I think we must open an issue for that on a per case basis.

stinodego commented 1 year ago

Added some examples from the linked issues.

I believe @avimallu also requested this in a PR somewhere.

avimallu commented 1 year ago

Thanks @stinodego. If the empty set analogy was added in, then I think https://github.com/pola-rs/polars/issues/8184 should be closed.

I think there are mainly two suggestions from users: some want nulls to propagate whenever a null is present (like me), and some want nulls to be excluded whenever they are present.

In the case of horizontal aggregations, some functions (like sum_horizontal) always propagate nulls, while others (like max_horizontal) propagate them only when the entire set of values is null.

In the case of vertical aggregations (that is, by column), we ignore nulls and don't provide a way around it (in case someone wants one).

My thought on this is that whatever Polars decides to do, it should do so consistently.

ion-elgreco commented 1 year ago

The default behavior of returning null in sum_horizontal is odd. You would expect this to work like a sum over a column, which simply ignores the null values.

Using null_strategy as @avimallu proposes should work well here, with the default set to ignore_nulls. If we take a look at pandas, it uses skipna for this, with a default of True.
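
For reference, a minimal illustration of the pandas skipna behavior mentioned above (a sketch; the toy values are just an example):

import pandas as pd
import numpy as np

pd.Series([1.0, np.nan]).sum()               # 1.0 - skipna=True is the default
pd.Series([1.0, np.nan]).sum(skipna=False)   # nan - nulls propagate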

Sage0614 commented 1 year ago

In terms of handling null consistently in aggregation, I think there are a few types of aggregations that people will encounter now or in the future, and people generally expect consistent behavior across them:

  1. Aggregation of an N-by-1 sequence to a scalar value, like a simple sum([1, 2, 3, None]).
  2. Aggregation of an N-by-m sequence to a scalar value, like a weighted sum or corr, in which each input sequence can have nulls.
  3. Aggregation of an N-by-m sequence to an N-by-1 sequence, like a rolling sum or rolling corr, where for each position in the output sequence you need to decide whether to put null or not.

Also, we have a corner case where the input sequence is all null.

Both case 1 and case 2 can be seen as special cases of case 3: case 1 is case 2 if you treat it as two sequences, the original sequence and its null bitmask; case 2 is case 3 when the window length is the whole sequence.

So we only need a consistent way to deal with case 3 and let case 1 and case 2 respect that implementation.

Setting the default null-handling behavior to "ignore" is dangerous in my view. If the null is there for some unexpected reason, it almost always delays the issue until much later in the data science pipeline, and you need to go back to find the root cause, or even worse, it is never found until many days later.

Setting the default behavior to propagate null forces the user to think about how they should handle null values if they exist; if the downstream process cannot handle null, it fails early and fails fast.

For those people who would like to save a few keystrokes and don't like the propagate-null behavior, I wonder whether there could be a per-session option to alter the default null-handling behavior?
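
As a user-space sketch of such a session-level default (not a Polars feature; the NULL_STRATEGY flag and agg_sum helper are hypothetical names):

import polars as pl

NULL_STRATEGY = "propagate"  # module-level default the helper consults; could be "ignore"

def agg_sum(col: str) -> pl.Expr:
    # Sum that either ignores nulls (the Polars default) or propagates them,
    # depending on the session-level flag above.
    expr = pl.col(col).sum()
    if NULL_STRATEGY == "propagate":
        expr = pl.when(pl.col(col).null_count() > 0).then(None).otherwise(expr)
    return expr

pl.DataFrame({"a": [1, 2, None]}).select(agg_sum("a"))  # null under "propagate", 3 under "ignore"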

Ludecan commented 1 year ago

My two cents on the discussion as a user, coming from this issue.

The way the R language handles missing values in aggregations is like stinodego is proposing: it provides an na.rm parameter (default FALSE, so nulls are not removed unless explicitly requested), much like null_strategy, indicating whether the nulls should be stripped before calculating the aggregation or not. All operations then propagate nulls if any are present.

I've found this approach very useful, since there are cases when I want to propagate the nulls: say, what's the accumulated yearly rainfall if I'm missing half the values in the year? I'd want a null there. And what's the average daily rainfall if I'm only missing 3 days? I might want to ignore nulls and calculate the average anyway, since my sample size is big enough. I believe case 1 exposed by Sage0614 can be handled by this according to the user's needs. Oddly, the sum of an empty set is 0 (regardless of na.rm) and the prod of an empty set is 1. What is the yearly rainfall accumulation if I have no days in the registry? I would expect an NA there, at least if na.rm is TRUE. But I get the sum-of-an-empty-set argument.

For case 2, R is a bit more involved. It provides the use parameter of cor:

  1. "everything" uses all series in full and propagates nulls.
  2. "all.obs" raises an error if there is any null in any of the series.
  3. "complete.obs" deletes rows for which there are nulls in any series and raises an error if all rows have nulls.
  4. "na.or.complete" deletes rows for which there are nulls in any series but returns null if all rows have nulls.
  5. "pairwise.complete.obs" uses all points across all pairs of series where both values in the pair are not null (and returns null if all pairs are null).
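
For comparison, a rough Polars analogue of "complete.obs" can already be composed from existing pieces (a sketch; the toy data is mine):

import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, None, 4.0], "b": [2.0, None, 3.0, 5.0]})

# Drop rows with a null in either column, then correlate the complete observations.
df.drop_nulls(["a", "b"]).select(pl.corr("a", "b"))  # 1.0 for this toy data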

For case 3, rolling aggregations, R doesn't provide a way to remove nulls: they are propagated as soon as they are found, and you can't remove nulls before the aggregation because that changes the length of the series/return. Pandas, however, does: it returns null at the location of the missing value and then resumes accumulation at the next index. I've seen this used in rainfall data QC, where you calculate the cumsum of two stations with nulls and find by how much the stations deviated by the end of the year, and on which day the deviations started.
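
The pandas behavior described here looks roughly like this (a small illustration, not Polars):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 2.0, 3.0])
s.cumsum()
# 0    1.0
# 1    NaN   <- the missing position stays missing
# 2    3.0   <- accumulation resumes past the gap
# 3    6.0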

I wouldn't use the null_strategy to do more sophisticated things like imputing 0s or imputing the mean. In the rainfall accumulation example above, if you're missing only a week of data you might be interested in imputing the missing days with the mean of the remaining days to get an unbiased accumulation (or with 0s if it's a place/time with little rain). If you want to do that, you can do it manually and explicitly (and more complex imputations as well) before calculating your aggregates; adding more logic to the parameter bloats the API/codebase in my mind and doesn't provide anything extra.
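
Explicit imputation before aggregating is already easy to express; a minimal sketch (the rain column is just an example of mine):

import polars as pl

df = pl.DataFrame({"rain": [1.0, None, 3.0]})

df.select(
    pl.col("rain").fill_null(0.0).sum().alias("sum_zero_imputed"),              # 4.0
    pl.col("rain").fill_null(strategy="mean").sum().alias("sum_mean_imputed"),  # 6.0
)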

Thanks for all your hard work. Polars is an awesome library.

Sage0614 commented 1 year ago

Thanks @Ludecan for the detailed explanation; I've also used R for a long time, and that's exactly what I am trying to get at. For the different types of functions in cases 1, 2, and 3, if you don't propagate null by default, the most "makes sense" default behavior is going to differ a lot from one group of functions to another. Users cannot remember the default behavior of every aggregation function, and it violates the expectation that functions in a library should behave the same way.

owenprough-sift commented 1 year ago

Null values are sometimes handled differently by design. A sum of an empty set is 0, while a var of an empty set doesn't exist and returns None.

The empty set analogy was discussed earlier and that is what we are following now. If we deviate from that somewhere, I think we must open an issue for that on a per case basis.

The SQL standard states that sum() of zero rows is null. If sum of an empty set in polars is intentionally 0, that should be made clear in the documentation/examples. Polars' sum of an all-null series --> 0 tripped me up today and it took me a bit to figure out why I wasn't getting null.

ritchie46 commented 1 year ago

Null values are sometimes handled differently by design. A sum of an empty set is 0, while a var of an empty set doesn't exist and returns None. The empty set analogy was discussed earlier and that is what we are following now. If we deviate from that somewhere, I think we must open an issue for that on a per case basis.

The SQL standard states that sum() of zero rows is null. If sum of an empty set in polars is intentionally 0, that should be made clear in the documentation/examples. Polars' sum of an all-null series --> 0 tripped me up today and it took me a bit to figure out why I wasn't getting null.

Maybe open a PR to improve the docs?

stevenlis commented 1 year ago

@stinodego I couldn't find any information in the changelog and ran into this issue. My code was written for version 0.17.3, where summing all-None values returned null. However, in version 0.19.6, it now returns 0.

import polars as pl

df = pl.DataFrame({"a": [1, 2, None], "b": [None, None, None]})
print(df["a"].sum())
print(df["b"].sum())
3
0.0

My code looks like

pl.DataFrame(
    {'id': ['1', '1', '1'],
     'a': [1, 1, 1],
     'b': ['A', 'A', 'B']}
).with_columns(
    pl.col('a').filter(pl.col('b') == 'C').sum().over('id')
)

Before in 0.17.3

shape: (3, 3)
┌─────┬──────┬─────┐
│ id  ┆ a    ┆ b   │
│ --- ┆ ---  ┆ --- │
│ str ┆ i64  ┆ str │
╞═════╪══════╪═════╡
│ 1   ┆ null ┆ A   │
│ 1   ┆ null ┆ A   │
│ 1   ┆ null ┆ B   │
└─────┴──────┴─────┘

Now in 0.19.6

shape: (3, 3)
┌─────┬─────┬─────┐
│ id  ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 0   ┆ A   │
│ 1   ┆ 0   ┆ A   │
│ 1   ┆ 0   ┆ B   │
└─────┴─────┴─────┘
cmdlineluser commented 1 year ago

@stevenlis I think that change was boolean addition https://github.com/pola-rs/polars/pull/8778

Seems to have been part of 0.17.13 https://github.com/pola-rs/polars/releases/tag/py-0.17.13

ritchie46 commented 10 months ago

I want to keep the empty set analogy.

For anyone who wants to deviate from Polars' defaults: use Polars' strength, its composability. It is very easy to create an aggregation expression that returns Null when there are Nulls in the Series.

from typing import Callable
import polars as pl

def null_agg(col: str, agg: Callable[[pl.Expr], pl.Expr]) -> pl.Expr:
    return pl.when(pl.col(col).null_count() > 0).then(None).otherwise(agg(pl.col(col)))

pl.DataFrame({
    "a": [1, 2, None, 3]
}).select(
    null_sum = null_agg("a", lambda c: c.sum()),
    zero_sum = pl.col("a").sum()
)
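# For the example frame above, a recent Polars release should return one row with
# null_sum = null and zero_sum = 6 (an expected-output sketch, not verified on every version).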
MarkMoulton commented 10 months ago

ritchie46 wrote: "I want to keep the empty set analogy."

The problem is that this analogy is not appropriate for dataframes, so it leads to mathematical inconsistencies. A Polars column is not a set. It is a variable. Sets have only "is a member" (not null) and "is not a member" (is null). The null bitmap is the appropriate representation for sets. The count() method is the appropriate way to perform aggregations on sets.

Variables have this membership property, but in addition their members are assigned a numerical value and data type that have nothing to do with membership. They have "units" (mass, velocity, dollars, success, days, etc.). Sums, means, variances, etc. are the appropriate ways to aggregate the values of elements in variables, with this essential rule: The value resulting from an aggregation operation applied to elements with a given unit should carry the same unit, and be interpretable as such without ambiguity.

Requiring a sum of nulls to return zero violates this rule. In this case, "0" does not signify a number with the same units as the elements. It reflects a different unit -- the "membership unit" corresponding to sets. The consequence is that, depending merely on the accident of whether a given column (or row) is populated with data, for instance after a dataframe has been sliced and diced in various ways, its units will switch capriciously between membership units and the proper variable units. This will yield mathematically unsound results that will in many cases be invisible to the user.

Yes, the alert user can write workarounds, but why should she have to? It would be better to have at least a null_strategy option as discussed here. But the proper and only mathematically sound solution is to make a sum of nulls return null as the default behavior, as it was before.

stinodego commented 9 months ago

In this case, "0" does not signify a number with the same units as the elements.

I don't think that's true. A column indeed represents a certain property, measurement, or whatever. Number of days, dollars, etc. The data type determines the representation of that unit, e.g. an integer representing the number of days. So it would be reasonable to assume the sum of such a column would be an integer representing the number of days, as you point out. Even if some data in the column is missing (or indeed all the data is missing), the result should still be an integer representing the number of days. If I set the result to null I would actually be violating this principle. So the way I read it, you are making a case for the status quo, which is returning 0 on a column with missing data.

MarkMoulton commented 9 months ago

It would certainly not be safe to assume that the sum(all null) is in the same units as the intended variable. Take the example of temperature readings. For some reason, sensor 49 isn't working and returns a series of nulls. The user naively decides to sum up that column, not knowing it is all nulls, and comes to the conclusion: It must be cold around sensor 49; all its temperatures add up to 0 degrees Celsius.

(In practice, the user in this case would be saved by using mean() instead of sum(), but that's a matter of luck.)

Consider a series with only one cell, its value null. Applying the sum returns 0. The user concludes the sensor is working and registers 0 degrees, even though the datum itself is clearly non-existent. You are forced to take the absurd position that 0 = null, that a number is not a number.

The only safe way to proceed given sum(all null) = 0 is to check the count of nulls in each column, row, and group during group_by and manually change the sum to null if they are all null. How efficient is that? Do you really want to force every user to do that manually for something as simple and important and widely used as summation?
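
Such a manual check ends up looking something like this (a sketch with made-up column names, assuming a recent Polars where group_by exists and Expr.count counts non-null values):

import polars as pl

df = pl.DataFrame({"g": ["x", "x", "y"], "v": [1, None, None]})

df.group_by("g").agg(
    # count() counts non-null values, so an all-null group yields null instead of 0.
    pl.when(pl.col("v").count() == 0)
    .then(None)
    .otherwise(pl.col("v").sum())
    .alias("v_sum")
)
# group "x" -> 1, group "y" -> null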

I just went through that exercise, rewriting my own code to add all these stupid null checks. It was no picnic.

stinodego commented 9 months ago

(In practice, the user in this case would be saved by using mean() instead of sum(), but that's a matter of luck.)

I don't think it's a matter of luck. Mean has a null identity precisely for the reasons you give in your example.

However, if I have a column with the bank account balance as a float, and I have 0 entries in my data, I can confidently say that the total amount of dollars of people in that table is 0.0.

Consider a series with only one cell, its value null. Applying the sum returns 0. The user concludes the sensor is working

You shouldn't use the sum aggregation to determine the sensor is working. You should use null_count.

I'm sure there are good reasons for sum/product to have null identities, but I am not finding any good reasons in your arguments.

There is definitely a good argument for propagating nulls though. You could argue the sum or the maximum of a column is unknown if some of the values are missing. While you can work around this by writing your own aggregation function as Ritchie showed, I would say this is common enough to warrant its own parameter, e.g. ignore_nulls.

Then again, you could argue that we would also need a min_count parameter like pandas has, and there are probably a number of parameters that would also make sense on possibly every aggregation function. We have to stop somewhere - and maybe we should just not include any parameters at all to emphasize the composability that makes Polars good.
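
For reference, the pandas min_count behavior mentioned here (a minimal illustration):

import pandas as pd
import numpy as np

pd.Series([np.nan]).sum()                  # 0.0 - min_count=0 is the default
pd.Series([np.nan]).sum(min_count=1)       # nan - requires at least one valid value
pd.Series([1.0, np.nan]).sum(min_count=2)  # nan - only one valid value is present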

Finally, I did go back to the examples in the original issue and those have all been standardized to ignore nulls, so at least we're consistent now.

bmarcj commented 9 months ago

Speaking strictly as a user, the least surprising functionality is probably to match SQL. Having "What does postgres do?" line up with polars is very helpful, unless there's really good reasons to deviate (and ideally this is clearly documented, and obviously surmountable in the API).

MarkMoulton commented 9 months ago

@bmarcj I note that postgres SQL seems to return Null when given a sum of Nulls. Do I have that right?

@stinodego: I have shown how sum(nulls)=0 leads to an unpredictable, and mathematically forbidden, change of units when applied to variables (as opposed to sets). But you have disregarded that.

You have taken the position that a series of missing temperature readings should sum to 0.0 degrees Celsius, and that a series of missing dollar figures should sum to $0.00.

Having committed to the principle that the sum of two nulls should be 0 and the product of two nulls should be 1, the next logical step is to extend this principle to element-wise addition and multiplication. Yet Polars has not done this (see below), nor any other numerical package I know of. If the sum(nulls)=0 principle is sound, why haven't they?

The answer is that the results become dangerously uninterpretable, especially as such sums and products are passed into other functions. Exactly the same uninterpretability occurs with aggregation expressions, which is how I discovered this change in Polars behavior. My correlations started returning biased results due to the random appearance of zeros from aggregated variables. But that is the behavior you have committed yourself (and the rest of us) to.

df = pl.DataFrame({'a': [1, 2, None], 'b': [3, None, None]})
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
│ 2    ┆ null │
│ null ┆ null │
└──────┴──────┘
df.with_columns([(pl.col('a') + pl.col('b')).alias('sum'), (pl.col('a') * pl.col('b')).alias('product')])
shape: (3, 4)
┌──────┬──────┬──────┬─────────┐
│ a    ┆ b    ┆ sum  ┆ product │
│ ---  ┆ ---  ┆ ---  ┆ ---     │
│ i64  ┆ i64  ┆ i64  ┆ i64     │
╞══════╪══════╪══════╪═════════╡
│ 1    ┆ 3    ┆ 4    ┆ 3       │
│ 2    ┆ null ┆ null ┆ null    │
│ null ┆ null ┆ null ┆ null    │
└──────┴──────┴──────┴─────────┘
# But the sum(nulls)=0 and product(nulls)=1 principle would return this instead:
shape: (3, 4)
┌──────┬──────┬──────┬─────────┐
│ a    ┆ b    ┆ sum  ┆ product │
│ ---  ┆ ---  ┆ ---  ┆ ---     │
│ i64  ┆ i64  ┆ i64  ┆ i64     │
╞══════╪══════╪══════╪═════════╡
│ 1    ┆ 3    ┆ 4    ┆ 3       │
│ 2    ┆ null ┆ null ┆ null    │
│ null ┆ null ┆ 0    ┆ 1       │
└──────┴──────┴──────┴─────────┘
# Is that what you want?
stinodego commented 9 months ago

You are conflating two separate things here: the identity of the aggregation, and the propagation of null values.

The identity pertains to the result of pl.Series([]).sum(). Polars sets this to 0. So does pandas, for example. PyArrow sets this to null.

Propagation of null values pertains to the result of pl.Series([1, None]).sum(). Polars ignores nulls by default and currently does not offer the option to propagate null values. pandas and pyarrow also ignore null values by default, but offer an option to propagate them.
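
Concretely, the two concepts look like this (a sketch; exact defaults may differ across versions):

import polars as pl
import pyarrow as pa
import pyarrow.compute as pc

pl.Series([], dtype=pl.Int64).sum()             # 0    - Polars identity for sum
pl.Series([1, None]).sum()                      # 1    - Polars ignores nulls
pc.sum(pa.array([], type=pa.int64()))           # null - PyArrow identity for sum
pc.sum(pa.array([1, None]), skip_nulls=False)   # null - PyArrow propagating nulls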

It's clear to me you would like to propagate null values for your use case. We can consider adding a parameter for that like other packages do. Until then, you can define your own aggregation as Ritchie has shown.

We are not planning to change the identity of the aggregation. From your responses you seem to want a null identity but are not giving any reasons why.

avimallu commented 9 months ago

The source of this problem, in my opinion, is what null_strategy="ignore_nulls" means. In R, which is where I came from originally, the equivalent na.rm quite literally removes null values. See here:

mean(c(1, NA, NA), na.rm=TRUE) # produces 1, not 0.33333, as one would expect.

Numpy, nor Pandas

Bringing up numpy in reference to how to handle null values is futile, because it doesn't have null values in the first place. Pandas has typically worked around how numpy works in order to handle null values, and it obviously hasn't done a good job. Polars is new and has yet to concretely lay out how to handle them (which is why this issue exists in the first place).

Now, to the issue at hand:

  1. If Polars decides to follow R's na.rm argument, then sum of nulls producing 0 is perfectly fine, since null values are ignored by removing them, thereby creating an empty set.
  2. What you seem to want, @MarkMoulton, is the ability to always propagate null values regardless of the above logic.

The solution, from a complexity perspective, would then be to simply implement what has been proposed in this thread - ignore_nulls, with perhaps slightly different options:

  1. remove - identical to the na.rm argument in R, which Polars seems to be following now (from pl.Series("a", [None, None, 3]).mean() which returns 3.0).
  2. propagate - which is to always say that the presence of null values will result in null values.

Speaking strictly as a user

For me, what would be surprising is behavior that differs from R, although I admit I'll be in the minority there compared to SQL users ;)

bmarcj commented 9 months ago

For me, what would be surprising is behavior that differs from R

That's fair! I think polars has attracted a lot of R refugees unhappy with pandas. (I am also one.)

What do dplyr and data.table do in the three scenarios (all nulls, empty series, mixture)?

avimallu commented 9 months ago

They follow base R:

> library(data.table)
> library(dplyr)
> df = data.frame(
+   a = c(NA, 3, NA),
+   b = c(NA, NA, NA))
> as_tibble(df) |> summarize(across(.cols = c(a, b), \(x) sum(x)))
# A tibble: 1 × 2
      a     b
  <dbl> <int>
1    NA    NA
> as_tibble(df) |> summarize(across(.cols = c(a, b), \(x) sum(x, na.rm=TRUE)))
# A tibble: 1 × 2
      a     b
  <dbl> <int>
1     3     0
> as.data.table(df)[, lapply(.SD, \(x) sum(x))]
       a     b
   <num> <int>
1:    NA    NA
> as.data.table(df)[, lapply(.SD, \(x) sum(x, na.rm=TRUE))]
       a     b
   <num> <int>
1:     3     0
bmarcj commented 9 months ago

Thanks. So to be explicit, R and the SQL specification handle this differently from each other (although note that in e.g. avg(x), null is ignored, and that example matches the R and polars case where the null rows are literally thrown away).

(For @MarkMoulton: https://www.db-fiddle.com/f/h1cCu17vL8m8dWM8caGJpu/1)

Well, I have no arguments from set theory or mathematics to claim that one is superior to the other.

Beyond obvious consistency in whichever choice is made, I would hope that a third (fourth, including pandas?) standard is not introduced!

MarkMoulton commented 9 months ago

@stinodego I don't intend to conflate the propagation of Nulls with the Null identity problem.

Propagation vs. Null Identity: A series that contains some Nulls, but not all, can quite legitimately pass on numerical values by ignoring them, as in R. It seems okay to allow the user to decide that behavior.

However, aggregation of a series that contains all Nulls should not, in any context other than set theory (e.g., based on the Null bitmap) or when using the count() function, return a number. The Null Identity is not valid for variables, only for sets. That is because a Null Identity number, in the context of a variable, has the wrong units and will degrade the validity of any downstream calculations.

But I repeat myself.

bmarcj commented 9 months ago

There's a third case, which is a series containing no rows at all.

I was going to look at product - but postgres has no such aggregation (forcing you to take the exponent of the sum of the logs instead!). But this at least gives another consistency check for however the product of an empty/null/mixed series should be defined, as the two aggregations should follow the same rules.

MarkMoulton commented 9 months ago

As a final meditation, does this feel right to you:

None + None = 0
None * None = 1

If not, why does this feel right:

Series([None, None]).sum() = 0
Series([None, None]).product() = 1

stinodego commented 9 months ago

You cannot really compare aggregations with arithmetic that way.

If I aggregate a billion values, odds are there are at least a few null values in there, and with propagating null values the result would then be null. This is quite useless in a lot of use cases.

If I sum two columns with a billion values and a few nulls, the resulting column still has a few nulls but also a whole lot of entries with the sum.

MarkMoulton commented 9 months ago

@stinodego Aggregations are arithmetic!

But you miss my intention regarding propagating nulls. If a column contains a mix of nulls and values, I have no problem at all with ignoring the nulls and returning a number. That is my preference, in fact, for the reasons you suggest.

What I will not concede is that a column containing all nulls, no values, should yield any value other than null -- except when the aggregating function is count().

MarkMoulton commented 9 months ago

@bmarcj Exactly!

bmarcj commented 9 months ago

(For clarity, I above asked if @MarkMoulton was advocating for the SQL standard - and managed to delete my post.)

bmarcj commented 9 months ago

Based on my understanding so far:

                    R      R (remove nulls)   SQL     Polars
Sum all nulls       null   0                  null    0
Sum empty series    0      0                  null    0
Sum mixed series    null   value              value   value
MarkMoulton commented 9 months ago

Yes, I'm agreeing with @bmarcj that the SQL standard works for me. Just to spell out my understanding:

  1. A sum or product of a column, row, or group that contains only nulls and no valid values should return null, not 0 or 1.
  2. This should continue to be the default behavior for all aggregation functions except for count().
  3. count() is special because it is a "set-related" function specifically dedicated to enumerating non-null values.
  4. Otherwise, if any null values are contained in the column (while still containing non-null values), the value that would be obtained by ignoring the null values should be returned, unless (perhaps) the user explicitly specifies that mixed aggregations should return Null.

I think this lines up with what @bmarcj has delineated in his helpful table under the SQL column. It also lines up with what used to be the Polars default behavior.

lukemanley commented 9 months ago

For completeness, just pointing out how pyarrow (and numpy with nan) let you control this behavior:

import polars as pl
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np 

pc.sum(pa.array([None], type=pa.float64()), min_count=1)  # -> None (default)
pc.sum(pa.array([None], type=pa.float64()), min_count=0)  # -> 0.0

np.sum(np.array(np.nan))       # -> nan
np.nansum(np.array([np.nan]))  # -> 0.0
ritchie46 commented 9 months ago

I don't want to change this and am opposed to the SQL analogy here. If in SQL a mixed null/values series sees NULL as the identity, then all-NULL should also be seen as the identity.

And most important to me. The sum of an empty Series is always 0.

The length of the data should NOT change the NULL interpretation. We are not going to change this.

If you want something different we have given you the building blocks to do so.

avimallu commented 9 months ago

(and numpy with nan)

Gently pointing out that this is not a measure of completeness. nan is not the same as null, so this comparison is unwise.

Sage0614 commented 9 months ago

Based on my understanding so far:

                    R      R (remove nulls)   SQL     Polars
Sum all nulls       null   0                  null    0
Sum empty series    0      0                  null    0
Sum mixed series    null   value              value   value

That's only for sum in R; R has a family of null-handling options (where NA is the null equivalent). The behavior documented here is na.rm = TRUE/FALSE, i.e. whether to remove NAs or not, and it applies to functions with a single-value output. When the input is a vector and the output is a vector, NA can either participate in the window calculation (so every subsequent result is null), or NA can be skipped in the calculation while the output vector keeps NA exactly where the input is missing, or NA can simply be dropped, resulting in an output vector with a different length than the input. For inputs with more than one vector, e.g. a correlation matrix calculation, the option is complete.obs or not, and so on and so forth.

The drawback of not propagating null, in my view, is not some mathematical-completeness problem. The problem I see is:

correctness of calculation:

null_option     anticipated_null      not_anticipated_null
propagate       null / correct        null / correct
not_propagate   not-null / correct    not-null / incorrect

In the real world, the problem is that nulls are there for all sorts of reasons, and people are not always aware of their existence. For example, if I calculate some aggregation functions for each group of 3000 stocks over 10 years of daily data, and a few groups contain nulls, and as a 'fluent' data scientist I run 20 lines of chained Polars operations altogether and get 1 summary value, it will be very hard to find out what you have done wrong, because you get a value after all, and maybe one not too far away from the actual correct answer.

The value of propagating null here is that if you don't anticipate a null at the end, you know something is wrong, and you go back to check. An incorrect answer is not better than no answer. It's not Polars' fault because it is the user's ignorance, but the level of pain is the same.

MarkMoulton commented 9 months ago

@Sage0614 Your stock market analysis example is a great illustration of the problem. Nulls can be identified and worked around, and when they are ignored by default they tend not to require any extra work. The invisible propagation of 0's and 1's through a chained operation is another matter. That leads to bogus results that may not look bogus at the end of the pipeline. It was just such an event (my statistical routines started turning up plausible but bogus results) that brought me to this thread. That's the kind of thing that, literally, keeps me up nights.

MarkMoulton commented 9 months ago

@ritchie46 wrote: "And most important to me. The sum of an empty Series is always 0."

I am genuinely curious to know what changed your mind in this regard. Was it a paper? A user complaint? An insight? Is there maybe an Issue number you can refer me to?

Another question: In the future, do you plan to adopt the same policy for the addition and multiplication of elements between series, to generate the following result?

df.with_columns(sum_ = pl.col('a') + pl.col('b'))
shape: (3, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ sum_ │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 4    ┆ 5    │
│ 2    ┆ null ┆ null │
│ null ┆ null ┆ 0*   │
└──────┴──────┴──────┘

Just to be clear, my position is that the third row should return null instead of 0 (and it currently does, which is inconsistent with the new aggregation policy).

Asterisk means I changed existing null to zero.

bmarcj commented 9 months ago

@MarkMoulton If you look at the above table, polars currently matches R when R is configured to throw away nulls. When summing across rows, it currently matches R when nulls are not thrown away.

I'd like this to be more explicit (other comments have suggested using R's approach of specifying whether nulls should be removed by default or not, with the default set to true), but the two operations are less inconsistent than they are simply doing different things - one removes nulls, the other does not.

MarkMoulton commented 9 months ago

@bmarcj Thanks, that's very helpful to me.

ritchie46 commented 9 months ago

@ritchie46 wrote: "And most important to me. The sum of an empty Series is always 0." I am genuinely curious to know what changed your mind in this regard. Was it a paper? A user complaint? An insight? Is there maybe an Issue number you can refer me to?

The mathematical sense of an empty set. And the same logic in the real world. "How many apples are in that empty basket?" would be zero.

We are not alone in this. Numpy and python have the same behavior:

 >>> import numpy as np
>>> np.sum([])
0.0
>>> sum([])
0
>>> np.product([])
1.0

As I said, you can create an expression that deviates from the defaults. Defaults can't (and never will) please everyone. That's why we give you the option to deviate from them.

I want to close this discussion as I don't think it is useful to repeat everybody's takes over and over.