rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[BUG] `all` aggregation does not return true when column is empty or contains all nulls #8757

Closed jlowe closed 1 year ago

jlowe commented 3 years ago

Describe the bug The all aggregation should return true for an empty column or an entirely-null column, but instead it returns false. Semantically, an all reduction indicates there are no false values anywhere in the column, complementing the any reduction, which returns true only if there is at least one true value anywhere in the column. For an empty or entirely-null column, there are no valid values to inspect and therefore no false value to be found.

Steps/Code to reproduce bug Attached is a patch that adds reduction unit tests for these cases.

all-reduction-empty-or-nulls-patch.txt

Expected behavior The all aggregation is essentially a search for false and should return true unless at least one false value is found in the column. Nulls are skipped, so an empty or entirely-null column contains no false values to find and should therefore result in true.
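The expected semantics can be sketched in plain Python, with `None` standing in for null (illustrative only, not cudf code):

```python
# Sketch of the expected `all` semantics: nulls (None) are skipped,
# so an empty or all-null column has no false value and yields True.
def all_reduction(column):
    return all(v for v in column if v is not None)

print(all_reduction([]))             # True: no values to inspect
print(all_reduction([None, None]))   # True: nulls are skipped
print(all_reduction([True, None]))   # True
print(all_reduction([False, None]))  # False: a false value was found
```

Python's built-in `all` already returns `True` on an empty iterable, which matches the vacuous-truth argument above.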

beckernick commented 3 years ago

This doesn't surface in the Python layer, perhaps due to additional null handling.

```python
import cudf

s = cudf.Series([])
print(s.all())

s = cudf.Series([None, None])
print(s.all())
```

```
True
True
```
davidwendt commented 3 years ago

The result is actually an invalid scalar as described in the documentation for cudf::reduce https://docs.rapids.ai/api/libcudf/stable/group__aggregation__reduction.html#gae4a6e7b821f22ac479d15e8e394fc640

jlowe commented 3 years ago

The result is actually an invalid scalar as described in the documentation for cudf::reduce

Thanks @davidwendt. I suspect the confusion then is around the fact that the all aggregation can "fail," which seems odd to me. I would argue this aggregation should never fail on a boolean column and should always produce a valid scalar, returning true if none of the values were false. Nulls are skipped, so an empty column and an all-null column are equivalent: neither contains a false value, and therefore the reduction seems well-defined in both cases.

davidwendt commented 3 years ago

I think supporting different empty/all-null results in the general cudf::reduce API depending on the aggregation type may be a challenge. Perhaps this could be a feature request for a cudf::all_of API (like std::all_of and thrust::all_of). There is an internal cudf::reduction::all() function that cudf::reduce calls for the all aggregation, and I believe it may already return the desired result. Right now the empty/all-null case is handled by the general cudf::detail::reduce function for all aggregation types (as documented), so the new cudf::all_of API would potentially only need to be a wrapper around the existing cudf::reduction::all().
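The split described here can be sketched in plain Python (`None` standing in for null; the function names are hypothetical stand-ins, not the actual libcudf API):

```python
# `reduce_all` mimics the documented cudf::reduce behavior: an empty or
# all-null input yields an invalid (None) scalar. `all_of` mimics the
# proposed std::all_of-like wrapper, which treats that case as vacuously true.
def reduce_all(column):
    valid = [v for v in column if v is not None]
    if not valid:        # empty or all-null: invalid scalar
        return None
    return all(valid)

def all_of(column):
    result = reduce_all(column)
    return True if result is None else result  # invalid -> vacuously true

print(reduce_all([None, None]))  # None (invalid scalar)
print(all_of([None, None]))      # True
print(all_of([True, False]))     # False
```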

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

jlowe commented 2 years ago

Still desired to see a cudf::all_of / cudf::any_of or other fix for this issue, as we continue to put workarounds in our code to make sure we check the validity of the reduction result before examining the reduction value. Low priority.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

jlowe commented 2 years ago

Still desired, low priority.

bdice commented 1 year ago

edit: Well, oops. I made an error below. Kleene logic is not the same as "nulls beget nulls." Leaving this comment for posterity.

I discussed this with @davidwendt yesterday. We compared the behavior of cudf::reduce with any on a two-row column against the behavior of the binary OR op between two columns. I expected both to follow the Kleene three-valued truth table for disjunction (the OR operator), but it turns out that reductions do not adhere to this logic. This asymmetry was not what I expected. Across most of cudf, we tend to adhere to ~Kleene logic~ "nulls beget nulls." Reductions are currently an exception to this, and will remain an exception (but in a different way) if #12279 is merged. Below is a table with our findings; changes between the current and proposed reduction results are bolded.

| Any/or input | "Or" binop result ~(Kleene logic)~ | Current "any" result | Proposed "any" result (#12279) |
| --- | --- | --- | --- |
| [true, true] | true | true | true |
| [true, null] | null | true | true |
| [true, false] | true | true | true |
| [null, true] | null | true | true |
| [null, null] | null | null | **false** |
| [null, false] | null | false | false |
| [false, true] | true | true | true |
| [false, null] | null | false | false |
| [false, false] | false | false | false |
| [true] | — | true | true |
| [null] | — | null | **false** |
| [false] | — | false | false |
| [] | — | null | **false** |
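The table's three result columns can be reproduced with a small plain-Python sketch (`None` standing in for null; function names are illustrative, not cudf API):

```python
# The binop column follows "nulls beget nulls": any null operand -> null.
def binop_or(a, b):
    if a is None or b is None:
        return None
    return a or b

def binop_or_reduce(column):
    if not column:
        return None  # no defined binop result for an empty input
    result = column[0]
    for v in column[1:]:
        result = binop_or(result, v)
    return result

# Current reduction: skip nulls; empty/all-null yields an invalid (None) scalar.
def current_any(column):
    valid = [v for v in column if v is not None]
    return any(valid) if valid else None

# Proposed reduction (#12279): skip nulls; empty/all-null yields False.
def proposed_any(column):
    return any(v for v in column if v is not None)

print(binop_or_reduce([True, None]))  # None
print(current_any([True, None]))      # True
print(proposed_any([None, None]))     # False
```

Note how `[true, null]` exposes the asymmetry: the binop fold propagates the null, while both reduction variants skip it and return true.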

We briefly discussed that the proposed changes should retain associativity, and it seems like it does (any/all operators are trivially commutative).

I am not convinced that any/all reductions should necessarily behave the same as std::any_of / std::all_of, because cuDF's three-valued logic (true, false, null) is fundamentally different from the boolean logic of the C++ Standard Library. However, I won't oppose this change if others feel strongly in favor of it, because I think the current reduction behavior is already quite broken by not adhering to Kleene logic, and this is not substantially worse in my opinion. That said, I don't think the breaking change is necessary when reasonable workarounds exist: checking whether the null count equals the column size is sufficient to catch both cases that currently return null (all-null inputs and empty inputs). Alternatively, users can check the validity of the returned scalar and replace null results with a false (valid) result to achieve the same effect as this PR.
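The workaround described here can be sketched in plain Python (`None` standing in for null; `safe_all` is a hypothetical caller-side helper, not a cudf function):

```python
# Caller-side workaround: trust the reduction result only when at least one
# non-null value exists; otherwise substitute the vacuous result yourself.
def safe_all(column):
    null_count = sum(1 for v in column if v is None)
    if null_count == len(column):  # covers both empty and all-null inputs
        return True                # caller-chosen vacuous result for `all`
    return all(v for v in column if v is not None)

print(safe_all([]))             # True
print(safe_all([None, None]))   # True
print(safe_all([None, False]))  # False
```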

Another downside of changing the behavior, in my view, is that any/all will never return null results, while other monoidal reductions (like sum) will continue to return null for some inputs (empty or all nulls). This is breaking a significant symmetry between operations that, from my experience, tend to have the same natural algebraic structure (sum/product/min/max/any/all).

jlowe commented 1 year ago

The problem with keeping the behavior as-is is that it's not self-consistent and is arguably wrong in some cases. any means "are any of the values true?" It should return false for an empty input, as there are no values that are true. NULL == TRUE is not true, so IMO it should also return false for an all-null input.

If we really wanted "null begets null" logic, then an input containing at least one null and no true values, even if it contains false values, should return null rather than false. (Actually, if at least one value in the input is null, then arguably null-begets-null logic should return null, if you think of ANY as semantically implemented as a series of EQUAL and OR operations.) The current behavior amounts to "ignore nulls unless the input is all nulls or empty," which is not typical for libcudf: libcudf usually implements nulls-beget-nulls or ignore-nulls logic, and this is a weird hybrid of the two.
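The strict nulls-beget-nulls interpretation mentioned here can be sketched in plain Python (`None` standing in for null; illustrative only):

```python
# Under strict nulls-beget-nulls semantics, folding OR over the column
# propagates any null, so even [false, null] reduces to null, not false.
def strict_any(column):
    result = False
    for v in column:
        if v is None:
            return None  # any null poisons the whole reduction
        result = result or v
    return result

print(strict_any([False, None]))  # None, not False
print(strict_any([True, None]))   # None: even a true is overridden
print(strict_any([False, True]))  # True
```

This shows how far the current "ignore nulls unless all-null or empty" behavior is from either pure interpretation.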

The main issue I have with keeping it as-is is that the API is cumbersome to wield correctly and easy to misuse in an undefined-behavior way for these corner cases. How many callers will expect the behavior to act like just about any (no pun intended) version of this API and blindly reference the uninitialized scalar memory because they had no idea this could return a null scalar? For example, Python, Scala, Pandas, and Spark all have these predicate functions, and none of them return nulls for these predicates. The latter two use three-valued logic.

bdice commented 1 year ago

The Pandas and Spark comparisons are significant here because of their three-valued logic. The Python/Scala/C++ comparisons are not, in my opinion, because those languages don't share the same logical model. If the Pandas and Spark APIs agree with one another, I can support the change to match them. I have minor reservations about a lot of the discussion above, but I don't feel strongly enough to oppose merging #12279. I retracted my request for changes on the PR.