Closed jlowe closed 1 year ago
This doesn't present in the Python layer, perhaps due to additional null handling.
import cudf
s = cudf.Series([])
print(s.all())
s = cudf.Series([None, None])
print(s.all())
True
True
The result is actually an invalid scalar as described in the documentation for cudf::reduce
https://docs.rapids.ai/api/libcudf/stable/group__aggregation__reduction.html#gae4a6e7b821f22ac479d15e8e394fc640
The result is actually an invalid scalar as described in the documentation for cudf::reduce
Thanks @davidwendt, I suspect the confusion then is around the fact that the all
aggregation can "fail" which seems odd to me. I would argue this aggregation should never fail on a boolean column and always produce a valid literal, because it should return true if none of the values were false
. Nulls are skipped, so an empty or all-nulls column are equivalent, neither contains a false
value, so therefore the reduction seems well-defined there.
I think supporting different results for empty/all-null for a general cudf::reduce
API depending on the aggregation type may be a challenge. Perhaps this could be a feature request for a cudf::all_of
(like std::all_of
and thrust::all_of
) API.
There is an internal cudf::reduction::all()
function that cudf::reduce
actually calls for the all
aggregation that I believe may already return the desired result. Right now, the empty/all-null case is being handled by the general cudf::detail::reduce
function for all aggregation types (as documented). So the new cudf::all_of
API would potentially only need to be wrapper on the existing cudf::reduction::all()
.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Still desired to see a cudf::all_of
/ cudf::any_of
or other fix for this issue, as we continue to put workarounds in our code to make sure we check the validity of the reduction result before examining the reduction value. Low priority.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Still desired, low priority.
edit: Well, oops. I made an error below. Kleene logic is not the same as "nulls beget nulls." Leaving this comment for posterity.
I discussed this with @davidwendt yesterday. We compared the behavior of cudf::reduce
with any
on a column of two rows with the behavior of binary ops OR
between two columns. I expected both to follow the Kleene three-valued logic truth table for disjunction (the OR operator) but it turns out that reductions are not adhering to this logic. This asymmetry was not what I expected. Across most of cudf, we tend to adhere to ~Kleene logic~ "nulls beget nulls." Reductions are currently an exception to this, and will remain an exception (but in a different way) if #12279 is merged. Below is a table with our findings. Changes between the current/proposed reduction results are bolded.
Any/or input | "Or" Binops Result ~(Kleene logic)~ | Current "Any" Result | Proposed "Any" Result (#12279) |
---|---|---|---|
[true, true] | true | true | true |
[true, null] | null | true | true |
[true, false] | true | true | true |
[null, true] | null | true | true |
[null, null] | null | null | false |
[null, false] | null | false | false |
[false, true] | true | true | true |
[false, null] | null | false | false |
[false, false] | false | false | false |
[true] | true | true | |
[null] | null | false | |
[false] | false | false | |
[] | null | false |
We briefly discussed that the proposed changes should retain associativity, and it seems like it does (any/all operators are trivially commutative).
I am not convinced that any/all reductions should necessarily behave the same as std::any_of
/ std::all_of
, because cuDF's three-valued logic (true, false, null) is fundamentally different than the boolean logic of the C++ Standard Library. However, I won't oppose this change if others feel strongly in favor of it, because I think the current reduction behavior is already quite broken by not adhering to Kleene logic. This is not substantially worse, in my opinion, but I don't think the breaking change is necessary when reasonable workarounds exist: checking if the null count is equal to the size should be sufficient to catch both cases that currently return null: all-null inputs and empty inputs. Alternatively, users can check the validity of the returned scalar and replace null results with a false (valid) result to achieve the same effect as this PR.
Another downside of changing the behavior, in my view, is that any/all will never return null results, while other monoidal reductions (like sum) will continue to return null for some inputs (empty or all nulls). This is breaking a significant symmetry between operations that, from my experience, tend to have the same natural algebraic structure (sum/product/min/max/any/all).
The problem with keeping the behavior as-is is that it's not self-consistent and arguably wrong in some cases. any
means "are any of the values true". It should return false for an empty input, as there are no values that are true. NULL == TRUE is not true, so IMO it should also return false for an input of all nulls.
If we really wanted "null begets null" logic, then any input containing a least one null in an input containing no true values, even if it contains false values, should return null rather than false. (Actually, if at least one value is a null in the input then arguably null-begets-null logic should return null if you think of ANY as implemented semantically as a series of EQUAL and OR.) It's like nulls mean "ignore nulls unless input is all nulls or empty" which is not the typical behavior for libcudf. libcudf usually implements nulls-begets-nulls or ignore-nulls logic, and this is a weird hybrid of those.
The main issue I have with keeping it as-is is that the API is cumbersome to wield correctly and easy to misuse in a undefined behavior way for these corner cases. How many callers will expect the behavior to act like just about any (no pun intended) version of this API and blindly reference the uninitialized scalar memory because they had no idea this could return a null scalar? For example, Python, Scala, Pandas, and Spark all have these predicate functions, and they don't return nulls for these predicates. The latter two are using three-valued logic.
The Pandas and Spark comparisons are significant comparisons here because of their three-valued logic. The Python/Scala/C++ comparisons are not, in my opinion, because they don't share the same logical model. If Pandas/Spark APIs agree with one another, I can support the change to match them. I have minor reservations about a lot of the discussion above, but I don't feel strongly enough to oppose merging #12279. I retracted my request for changes on the PR.
Describe the bug The
all
aggregation should returntrue
for an empty column or an entirely nulls column, but instead it returnsfalse
. Semantically anall
reduction indicates there are nofalse
values anywhere in the column, complementing theany
reduction which will only return true if there is at least onetrue
value anywhere in the column. For an empty column or entirely nulls column, there are no valid values to inspect and therefore nofalse
value to be found.Steps/Code to reproduce bug Attached is a patch that adds reduction unit tests for these cases.
all-reduction-empty-or-nulls-patch.txt
Expected behavior The
all
aggregation is essentially a search forfalse
and should returntrue
unless there is at least onefalse
value found in the column. Nulls are skipped, so an empty column or entirely nulls column means there are nofalse
values to find and therefore should result intrue
.