Open ttnghia opened 1 year ago
Targeting 23.04 release.
I would like to connect this issue and https://github.com/rapidsai/cudf/issues/12786. I would like to identify any libcudf algorithms that generate nonempty nulls and only add sanitization where it is needed due to implementation details for particular algorithms.
To add some context to this work, we recently added null sanitization checks in #14559, and also started simplifying nulls checking in #13312.
It's also worth mentioning that purge_nonempty_null
has caused degenerate performance with copying columns with deeply nested structs (>16 levels). See #14363 for a patch that helped us avoid null sanitization for columns with no parent nulls. The performance observations are captured in issue #TBD.
We should continue to minimize the usage of purge_nonempty_nulls
, as part of following our policy "libcudf expects nested types to have sanitized null masks" in the developer guide).
There were reported performance regressions in
make_lists_column
andmake_structs_column
recently after calling topurge_nonempty_nulls
has been added to these factory functions. We need to sanitize (i.e., remove non-empty nulls) for the input data but both checking and removing non-empty nulls may incur some (even significant) overhead.I propose adding a parameter to the callers of
purge_nonempty_nulls
such as:By having such parameter (
bool sanitize_input
), we can make the calls topurge_nonempty_nulls
optional. In some places such as data IO or some custom kernel, we know for sure that all the nulls are empty thus we will not have to waste the overhead of checking non-empty nulls.