rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0
8.25k stars 883 forks source link

[FEA] Make calling to `purge_nonempty_nulls` optional in various places #12567

Open ttnghia opened 1 year ago

ttnghia commented 1 year ago

There were reported performance regressions in make_lists_column and make_structs_column recently after calling to purge_nonempty_nulls has been added to these factory functions. We need to sanitize (i.e., remove non-empty nulls) for the input data but both checking and removing non-empty nulls may incur some (even significant) overhead.

I propose adding a parameter to the callers of purge_nonempty_nulls such as:

superimpose_nulls(..., std::unique_ptr<column>&& input, bool sanitize_input, ....);

By having such parameter (bool sanitize_input), we can make the calls to purge_nonempty_nulls optional. In some places such as data IO or some custom kernel, we know for sure that all the nulls are empty thus we will not have to waste the overhead of checking non-empty nulls.

ttnghia commented 1 year ago

Targeting 23.04 release.

GregoryKimball commented 1 year ago

I would like to connect this issue and https://github.com/rapidsai/cudf/issues/12786. I would like to identify any libcudf algorithms that generate nonempty nulls and only add sanitization where it is needed due to implementation details for particular algorithms.

GregoryKimball commented 8 months ago

To add some context to this work, we recently added null sanitization checks in #14559, and also started simplifying nulls checking in #13312.

It's also worth mentioning that purge_nonempty_null has caused degenerate performance with copying columns with deeply nested structs (>16 levels). See #14363 for a patch that helped us avoid null sanitization for columns with no parent nulls. The performance observations are captured in issue #TBD.

We should continue to minimize the usage of purge_nonempty_nulls, as part of following our policy "libcudf expects nested types to have sanitized null masks" in the developer guide).