rapidsai / cudf

cuDF - GPU DataFrame Library
https://docs.rapids.ai/api/cudf/stable/
Apache License 2.0

[TASK][JNI] Investigate train of `null_count` after `explode` #11923

Open abellina opened 2 years ago

abellina commented 2 years ago

While analyzing an nsys trace for a Spark job with deeply nested tables, we see an explode kernel call that is followed by a train of null_count calls, which ends in is_valid.

After we call cudf::explode, we build up a table and construct Java ColumnVector objects. I think constructing these objects is what triggers the null_count calls.

This task is to confirm that the columns that are missing a null count are coming from the explode kernels. If they are coming from explode, it would be great if explode could compute null count as part of that kernel.

In this screenshot, it is the ~20 ms at the end, after explode: [Screenshot from 2022-10-14 11-21-24]
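To illustrate the suspected mechanism, here is a minimal host-side sketch, assuming the null count is computed lazily the first time it is queried on a column whose count was left unknown. `LazyColumn`, `kUnknownNullCount`, and `g_mask_scans` are hypothetical names for this illustration, not libcudf or JNI API; the counter stands in for a kernel launch.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sentinel meaning "null count has not been computed yet".
constexpr int kUnknownNullCount = -1;

// Number of full validity-mask scans performed; a stand-in for kernel launches.
static int g_mask_scans = 0;

// Hypothetical column: one validity bit per row (1 = valid, 0 = null).
struct LazyColumn {
  std::vector<std::uint32_t> mask;
  int size;
  mutable int cached_null_count;

  LazyColumn(std::vector<std::uint32_t> m, int n, int null_count = kUnknownNullCount)
    : mask(std::move(m)), size(n), cached_null_count(null_count)
  {
    assert(n >= 0 && static_cast<std::size_t>(n) <= mask.size() * 32);
  }

  // Scans the mask only when the count is still unknown, then caches it.
  int null_count() const
  {
    if (cached_null_count == kUnknownNullCount) {
      ++g_mask_scans;
      int valid = 0;
      for (int row = 0; row < size; ++row) {
        valid += static_cast<int>((mask[row / 32] >> (row % 32)) & 1u);
      }
      cached_null_count = size - valid;
    }
    return cached_null_count;
  }
};
```

Under this assumption, a producer that hands back columns with the count already filled in avoids the scan entirely, while every column left at the sentinel costs one scan when its wrapper object first asks for the count.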

GregoryKimball commented 2 years ago

I'd like to cross-reference this issue with #11968. It's likely that the null_count appearances in profiles will change as we refactor null_count for compatibility with user-provided streams.

jrhemstad commented 2 years ago

> it would be great if explode could compute null count as part of that kernel.

I'm not following. explode returns a table: https://github.com/rapidsai/cudf/blob/7d173c9d144a64c5e1a0467d2a5eb4181854f25e/cpp/include/cudf/lists/explode.hpp#L72

Are you then constructing a column_view for each of the lists that were exploded?

If so, then yeah, you're going to have a problem with computing the null count of each of those column_views individually.

To make that efficient, you'd have to do what we do in cudf::split, where we compute the individual null counts in bulk with a single segmented_null_count call: https://github.com/rapidsai/cudf/blob/9c06330363db4da99803a3728b8bf44f9829f0b9/cpp/include/cudf/detail/null_mask.hpp#L186-L204
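The bulk idea can be sketched on the host as follows. This is not the libcudf implementation: it is a CPU illustration that swaps in a prefix sum over the validity bits so that every segment is answered from one pass over the mask. The function name `segmented_null_count_sketch` and the flattened `{begin0, end0, begin1, end1, ...}` layout of `indices` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: null counts for many [begin, end) row ranges of one
// validity mask (bit = 1 means valid, bit = 0 means null), from a single pass.
std::vector<int> segmented_null_count_sketch(const std::vector<std::uint32_t>& mask,
                                             const std::vector<int>& indices)
{
  // indices is assumed to hold flattened pairs: {begin0, end0, begin1, end1, ...}.
  assert(indices.size() % 2 == 0);

  const int num_rows = static_cast<int>(mask.size() * 32);

  // nulls_before[r] = number of null rows in [0, r); built once for all segments.
  std::vector<int> nulls_before(num_rows + 1, 0);
  for (int row = 0; row < num_rows; ++row) {
    const bool valid = ((mask[row / 32] >> (row % 32)) & 1u) != 0;
    nulls_before[row + 1] = nulls_before[row] + (valid ? 0 : 1);
  }

  // Each segment is then a constant-time difference of two prefix values.
  std::vector<int> counts;
  counts.reserve(indices.size() / 2);
  for (std::size_t i = 0; i < indices.size(); i += 2) {
    const int begin = indices[i];
    const int end   = indices[i + 1];
    assert(0 <= begin && begin <= end && end <= num_rows);
    counts.push_back(nulls_before[end] - nulls_before[begin]);
  }
  return counts;
}
```

The point of the design is that the cost of reading the mask is paid once no matter how many column views are carved out of it, instead of once per view.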