holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

Consistent handling of NaNs in where reductions #1215

Closed · ianthomas23 closed 1 year ago

ianthomas23 commented 1 year ago

Now that we have support for where reductions (#1214 for Dask, with CUDA to follow), there is a question of how to handle NaNs. We need to decide on a consistent approach.

1. Background

```python
ds.max("value")
```

When this is used as the agg in a Canvas aggregating function (e.g. Canvas.points), the return is an array of pixels, each pixel containing the maximum of the supplied DataFrame's "value" column over all rows that contribute to that pixel. If a row's "value" is NaN then it is skipped, so it does not contribute to the aggregation.
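A minimal sketch (the DataFrame contents and column names are hypothetical):

```python
import numpy as np
import pandas as pd
import datashader as ds

# Hypothetical data: four points, one with a NaN "value" that will be skipped.
df = pd.DataFrame({
    "x":     [0.1, 0.2, 0.2, 0.8],
    "y":     [0.1, 0.4, 0.4, 0.6],
    "value": [1.0, 3.0, np.nan, 2.0],
})

canvas = ds.Canvas(plot_width=4, plot_height=4)
agg = canvas.points(df, "x", "y", agg=ds.max("value"))
# agg is a 4x4 xarray.DataArray: each pixel holds the max non-NaN "value"
# of the rows that fall in it.
```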

2. where returning row index

```python
ds.where(ds.max("value"))
```

This is the simplest example of the new where reduction. Each pixel of the returned array is the index (an integer >= 0) of the row containing the maximum of the "value" column. A NaN "value" is skipped as above, so that row does not contribute to the returned array. Pixels for which there are no contributing rows contain -1 to indicate no data.
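Reusing the hypothetical df and canvas from the sketch above:

```python
# Each pixel of agg is now the int64 index into df of the row with the
# maximum "value" there; -1 marks pixels with no contributing rows.
agg = canvas.points(df, "x", "y", agg=ds.where(ds.max("value")))
```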

3. where returning another column

This is the scenario we need to decide about.

```python
ds.where(ds.max("value"), "other")
```

Here both "value" and "other" are DataFrame columns. Let's assume they are both float64 and may contain NaNs.

If "value" is NaN then consistent with item 2 above we skip the row. It doesn't matter what "other" is here.

If "value" is not NaN but "other" is, there are two possibilities for how to handle this.

Option A

If "other" is NaN, skip the row. Hence if either of "value" or "other" are NaN we skip the row. This needs look-ahead functionality to check what "other" is before we deal with ds.max("value"). The returned array can contain NaNs, which correspond to pixels for which there are no contributing rows.

Option B

If "other" is NaN treat it in the same way as any "other" value. This therefore doesn't need the lookahead of "other", we just blindly copy the "other" if it corresponds to the maximum "value". The returned array can contain NaNs. These correspond either to pixels for which there are no contributing rows, or pixels for which the only contributing rows have a non-NaN "value" but a NaN "other".

It seems to me that option A is the most logical, but I can see an argument for option B.
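To make the difference concrete, here is a plain-pandas sketch (not datashader code) of what each option would compute for one pixel's contributing rows:

```python
import numpy as np
import pandas as pd

# Toy rows that all fall in the same pixel.
pixel_rows = pd.DataFrame({
    "value": [5.0, 9.0, np.nan],
    "other": [10.0, np.nan, 30.0],
})

# Rows with NaN "value" never contribute (item 2 above).
valid = pixel_rows[pixel_rows["value"].notna()]

# Option A: also require a non-NaN "other" before taking the max.
a_rows = valid[valid["other"].notna()]
option_a = a_rows.loc[a_rows["value"].idxmax(), "other"] if len(a_rows) else np.nan

# Option B: take the max-"value" row and copy its "other" verbatim,
# even if that "other" is NaN.
option_b = valid.loc[valid["value"].idxmax(), "other"] if len(valid) else np.nan

print(option_a)  # 10.0 -- the largest "value" that also has an "other"
print(option_b)  # nan  -- the true max "value" row has a NaN "other"
```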

jbednar commented 1 year ago

Thanks for raising this; it's important to be consistent about it!

For 1, seems important to mention that the final result for a pixel will be NaN if no non-NaN values are found.

For 2, I guess the reason to return -1 instead of NaN is that the type is typically integer? I'm not sure if xarray now supports pandas' nullable types so that even integers can have a null value, but that sure would be convenient.
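(For reference, pandas' nullable integer dtype looks like this; whether it survives the trip through xarray is the open question:)

```python
import pandas as pd

s = pd.Series([1, None], dtype="Int64")  # capital-I nullable integer dtype
print(s.iloc[1])  # <NA>, a proper null in an integer Series
```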

For 3, to make it concrete, let's consider a case where "value" is a magnitude of some sort, while "other" is an id. Then with option B, we'd get a NaN if the row with the max "value" is missing its id field, while for option A we'd get the id of the max "value" that does have an id. Is that right?

If so, I can imagine cases for which either A or B is appropriate, i.e. do I want some datapoint that I can look up, as close to the max as possible (A), or do I want the true max, even if it doesn't have an id (B)? I think B is a safer default, though it might be useful to offer A for people who don't much care about the max in particular but just want some explorable ids.

ianthomas23 commented 1 year ago

> For 1, seems important to mention that the final result for a pixel will be NaN if no non-NaN values are found.

Agreed.

> For 2, I guess the reason to return -1 instead of NaN is that the type is typically integer? I'm not sure if xarray now supports pandas' nullable types so that even integers can have a null value, but that sure would be convenient.

Yes, -1 is our "no data" value: row indexes are int64 and always >= 0, so I chose -1 as it is the easiest number to use outside of this range. Any pandas or xarray support for nullable integers doesn't help us; within datashader we have to use numpy arrays with scalar dtypes, as that is what numba supports.
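A sketch of what that implies for a row-index aggregate buffer (shape hypothetical):

```python
import numpy as np

height, width = 4, 4  # hypothetical canvas shape

# NaN is not representable in int64, and numba-compiled kernels require
# plain numpy scalar dtypes, so -1 serves as the "no data" sentinel.
row_index_agg = np.full((height, width), -1, dtype=np.int64)
```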

> For 3, to make it concrete, let's consider a case where "value" is a magnitude of some sort, while "other" is an id. Then with option B, we'd get a NaN if the row with the max "value" is missing its id field, while for option A we'd get the id of the max "value" that does have an id. Is that right?

Yes.

> If so, I can imagine cases for which either A or B is appropriate, i.e. do I want some datapoint that I can look up, as close to the max as possible (A), or do I want the true max, even if it doesn't have an id (B)? I think B is a safer default, though it might be useful to offer A for people who don't much care about the max in particular but just want some explorable ids.

We could offer both via a keyword argument to the where reduction, but I would rather not do this, or at least put it off for now. Just thinking of a possible name for such a kwarg is problematic; the best I have so far is lookup_column_ignore_nan or lookup_column_ignore_invalid :worried:.

So if we pick one now, let's go for B, as that is your preference and the safer default. It is also easier to explain and easier to implement. It is a one-liner in user code to drop rows for which an explorable id does not exist and hence obtain option A (sketched below). When we get round to documenting some examples of this functionality we can include a section about these options and how to obtain and use them.
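A sketch of that workaround (data and column names hypothetical):

```python
import numpy as np
import pandas as pd
import datashader as ds

# Hypothetical data: "other" is the explorable id, possibly missing.
df = pd.DataFrame({
    "x":     [0.1, 0.2, 0.3],
    "y":     [0.1, 0.2, 0.3],
    "value": [5.0, 9.0, 7.0],
    "other": [10.0, np.nan, 30.0],
})
canvas = ds.Canvas(plot_width=4, plot_height=4)

# Option B (the default being adopted): a NaN "other" is copied as-is.
agg_b = canvas.points(df, "x", "y", agg=ds.where(ds.max("value"), "other"))

# Option A, recovered in user code: drop id-less rows first.
agg_a = canvas.points(df.dropna(subset=["other"]),
                      "x", "y", agg=ds.where(ds.max("value"), "other"))
```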