groupBy method ignores the rejectNA option

eepstein commented 11 years ago

This seems to be a problem with how sum(), and in turn mean() and possibly other methods, are implemented. They don't seem to detect non-numeric values except as the very first element of an array.

Use case: grouping across rows where some rows have null (or NaN) values for certain columns. Average should be across the non-null, numeric values.

The docs suggest that skipping such values is a supported feature, but the code seems to indicate otherwise.
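
The requested behaviour can be sketched in plain JavaScript, independent of the library. The helper names below (isNumeric, sumSkippingNA, meanSkippingNA, groupMeanSkippingNA) are hypothetical and are not part of the Miso Dataset API; this is only a rough illustration of checking every element, not just the first one.

```javascript
// Hypothetical helpers for illustration; not part of the Miso Dataset API.

// True only for finite numbers, so null, undefined, NaN and strings are rejected.
function isNumeric(value) {
  return typeof value === "number" && Number.isFinite(value);
}

// Sum every numeric element, checking each one rather than only the first.
function sumSkippingNA(values) {
  return values.filter(isNumeric).reduce((total, v) => total + v, 0);
}

// Mean over the numeric elements only; null when nothing numeric was observed.
function meanSkippingNA(values) {
  const numeric = values.filter(isNumeric);
  if (numeric.length === 0) return null;
  return sumSkippingNA(numeric) / numeric.length;
}

// Group rows by keyColumn and average valueColumn within each group.
function groupMeanSkippingNA(rows, keyColumn, valueColumn) {
  const groups = new Map();
  for (const row of rows) {
    const key = row[keyColumn];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row[valueColumn]);
  }
  const result = {};
  for (const [key, values] of groups) {
    result[key] = meanSkippingNA(values);
  }
  return result;
}

const rows = [
  { group: "x", value: 10 },
  { group: "x", value: null },
  { group: "x", value: 20 },
  { group: "y", value: NaN },
  { group: "y", value: 7 },
];

console.log(groupMeanSkippingNA(rows, "group", "value")); // { x: 15, y: 7 }
```

Returning null for a group with no numeric values is one possible choice; returning NaN would be equally defensible.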

iros commented 10 years ago

Part of the problem is deciding what one should do in this situation. How do you sum up rows that have NAs in them? Is it still valid to sum up rows when some of them don't have values? We cannot assume that those values should be counted as zeroes. Do you have a use case that you can suggest?

protobi commented 10 years ago

An example would be average systolic blood pressure reading across multiple patient visits. It might not be measured every time, but the patient presumably still had one that was simply unobserved.

In R, it's handled this way:

mean(c(0, 5, NULL, 10, NULL, 15))  # 7.5
sum(c(0, 5, NULL, 10, NULL, 15))   # 30

R also differentiates NULL from NA (analogous to null and undefined):

mean(c(0, 5, NA, 10, NA, 15))  # NA
sum(c(0, 5, NA, 10, NA, 15))   # NA
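
A rough JavaScript analogue of those semantics, assuming null plays the role of R's NULL (dropped) and undefined or NaN plays the role of NA (propagated). The function rLikeMean and its naRm option are hypothetical names, loosely modelled on R's na.rm argument; they are not an existing library API.

```javascript
// Hypothetical sketch: null is dropped like R's NULL,
// undefined/NaN propagates like R's NA unless naRm is true.
function rLikeMean(values, { naRm = false } = {}) {
  // Drop nulls up front, as c() does with NULL in R.
  const present = values.filter((v) => v !== null);

  const isNA = (v) => v === undefined || Number.isNaN(v);

  // Any NA poisons the result unless the caller opts in to removing them.
  if (present.some(isNA) && !naRm) return NaN;

  const numeric = present.filter((v) => !isNA(v));
  if (numeric.length === 0) return NaN;
  return numeric.reduce((total, v) => total + v, 0) / numeric.length;
}

console.log(rLikeMean([0, 5, null, 10, null, 15])); // 7.5
console.log(rLikeMean([0, 5, undefined, 10, undefined, 15])); // NaN
console.log(rLikeMean([0, 5, undefined, 10, undefined, 15], { naRm: true })); // 7.5
```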

Surveys can have cases where you might want missing values to be treated as zero in a mean, such as average wait time in a survey with skip patterns, e.g.

"Q1. Did you have to wait for the representative? [yes, no]" If Q1 = 'yes', then ask:
"Q2. How many minutes did you wait?"

But then the analyst would be expected to explicitly recode missing values as zero, and would not expect a second kind of parameter for handling NA in the operand.
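
To make the difference concrete, here is a small sketch of explicit recoding before averaging. The helper names (recodeMissingAsZero, plainMean) and the sample wait times are made up for illustration.

```javascript
// Hypothetical helpers and made-up data, for illustration only.

// Explicit recode step: the analyst decides that missing means zero.
function recodeMissingAsZero(values) {
  return values.map((v) =>
    v === null || v === undefined || Number.isNaN(v) ? 0 : v
  );
}

// Ordinary mean with no NA handling of its own.
function plainMean(values) {
  return values.reduce((total, v) => total + v, 0) / values.length;
}

// Q2 answers in minutes; null where Q1 was 'no' and Q2 was skipped.
const waitMinutes = [10, null, 20, null];

// Mean among respondents who actually waited.
const observedOnly = waitMinutes.filter(
  (v) => typeof v === "number" && !Number.isNaN(v)
);
console.log(plainMean(observedOnly)); // 15

// Mean across all respondents, counting skipped answers as a zero-minute wait.
console.log(plainMean(recodeMissingAsZero(waitMinutes))); // 7.5
```

Keeping the recode as a separate step leaves the aggregation function with a single, predictable treatment of missing values.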

misoproject / dataset

groupBy method ignores the rejectNA option #205