
BUG(?) Missing-values in RandomForest only during inference time shouldn't send missing-values to the child with most samples #28772

Open adam2392 opened 4 months ago

adam2392 commented 4 months ago

Currently, when missing values occur only in the test dataset (i.e., the RandomForest was constructed without seeing any missing values), the policy is to send the missing values to the child with the most samples. This amounts, in some sense, to imputing the missing values using the data in the child with the most samples. An issue here is that this may bias the tree prediction towards, say, a class with more samples in the training dataset.

For example, say there are 1000 training samples of class 1 and 10 training samples of class 0, and then during test time there are some NaNs. The predictions would then be biased towards class 1, whereas they should really be uninformative, because the NaNs at test time are treated as missing completely at random.

Proposed Solution

An alternative, more sensible strategy: when NaNs are not encountered during training but show up in the testing data, they should be sent stochastically down the tree using weights.

This ensures that there is no bias towards the class "with more samples". It can be implemented by allowing missing_go_to_left (https://github.com/scikit-learn/scikit-learn/blob/6bf0ba5257d55ea0f3c89d7c4762096d84c052cc/sklearn/tree/_splitter.pxd#L28) to take the value 2. A value of 2 would indicate that missing values were not observed during training, so the split direction for them should be chosen stochastically at prediction time.

Overall, it's a very simple change, and I can also implement relevant unit-tests.
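For illustration, here is a minimal Python sketch of the prediction-time routing. The `Node` fields, the string return values, and the sentinel value 2 are made-up names for this sketch, not the actual Cython splitter code:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Node:
    missing_go_to_left: int  # 0: go right, 1: go left, 2 (proposed): stochastic
    n_left: int              # training samples routed to the left child
    n_right: int             # training samples routed to the right child


def route_missing(node: Node, rng: np.random.Generator) -> str:
    """Decide where a test-time NaN goes at this node (sketch only)."""
    if node.missing_go_to_left == 2:
        # No NaNs were seen for this feature during training: pick a child
        # stochastically, weighted by the number of training samples it holds.
        p_left = node.n_left / (node.n_left + node.n_right)
        return "left" if rng.random() < p_left else "right"
    # Current behavior: the direction was fixed during training.
    return "left" if node.missing_go_to_left == 1 else "right"


rng = np.random.default_rng(0)
print(route_missing(Node(missing_go_to_left=2, n_left=1000, n_right=10), rng))
```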

cc: @thomasjpfan who implemented the original missing-value support in RandomForest.

Related

xref: This policy will also impact #27966 and #28268

This is also an issue in other estimators that handle NaNs: https://scikit-learn.org/stable/modules/ensemble.html#missing-values-support

thomasjpfan commented 4 months ago

For random forest, I used the same behavior as in HistGradientBoosting, which adopted the behavior from LightGBM. @NicolasHug Do you have more context on the missing value behavior for HistGradientBoosting?

adam2392 commented 4 months ago

If it helps the discussion, I found this old issue in LightGBM, which seems to reflect their docs (I'm unsure because I can't find a specific line mentioning exactly how they treat this edge case).

To summarize, the old issue seems to state that missing values at test time (that weren't seen during training) are replaced with 0. This seems like a reasonable imputation if the feature is 0-centered, but seems a bit odd overall. The stochastic treatment I specified in the proposed solution would:

  1. still work similarly to replacing with 0s when the data is 0-centered
  2. (more importantly) work better than replacing with 0s when the data is not 0-centered, because of the issues outlined in the issue description (a toy illustration follows this list).
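A toy illustration of point 2, with made-up numbers (a feature centered around 5 and an arbitrary split threshold):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=1.0, size=1000)  # feature is not 0-centered
threshold = 4.2                                       # made-up learned split threshold

# "Impute with 0": every test-time NaN lands on the same side of the split.
print(0.0 <= threshold)  # always True here, regardless of the training data

# Stochastic routing instead follows the training distribution at this node.
p_left = np.mean(x_train <= threshold)
print(f"P(left) under stochastic routing: {p_left:.2f}")
```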

Let me know if there's some analysis I'm missing w.r.t. LightGBM on why this behavior is the way it is.

glemaitre commented 4 months ago

Is there any overhead from the stochastic process? For instance, for tie-breaking in nearest-neighbors we did not use a stochastic process because it would have been detrimental to computational performance, causing a huge regression.

adam2392 commented 4 months ago

I imagine there is some overhead due to potentially querying an RNG many times. Though this would also be an edge case, since it implies there are many missing values at test time that were not seen during training. That is, the only time we would stochastically route left/right is when a specific feature is used at a node that saw no missing values for that feature in the training data. I personally think this edge case is less worrisome than the bias point I raised, but it is of course plausible in practice.

I think the situation you describe is a good one and makes the point that things may be more complex; it could plausibly occur in practice. Some ideas:

  1. We treat it as an edge case, accept that it would not come up as often as tie-breaking in nearest-neighbors, and just allow stochastic processing of the test data.
  2. We let the user specify this behavior at training time, so that missing_go_to_left can be set accordingly when no missing values are encountered.
  3. We let the user specify this behavior at runtime. I'm unsure how we could best implement this, though, so I would have to take a look at the code.

betatim commented 4 months ago

If it is easy to implement the "make a weighted random choice" option (what you propose), then we could run a benchmark to see what the performance hit is for different cases (many missing values in many features, many missing values in a few features, few missing values in few features, etc.). That would give us a way to decide whether performance is something to worry about when making a decision.
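Something like the following scaffold could be the baseline for that benchmark (this assumes scikit-learn >= 1.4, where forests accept NaNs at predict time; the parameter grid is just a guess at the cases above, and the stochastic variant would be swapped in once implemented):

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Vary how many features and how many values per feature are missing at test
# time; the training data is kept fully observed.
for n_missing_features in (1, 5, 20):
    for missing_rate in (0.01, 0.5):
        X_test = X.copy()
        cols = rng.choice(X.shape[1], n_missing_features, replace=False)
        mask = rng.random((X.shape[0], n_missing_features)) < missing_rate
        for j, col in enumerate(cols):
            X_test[mask[:, j], col] = np.nan
        tic = time.perf_counter()
        clf.predict(X_test)
        toc = time.perf_counter()
        print(f"{n_missing_features} features, rate {missing_rate}: {toc - tic:.3f}s")
```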

glemaitre commented 4 months ago

In terms of an application use case, I'm also wondering if we should not error/warn when a user starts to provide missing values at test time while we did not see any during training. @adam2392, do you have a legitimate use case where this is kind of normal?

adam2392 commented 4 months ago

I think a warning would be nice, but an error might be overkill: when a NaN pops up in the testing dataset (but not in training), ideally we would like to ignore its effect. If we errored out, the user would need to fill in the NaN value or drop the sample altogether.

On the other hand, the issue right now is that by sending the sample down the child with more samples, we bias the NaN towards a class (in classification). I also see cases where one might want the current behavior; it all boils down to the assumption about why the NaNs show up. Ideally, both settings should be supported if it's not too complex.

Shall I do a quick implementation and try it out on a benchmark before continuing the discussion?

NicolasHug commented 4 months ago

I don't think we should raise a warning, unless we can identify cases where missing values at test time are clearly unexpected.

There's no right or wrong strategy here, I believe, but we should note that the strategy suggested above means that predict(X) called twice on the same X may lead to different predictions. Do we have other predictors in scikit-learn that are inherently random (excluding tie-breaking, etc.)?

adam2392 commented 2 months ago

Responding to @glemaitre

> In terms of an application use case, I'm also wondering if we should not error/warn when a user starts to provide missing values at test time while we did not see any during training. @adam2392, do you have a legitimate use case where this is kind of normal?

and @NicolasHug

> There's no right or wrong strategy here, I believe, but we should note that the strategy suggested above means that predict(X) called twice on the same X may lead to different predictions. Do we have other predictors in scikit-learn that are inherently random (excluding tie-breaking, etc.)?

I've thought about it a bit more. I think the current strategy is right only if the missing values occurred Missing At Random (MAR). If the missing values during test time occurred Missing Completely At Random (MCAR), it is in fact the wrong strategy, and a very wrong one.

MAR assumes missingness depends on the observed data. In that case, if missing values occurred solely dependent on covariates in the observed data, then it seems sensible (though I could be missing additional nuances) to send them down the branch with the most samples.

However, MCAR assumes missingness is completely random in the observed test dataset. If missing values suddenly occur only within the test samples, this is a pretty reasonable assumption. Consider, for example: you spend a lot of time collecting clean training data and then release the model. Missing values then come in at random at test time, due to instrument noise, because you're no longer in an experimental setup. In that case, missing values should not influence the predictions even at test time.

This is in fact probably not an edge case, and is definitely plausible in a lot of other use cases. I suspect people work around it with hacks, of course, like imputing the data. However, you already have the tree structure, which can and should be leveraged.

Hypothesis: if we generate training/testing data from the same population data-generating model and then induce missingness on the test set completely at random, we will see biased predictions and posterior probabilities.
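A rough sketch of that experiment (the amount of imbalance and missingness are arbitrary choices, and whether the shift is large will depend on the data; requires scikit-learn >= 1.4 so the forest accepts NaNs at predict time):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Imbalanced binary problem; train and test come from the same distribution.
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Inject MCAR missingness into the test set only.
X_test_nan = X_test.copy()
mask = rng.random(X_test.shape) < 0.3
X_test_nan[mask] = np.nan

# If the hypothesis holds, the posterior for the minority class shifts
# towards the majority class once NaNs are present.
proba_clean = clf.predict_proba(X_test)[:, 1].mean()
proba_nan = clf.predict_proba(X_test_nan)[:, 1].mean()
print(f"mean P(minority class) without NaNs: {proba_clean:.4f}")
print(f"mean P(minority class) with MCAR NaNs: {proba_nan:.4f}")
```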

adam2392 commented 1 month ago

Relevant paper showing empirical evidence that sending samples to the majority child node is not as good as "random" when the sample contains a category unseen during training. In the context of missing values, a missing-value indicator is a binary category.

https://jmlr.csail.mit.edu/papers/volume19/16-474/16-474.pdf#page14

Summary

The paper looks at categorical splits where a category is not seen during training, and asks what to do at predict time. This is termed the "absent level" problem. It is analogous to not seeing missing values during training but encountering them during testing: which child node do we send that sample down?

The author runs experiments showing that randomly sending samples with an absent level down the tree is the best unbiased heuristic.

I think it's reasonable to impose that logic during test time.

Proposal

Extend missing_go_to_left with a third possible value indicating that the sample should be randomly sent down the tree. This would only apply when no missing values were seen during training but missing values appear during prediction.
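A small sketch of what the fit-time side could look like (pseudo-logic with hypothetical argument names, not the actual splitter code):

```python
# Decide the stored value of missing_go_to_left for a fitted split (sketch).
def resolve_missing_go_to_left(n_missing_in_node: int,
                               best_split_sends_missing_left: bool) -> int:
    if n_missing_in_node == 0:
        # Proposed: no NaNs seen for this feature while fitting this node, so
        # mark it for stochastic routing of test-time NaNs.
        return 2
    # Existing behavior: keep the direction chosen during training.
    return 1 if best_split_sends_missing_left else 0


print(resolve_missing_go_to_left(0, True))   # -> 2 (stochastic at predict time)
print(resolve_missing_go_to_left(12, True))  # -> 1 (always left)
```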