grf-labs / grf

Generalized Random Forests
https://grf-labs.github.io/grf/
GNU General Public License v3.0

Random forests with missing X-values #457

Closed · swager closed this 4 years ago

swager commented 5 years ago

Tree-based methods are known for their ability to seamlessly handle missing values in the feature matrix X. We should make GRF able to handle such missingness as well. As a first attempt, we plan to implement the MIA (missing incorporated in attributes) method proposed by Twala et al. (2008).

At a high level, the MIA approach can be thought of as creating two copies of each feature Xj with missingness: one where the missing values are coded as -infinity, and one where they are coded as +infinity. CART splitting then proceeds as usual on the augmented features.
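For concreteness, a minimal sketch of this duplication view (NumPy-based; `mia_duplicate` is an illustrative helper, not part of grf):

```python
import numpy as np

def mia_duplicate(x):
    # MIA via duplication: for each column, make one copy where NA is
    # coded as -inf (NAs go left at any split) and one where NA is
    # coded as +inf (NAs go right). CART splitting then runs unchanged.
    x_lo = np.where(np.isnan(x), -np.inf, x)
    x_hi = np.where(np.isnan(x), np.inf, x)
    return np.hstack([x_lo, x_hi])  # n x 2p: doubles the feature count
```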

The above simple implementation, however, is not desirable from a memory-use standpoint, as it involves doubling the number of features. We should thus add to GRF an implementation of MIA splitting that doesn't involve physically duplicating the data.

Twala, B. E. T. H., M. C. Jones, and David J. Hand. "Good methods for coping with missing data in decision trees." Pattern Recognition Letters 29.7 (2008): 950-956.

erikcs commented 5 years ago

> The above simple implementation, however, is not desirable from a memory-use standpoint, as it involves doubling the number of features. We should thus add to GRF an implementation of MIA splitting that doesn't involve physically duplicating the data.

The Josse et al. (2019) replication code duplicates the data, and the same is done in partykit's implementation of MIA.

We can implement that same logic in a new Data() class with zero extra memory requirements by giving it a new .get() (or operator[]) that does the appropriate index arithmetic: offset by the number of columns, and for NA entries return the appropriate placeholder (-/+ Infinity) depending on whether we are "on the left or right". A sketch:

```python
import math

class MIAData:                         # wraps an n x p matrix as 2p virtual columns
    def __init__(self, x):
        self.x, self.p = x, len(x[0])
    def get(self, i, j):               # NA -> -inf for j < p, +inf for j >= p
        v = self.x[i][j % self.p]
        return v if not math.isnan(v) else (-math.inf if j < self.p else math.inf)
```
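Under that sketch, a quick check of the indexing convention (hypothetical toy data with n = 2 and p = 2, so virtual columns run from 0 to 3):

```python
import math

x = [[1.0, math.nan],
     [3.0, 4.0]]
d = MIAData(x)
assert d.get(0, 1) == -math.inf   # "left" copy of column 1: NA coded as -inf
assert d.get(0, 3) == math.inf    # "right" copy of column 1: NA coded as +inf
assert d.get(1, 3) == 4.0         # non-missing entries pass through unchanged
```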

(I'm working on the details.)

Josse, Julie, et al. "On the consistency of supervised learning with missing values." arXiv preprint arXiv:1902.06931 (2019).

erikcs commented 5 years ago

Regarding tree visualizations: it does not seem straightforward to take MIA into account.

swager commented 5 years ago

Following up on this (we had discussed in person a while ago, but just want to make sure we have a record on here also): Duplicating the data works fine, but is quite inefficient computationally. Also, it doesn't play so nicely with feature sampling. So our plan here is to develop an efficient implementation of MIA that doesn't require data duplication.

erikcs commented 4 years ago

A quick note on data duplication (even without extra memory overhead): when mtry is less than the number of columns, both NA split directions might not be explored for a given feature, since the two copies of each feature are sampled independently. The efficient splitting-rule modification should not share this behavior; there, both directions should always be explored. See the sketch below.
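To illustrate the contrast, a hedged sketch of the two sampling schemes (`p`, `mtry`, and the split-direction encoding are illustrative, not grf's actual implementation):

```python
import random

p, mtry = 10, 3

# Duplication view: sample mtry of the 2p physical columns. A feature's
# -inf and +inf copies are drawn independently, so a sampled feature may
# be tried with only one NA direction.
cols = random.sample(range(2 * p), mtry)
splits_dup = [(j % p, "NA_left" if j < p else "NA_right") for j in cols]

# Efficient MIA splitting: sample mtry underlying features, and always
# consider both NA directions for each sampled feature.
feats = random.sample(range(p), mtry)
splits_mia = [(j, d) for j in feats for d in ("NA_left", "NA_right")]
```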