grovduck opened this issue 1 year ago
Thanks for an awesome write-up, @grovduck! This is a huge help to get me up to speed on the topic. I suspect this will be a long discussion and I'll have more questions as I get a better understanding of the details, but a few things occur to me off the bat.
To begin with, am I right in assuming that we want to be able to accurately reproduce the results of RF-NN as implemented in `yaImpute`? In other words, no changes to the basic algorithm?
> In `yaImpute`, each forest is actually a classifier (i.e. `RandomForestClassifier`) - the `y` attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).
Is the conversion from continuous to categorical targets up to the user? Otherwise, that seems challenging to handle flexibly with an automated approach. Should we consider offering a `RandomForestRegressor`-based option that uses continuous targets? Or maybe a hybrid option that uses classifier forests for categorical and regressor forests for continuous targets? I'm not sure what the API would look like for that, but just a thought.
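If we did automate the binning, the mechanics are at least straightforward with sklearn. Here's a minimal sketch using `KBinsDiscretizer` (illustrative only, not a proposed API; the variable names are made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
basal_area = rng.gamma(shape=2.0, scale=15.0, size=(200, 1))  # fake continuous attribute

# strategy="quantile" ~ quantile breaks; strategy="uniform" ~ equal interval.
# (sklearn has no natural-breaks strategy; "kmeans" is the closest analog.)
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_classes = binner.fit_transform(basal_area).ravel().astype(int)  # class labels 0-4
```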
> `RandomForestNNRegressor`. We could also go with `RFNNRegressor`, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and use it in subsequent discussion.
My first thought is that `RFNNRegressor` is consistent with the slightly less explicit direction we use for `GNNRegressor` and `MSNRegressor`, but I don't mind `RandomForestNNRegressor` either.
> As this estimator deviates substantially from the other estimators already introduced, I propose we first merge `fb_add_estimators` back into `main` and treat this as a separate feature branch.
Makes sense to me!
> As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that may provide weights when calculating distances.
This is a very interesting idea. Given that there is a forest for each target, it does seem like there's a strong potential for poorly performing forests to add a lot of noise to predictions. Would it make sense to weight forests based on an accuracy metric, such that targets that aren't predicted well will contribute less to the neighbor selection? Would this be handled through an optional argument on `predict`, or an initialization parameter of the estimator, or something else?
Note: remove the "in development" disclaimer from the README once `RFNN` is merged.
We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track it in a separate issue. Because RF-NN is quite different from all other estimators in this package in terms of how distances are calculated, we'll likely need to use different `sklearn` base classes or derive our own to implement it. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R `yaImpute` package, we rely heavily on their implementation details.

## Fitting RF-NN models
One or more `y` attributes (along with multiple `X` covariates) are used to fit different random forests, notably one forest per `y` attribute. In `yaImpute`, each forest is actually a classifier (i.e. `RandomForestClassifier`) - the `y` attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).

Somewhat non-intuitively, in RF-NN the actual values of the terminal nodes (i.e. the class values) in each random forest don't matter, as we only care about the IDs of the terminal nodes of the forests' trees where the references land. For example, if `q` `y` attributes are passed, `q` different forests of `n` trees will be created. For each reference observation passed to `fit`, we want to capture the terminal node ID that it falls into for each tree in each forest. Therefore, the needed output of the `fit` process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of `p` rows (reference plots) by `q * n` columns that stores node IDs. We'll be leaning on the `apply` method of each forest to return the leaf node IDs.
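For example, sklearn's `apply` gives exactly these leaf IDs; a tiny demonstration on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # p=100 reference plots, 4 covariates
y = rng.integers(0, 3, size=100)   # one categorical/binned y attribute

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
leaf_ids = forest.apply(X)         # shape (100, 10): one leaf ID per tree
# With q such forests, np.hstack of each forest's `apply` output gives the
# (p, q * n) node matrix described above.
```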
## Predicting targets using node similarity

Functionally, the `predict` process for each new target is very similar to standard random forests, although the target will be run through all `q` forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into - therefore, we get a `(q * n)` vector of node IDs as a result of passing a target through all forests. At this point, we use a node similarity metric to define the distances from the target to each candidate reference plot. Unlike all other estimators we've introduced so far, which have a neighbor space of more than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the `k` nearest neighbors for each target.
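To make the metric concrete, here's a minimal sketch of the node-similarity distance for a single target (the helper name is hypothetical):

```python
import numpy as np

def node_distances(target_nodes, node_matrix):
    """Hypothetical helper: distance from one target to every reference.

    target_nodes: (q * n,) vector of the target's leaf node IDs
    node_matrix: (p, q * n) matrix of reference leaf node IDs
    """
    # Count of trees where target and reference land in the same leaf.
    matches = (node_matrix == target_nodes).sum(axis=1)
    # Inverse of the shared-node count; no shared nodes -> infinite distance.
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    return dist
```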
## Estimator naming

An initial suggestion for the naming of this estimator would be `RandomForestNNRegressor`. We could also go with `RFNNRegressor`, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and use it in subsequent discussion.

## Design considerations

### fit
We can rely on `RandomForestClassifier` to create the forests for each `y` attribute passed. As of now, it seems reasonable to have `RandomForestNNRegressor` composed of `q` `RandomForestClassifier`s rather than inherit from it. `fit` would introduce two estimator attributes: 1) the forests themselves (proposed name: `forests_`); and 2) the node matrix of the data that was used to fit the model (proposed name: `node_matrix_`). We'll also likely want to have easy access to a few counts, e.g. the number of forests and the number of trees in a forest. These could easily be properties of the class derived from information stored in `self.forests_`.
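A rough skeleton of what that composition could look like, using the names proposed above (everything beyond `forests_` and `node_matrix_` is a placeholder, not a final design):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier

class RandomForestNNRegressor(BaseEstimator):
    """Sketch only: composed of q classifiers rather than inheriting."""

    def __init__(self, n_neighbors=5, n_estimators=50):
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators

    def fit(self, X, Y):
        # One forest per y attribute; columns of Y assumed categorical/binned.
        self.forests_ = [
            RandomForestClassifier(n_estimators=self.n_estimators).fit(X, y)
            for y in np.asarray(Y).T
        ]
        # Node matrix: p rows (references) by q * n columns of leaf IDs.
        self.node_matrix_ = np.hstack([f.apply(X) for f in self.forests_])
        return self

    @property
    def n_forests_(self):
        return len(self.forests_)

    @property
    def n_trees_(self):
        return self.forests_[0].n_estimators
```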
### kneighbors

We will likely need to introduce a new method with the same signature as estimators derived from `TransformedKNeighborsMixin`, i.e. `def kneighbors(self, X=None, n_neighbors=None, return_distance=True)`. The method would run `X` through `self.forests_`, capture a node ID matrix for `X`, and then row-wise calculate distances from `self.node_matrix_`. The return value would be the (optional) distances and row indexes of the `n_neighbors`. Note that `n_neighbors` will be set on the initialization of `RandomForestNNRegressor` just like our other estimators, but `kneighbors` can override this by passing a value to `n_neighbors`.
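A sketch of how that might look, assuming the attributes proposed above (not a final implementation):

```python
import numpy as np

def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
    """Sketch of the proposed method (X=None handling omitted for brevity)."""
    n_neighbors = n_neighbors or self.n_neighbors
    # Leaf IDs for the new targets across all q forests: shape (m, q * n).
    target_nodes = np.hstack([forest.apply(X) for forest in self.forests_])
    # Row-wise counts of leaf nodes shared with each reference: shape (m, p).
    matches = np.array(
        [(self.node_matrix_ == row).sum(axis=1) for row in target_nodes]
    )
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    neigh_ind = np.argsort(dist, axis=1)[:, :n_neighbors]
    if return_distance:
        return np.take_along_axis(dist, neigh_ind, axis=1), neigh_ind
    return neigh_ind
```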
### predict

Initially we had thought we might be able to derive from `KNeighborsRegressor` if we use the callable `weights` parameter, because given weights and `y` attributes, the actual calculation of predicted attributes will be exactly the same as in all estimators that derive from `TransformedKNeighborsMixin`. However, in looking at that implementation more closely, `predict` still relies on calculating the n-dimensional distances and then applying the callable function to the distance array. We are probably better off introducing a new `predict` function that is simply `np.average` with a `weights` parameter that could represent the inverse distances calculated from `kneighbors`.
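In other words, something along these lines (a sketch; assumes the `kneighbors` above and a hypothetical `self.y_` array holding the reference `y` values):

```python
import numpy as np

def predict(self, X):
    """Sketch: distance-weighted average of each target's neighbor y values."""
    dist, neigh_ind = self.kneighbors(X)
    weights = 1.0 / dist  # inverse distance, i.e. the shared-node counts
    # self.y_ is assumed to hold the (p, q) reference y values.
    return np.array(
        [
            np.average(self.y_[ind], axis=0, weights=w)
            for ind, w in zip(neigh_ind, weights)
        ]
    )
```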
## Timing

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge `fb_add_estimators` back into `main` and treat this as a separate feature branch.

## Other possible enhancements
As in `yaImpute`, our ordination-based estimators (`GNNRegressor` and `MSNRegressor`) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight the axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that may provide weights when calculating distances. I've not seen anything to capture this, but also have not looked very closely at all diagnostics from `RandomForestClassifier`.
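Purely speculatively, if such a measure existed (e.g. each forest's out-of-bag score, available when fitting with `oob_score=True`), the distance calculation might weight node matches per forest:

```python
import numpy as np

def weighted_node_distances(target_nodes, node_matrix, forest_weights, n_trees):
    """Speculative sketch: weight node matches by a per-forest importance.

    forest_weights: length-q array, e.g. each forest's oob_score_
    (requires fitting the classifiers with oob_score=True).
    """
    # Expand per-forest weights to one weight per tree: shape (q * n,).
    tree_weights = np.repeat(forest_weights, n_trees)
    # Weighted count of shared leaf nodes instead of a raw count.
    matches = ((node_matrix == target_nodes) * tree_weights).sum(axis=1)
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    return dist
```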