grovduck opened this issue 1 year ago
Thanks for an awesome write-up, @grovduck! This is a huge help to get me up to speed on the topic. I suspect this will be a long discussion and I'll have more questions as I get a better understanding of the details, but a few things occur to me off the bat.
To begin with, am I right in assuming that we want to be able to accurately reproduce the results of RF-NN as implemented in `yaImpute`? In other words, no changes to the basic algorithm?
> In `yaImpute`, each forest is actually a classifier (i.e. `RandomForestClassifier`) - the `y` attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).
Is the conversion from continuous to categorical targets up to the user? Otherwise, that seems challenging to handle flexibly with an automated approach. Should we consider offering a `RandomForestRegressor`-based option that uses continuous targets? Or maybe a hybrid option that uses classifier forests for categorical and regressor forests for continuous targets? I'm not sure what the API would look like for that, but just a thought.
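If we did automate the binning, the mechanics are at least straightforward with sklearn. Here's a minimal sketch using `KBinsDiscretizer` (illustrative only, not a proposed API; the variable names are made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(42)
basal_area = rng.gamma(shape=2.0, scale=15.0, size=(200, 1))  # fake continuous attribute

# strategy="quantile" ~ quantile breaks; strategy="uniform" ~ equal interval.
# (sklearn has no natural-breaks strategy; "kmeans" is the closest analog.)
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_classes = binner.fit_transform(basal_area).ravel().astype(int)  # class labels 0-4
```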
> `RandomForestNNRegressor`. We could also go with `RFNNRegressor`, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and use it in subsequent discussion.
My first thought is that `RFNNRegressor` is consistent with the slightly less explicit direction we use for `GNNRegressor` and `MSNRegressor`, but I don't mind `RandomForestNNRegressor` either.
> As this estimator deviates substantially from the other estimators already introduced, I propose we first merge `fb_add_estimators` back into `main` and treat this as a separate feature branch.
Makes sense to me!
> As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that may provide weights when calculating distances.
This is a very interesting idea. Given that there is a forest for each target, it does seem like there's a strong potential for poorly performing forests to add a lot of noise to predictions. Would it make sense to weight forests based on an accuracy metric, such that targets that aren't predicted well will contribute less to the neighbor selection? Would this be handled through an optional argument on `predict`, or an initialization parameter of the estimator, or something else?
Note: remove the "in development" disclaimer from the README once `RFNN` is merged.
We started to discuss the design and implementation of RF-NN in #22, but thought it would be better to track it in a separate issue. Because RF-NN is quite different from all other estimators in this package in terms of how distances are calculated, we'll likely need to use different `sklearn` base classes or derive our own to implement it. The goal of this issue is to lay out the design of RF-NN and decide how best to tackle it. As RF-NN was first introduced in the R `yaImpute` package, we rely heavily on their implementation details.

## Fitting RF-NN models
One or more `y` attributes (along with multiple `X` covariates) are used to fit different random forests, notably one forest per `y` attribute. In `yaImpute`, each forest is actually a classifier (i.e. `RandomForestClassifier`) - the `y` attributes passed are either categorical attributes (something like vegetation class) or continuous attributes that are binned into classes using some classification mode (equal interval, quantile, natural breaks, etc.).

Somewhat non-intuitively, in RF-NN the actual values of the terminal nodes (i.e. the class values) in each random forest don't matter, as we only care about the IDs of the terminal nodes of the forests' trees where the references land. For example, if `q` `y` attributes are passed, `q` different forests of `n` trees will be created. For each reference observation passed to `fit`, we want to capture the terminal node ID that it falls into for each tree in each forest. Therefore, the needed output of the `fit` process is both the specification of the forests that were created (to run additional targets) and the "node matrix" of `p` rows (reference plots) by `q * n` columns that stores node IDs. We'll be leaning on the `apply` method of each forest to return the leaf node IDs.
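For example, sklearn's `apply` gives exactly these leaf IDs; a tiny demonstration on synthetic data (illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))      # p=100 reference plots, 4 covariates
y = rng.integers(0, 3, size=100)   # one categorical/binned y attribute

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
leaf_ids = forest.apply(X)         # shape (100, 10): one leaf ID per tree
# With q such forests, np.hstack of each forest's `apply` output gives the
# (p, q * n) node matrix described above.
```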
## Predicting targets using node similarity

Functionally, the `predict` process for each new target is very similar to standard random forests, although the target will be run through all `q` forests rather than just a single forest. As with the references, we only use the IDs of the terminal nodes that the target falls into - therefore, we get a `(q * n)` vector of node IDs as a result of passing a target through all forests. At this point, we use a node similarity metric to define the distances from the target to each candidate reference plot. Unlike all other estimators we've introduced so far, which have a neighbor space of more than one dimension, distances in RF-NN are calculated as the inverse of the number of nodes that the reference and target share in common. Neighbors are ranked on this distance to find the `k` nearest neighbors for each target.
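To make the metric concrete, here's a minimal sketch of the node-similarity distance for a single target (the helper name is hypothetical):

```python
import numpy as np

def node_distances(target_nodes, node_matrix):
    """Hypothetical helper: distance from one target to every reference.

    target_nodes: (q * n,) vector of the target's leaf node IDs
    node_matrix: (p, q * n) matrix of reference leaf node IDs
    """
    # Count of trees where target and reference land in the same leaf.
    matches = (node_matrix == target_nodes).sum(axis=1)
    # Inverse of the shared-node count; no shared nodes -> infinite distance.
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    return dist
```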
## Estimator naming

An initial suggestion for the naming of this estimator would be `RandomForestNNRegressor`. We could also go with `RFNNRegressor`, as the term RFNN is somewhat well understood by those in the imputation community. To be explicit, I propose the former name and use it in subsequent discussion.

## Design considerations

### fit
We can rely on `RandomForestClassifier` to create the forests for each `y` attribute passed. As of now, it seems reasonable to have `RandomForestNNRegressor` composed of `q` `RandomForestClassifier`s rather than inherit from it. `fit` would introduce two estimator attributes: 1) the forests themselves (proposed name: `forests_`); and 2) the node matrix of the data that was used to fit the model (proposed name: `node_matrix_`). We'll also likely want to have easy access to a few counts, e.g. the number of forests and the number of trees in a forest. These could easily be properties of the class derived from information stored in `self.forests_`.
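A rough skeleton of what that composition could look like, using the names proposed above (everything beyond `forests_` and `node_matrix_` is a placeholder, not a final design):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.ensemble import RandomForestClassifier

class RandomForestNNRegressor(BaseEstimator):
    """Sketch only: composed of q classifiers rather than inheriting."""

    def __init__(self, n_neighbors=5, n_estimators=50):
        self.n_neighbors = n_neighbors
        self.n_estimators = n_estimators

    def fit(self, X, Y):
        # One forest per y attribute; columns of Y assumed categorical/binned.
        self.forests_ = [
            RandomForestClassifier(n_estimators=self.n_estimators).fit(X, y)
            for y in np.asarray(Y).T
        ]
        # Node matrix: p rows (references) by q * n columns of leaf IDs.
        self.node_matrix_ = np.hstack([f.apply(X) for f in self.forests_])
        return self

    @property
    def n_forests_(self):
        return len(self.forests_)

    @property
    def n_trees_(self):
        return self.forests_[0].n_estimators
```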
### kneighbors

We will likely need to introduce a new method with the same signature as estimators derived from `TransformedKNeighborsMixin`, i.e. `def kneighbors(self, X=None, n_neighbors=None, return_distance=True)`. The method would run `X` through `self.forests_`, capture a node ID matrix for `X`, and then row-wise calculate distances from `self.node_matrix_`. The return value would be the (optional) distances and row indexes of the `n_neighbors`. Note that `n_neighbors` will be set on the initialization of `RandomForestNNRegressor` just like our other estimators, but `kneighbors` can override this by passing a value to `n_neighbors`.
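A sketch of how that might look, assuming the attributes proposed above (not a final implementation):

```python
import numpy as np

def kneighbors(self, X=None, n_neighbors=None, return_distance=True):
    """Sketch of the proposed method (X=None handling omitted for brevity)."""
    n_neighbors = n_neighbors or self.n_neighbors
    # Leaf IDs for the new targets across all q forests: shape (m, q * n).
    target_nodes = np.hstack([forest.apply(X) for forest in self.forests_])
    # Row-wise counts of leaf nodes shared with each reference: shape (m, p).
    matches = np.array(
        [(self.node_matrix_ == row).sum(axis=1) for row in target_nodes]
    )
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    neigh_ind = np.argsort(dist, axis=1)[:, :n_neighbors]
    if return_distance:
        return np.take_along_axis(dist, neigh_ind, axis=1), neigh_ind
    return neigh_ind
```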
### predict

Initially we had thought we might be able to derive from `KNeighborsRegressor` if we use the callable `weights` parameter, because given weights and `y` attributes, the actual calculation of predicted attributes will be exactly the same as in all estimators that derive from `TransformedKNeighborsMixin`. However, in looking at that implementation more closely, `predict` still relies on calculating the n-dimensional distances and then applying the callable function to the distance array. We are probably better off introducing a new `predict` function that is simply `np.average` with a `weights` parameter that could represent the inverse distances calculated from `kneighbors`.
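In other words, something along these lines (a sketch; assumes the `kneighbors` above and a hypothetical `self.y_` array holding the reference `y` values):

```python
import numpy as np

def predict(self, X):
    """Sketch: distance-weighted average of each target's neighbor y values."""
    dist, neigh_ind = self.kneighbors(X)
    weights = 1.0 / dist  # inverse distance, i.e. the shared-node counts
    # self.y_ is assumed to hold the (p, q) reference y values.
    return np.array(
        [
            np.average(self.y_[ind], axis=0, weights=w)
            for ind, w in zip(neigh_ind, weights)
        ]
    )
```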
## Timing

As this estimator deviates substantially from the other estimators already introduced, I propose we first merge `fb_add_estimators` back into `main` and treat this as a separate feature branch.

## Other possible enhancements
As in `yaImpute`, our ordination-based estimators (`GNNRegressor` and `MSNRegressor`) apply dimensionality reduction such that the resultant axes are ordered based on the amount of variation explained. The eigenvalues from these axes are typically used to weight the axes when finding neighbors. As currently written, RF-NN weights all forests the same (as they are captured in a single node matrix). There may be a measure of forest importance (akin to variable importance in the forests themselves) that may provide weights when calculating distances. I've not seen anything to capture this, but also have not looked very closely at all diagnostics from `RandomForestClassifier`.
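Purely speculatively, if such a measure existed (e.g. each forest's out-of-bag score, available when fitting with `oob_score=True`), the distance calculation might weight node matches per forest:

```python
import numpy as np

def weighted_node_distances(target_nodes, node_matrix, forest_weights, n_trees):
    """Speculative sketch: weight node matches by a per-forest importance.

    forest_weights: length-q array, e.g. each forest's oob_score_
    (requires fitting the classifiers with oob_score=True).
    """
    # Expand per-forest weights to one weight per tree: shape (q * n,).
    tree_weights = np.repeat(forest_weights, n_trees)
    # Weighted count of shared leaf nodes instead of a raw count.
    matches = ((node_matrix == target_nodes) * tree_weights).sum(axis=1)
    dist = np.full(matches.shape, np.inf, dtype=float)
    dist[matches > 0] = 1.0 / matches[matches > 0]
    return dist
```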