kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Odd behavior with multivariate splitting rule specification #165

Open rstrawderman opened 2 years ago

rstrawderman commented 2 years ago

I am wondering about the odd behavior of the "mahalanobis" splitting rule specification, as demonstrated below.

library(mlbench)
library(randomForestSRC)
data(BostonHousing)

bh.mreg <- rfsrc(Multivar(lstat, nox) ~ ., BostonHousing, importance = TRUE, splitrule = "mahal")

BostonHousing$zz = cbind(BostonHousing$lstat, BostonHousing$nox)
bh.mreg <- rfsrc(zz ~ ., BostonHousing, importance = TRUE, splitrule = "mahal")
Error in get.grow.splitinfo(formulaDetail, splitrule, hdim, nsplit, event.info) :
  Invalid split rule specified for regression: mahalanobis

ishwaran commented 2 years ago

Mahalanobis splitting only applies to multivariate regression. In your example, because of the way you defined "zz", you are requesting univariate (usual) regression.

For example, the following will work just fine:

bh.mreg <- rfsrc(cbind(lstat,nox)~., BostonHousing, importance = TRUE, splitrule = "mahal")

rstrawderman commented 2 years ago

Thanks. My example inputs the response as a matrix in each case, but I think I understand your response: it is an interpretation issue with the formula, since my second example treats 'zz' as a single variable despite its having 2 columns. But if the response variable has, say, 10 dimensions, having to write out each one explicitly on the left-hand side does seem a little inconvenient.

ishwaran commented 2 years ago

Yes, I think it's an interpretation issue. Remember that whenever you use cbind you are requesting a matrix and not a data.frame.

There are convenient ways to specify the formula as well as parse output from a multivariate forest. You might find the following vignette helpful, especially since it specifically uses Mahalanobis (see Illustration at bottom).

https://luminwin.github.io/randomForestSRC/articles/mvsplit.html
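One way to avoid typing out every response by hand is to build the formula from a character vector of response names. The following is a minimal sketch in base R, not the package authors' recommended approach; the `ynames` vector is a hypothetical choice of responses, and the fit is attempted only if both randomForestSRC and mlbench are installed. The package also appears to ship a helper, `get.mv.formula()`, for the same purpose; check the vignette above for its exact usage.

```r
# Hypothetical response names; replace with the columns of your own data.
ynames <- c("lstat", "nox", "medv")

# Build the string "cbind(lstat, nox, medv) ~ ." and convert it to a formula.
mv.f <- as.formula(paste0("cbind(", paste(ynames, collapse = ", "), ") ~ ."))
print(mv.f)

# Fit a multivariate forest only when the required packages are available.
# Assumption: mlbench supplies the BostonHousing data used in this thread.
if (requireNamespace("randomForestSRC", quietly = TRUE) &&
    requireNamespace("mlbench", quietly = TRUE)) {
  data(BostonHousing, package = "mlbench")
  fit <- randomForestSRC::rfsrc(mv.f, BostonHousing, splitrule = "mahal")
  print(fit)
}
```

Because the responses are named individually inside `cbind()`, the formula parser sees several response variables rather than one matrix-valued column.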

ishwaran commented 2 years ago

What I meant about cbind is that I think you created a list with 15 entries, where entry "zz" is a matrix:

str(BostonHousing)
'data.frame': 506 obs. of 15 variables:
 $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num 6.58 6.42 7.18 7 7.15 ...
 $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : num 1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ b      : num 397 397 393 395 397 ...
 $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ zz     : num [1:506, 1:2] 4.98 9.14 4.03 2.94 5.33 ...

BostonHousing$zz
     [,1]   [,2]
[1,] 4.98 0.5380
[2,] 9.14 0.4690
[3,] 4.03 0.4690
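The same effect can be reproduced without any packages. This is a small illustrative sketch with a made-up data frame (`df` and its columns are hypothetical, not taken from BostonHousing): assigning the result of `cbind()` with `$<-` stores the whole matrix as a single column, so a formula that names it has only one response variable.

```r
# Toy data frame with three ordinary numeric columns (illustrative values only).
df <- data.frame(a = c(1, 2, 3), b = c(0.5, 0.4, 0.3), x = c(10, 20, 30))

# Assigning a cbind() result stores the entire matrix as ONE column named zz.
df$zz <- cbind(df$a, df$b)

ncol(df)          # 4: zz counts once, even though it holds two columns of data
is.matrix(df$zz)  # TRUE
dim(df$zz)        # 3 2

# A formula such as zz ~ x therefore names a single response variable.
all.vars(zz ~ x)  # "zz" "x"
```

This is consistent with the explanation above: `zz ~ .` is read as a univariate request, whereas `cbind(lstat, nox) ~ .` names each response separately.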