kapelner / bartMachine

An R-Java Bayesian Additive Regression Trees implementation
MIT License

Is there a way to access individual trees? #36

Closed bakaburg1 closed 3 years ago

bakaburg1 commented 3 years ago

Hello,

For interpretability reasons, it could be helpful to access individual trees in the model. Is there a way to do this?

Thank you very much!

kapelner commented 3 years ago

What kind of access do you want specifically?

bakaburg1 commented 3 years ago

That's a good question. I guess it depends on how they are stored internally. Libraries like partykit and rpart use nested lists, and I wrote a little helper to turn them into a dataframe of rules. I would imagine there is a specific data object for each tree ensemble, for each MCMC sample, is that right? I don't want you to do extra work for my sake, so if I knew how to access them via R, I would use them in the form in which you store them. If they are stored in Java data structures, though, I wouldn't know how to work with them.

kapelner commented 3 years ago

I have now written a simple wrapper for this. You should pull, build with ant, and then R CMD INSTALL. Here is some sample code:

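# Give the JVM 1500 MB of heap; set this before the package is loaded
options(java.parameters = "-Xmx1500m")
library(bartMachine)
# Pima.te ships with MASS; column 8 (type) is the binary response
data("Pima.te", package = "MASS")
X <- data.frame(Pima.te[, -8])
y <- Pima.te[, 8]
# Fit the model and print its summary
bart_machine <- bartMachine(X, y)
bart_machine
# In-sample confusion table
table(y, predict(bart_machine, X, type = "class"))

# Raw node data for every tree in the 37th Gibbs sample after burn-in
raw_node_data <- extract_raw_node_data(bart_machine, g = 37)
# Root node (with nested children) of the 17th tree
raw_node_data[[17]]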

This pulls the 37th Gibbs sample after burn-in and then pulls the raw info for the 17th tree. The raw info is provided as a list (with nested left/right lists if applicable) and pointers to the raw objects if you want to use .jcall to further inspect them. Learning the lingo takes some getting used to, but you should be rolling pretty quickly.
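
As a rough illustration of working with the nested left/right lists, here is a minimal sketch that walks one tree recursively and collects the leaf predictions. It runs on a mock list so that it does not need a fitted model; the field names (isLeaf, y_pred, left, right) follow the output posted in this thread, the values are made up, and collect_leaf_preds() is a hypothetical helper, not part of bartMachine.

```r
# Mock node list mimicking the shape of one element of raw_node_data.
# Field names follow the output in this thread; the values are invented.
make_leaf <- function(depth, y_pred) {
  list(depth = depth, isLeaf = TRUE, splitAttributeM = NA, splitValue = NA,
       y_pred = y_pred, left = NA, right = NA)
}
mock_tree <- list(
  depth = 0, isLeaf = FALSE, splitAttributeM = 5, splitValue = 0.527,
  y_pred = NA,
  left = make_leaf(1, 0.35),
  right = make_leaf(1, 0.10)
)

# Recursively collect the predicted value of every leaf, left to right.
# Hypothetical helper: only internal nodes are descended into, so the
# NA placeholders under leaves are never touched.
collect_leaf_preds <- function(node) {
  if (isTRUE(node$isLeaf)) {
    return(node$y_pred)
  }
  c(collect_leaf_preds(node$left), collect_leaf_preds(node$right))
}

leaf_preds <- collect_leaf_preds(mock_tree)
print(leaf_preds)
```

If the real objects have the same shape, something like collect_leaf_preds(raw_node_data[[17]]) should work unchanged, though that has not been checked here.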

bakaburg1 commented 3 years ago

Thanks!

kapelner commented 3 years ago

Forgot to include the output example above:

$java_obj
[1] "Java-Object{bartMachine.bartMachineTreeNode@32a1bec0}"

$parent
[1] "Java-Object<null>"

$left_java_obj
[1] "Java-Object{bartMachine.bartMachineTreeNode@22927a81}"

$right_java_obj
[1] "Java-Object{bartMachine.bartMachineTreeNode@78e03bb5}"

$depth
[1] 0

$isLeaf
[1] FALSE

$sendMissingDataRight
[1] FALSE

$n_eta
[1] 332

$string_id
[1] "32a1bec0"

$is_stump
[1] FALSE

$string_location
[1] "P"

$splitAttributeM
[1] 5

$splitValue
[1] 0.527

$y_pred
[1] NA

$y_avg
[1] NA

$posterior_var
[1] NA

$posterior_mean
[1] NA

$left
$left$java_obj
[1] "Java-Object{bartMachine.bartMachineTreeNode@22927a81}"

$left$parent
[1] "Java-Object{bartMachine.bartMachineTreeNode@32a1bec0}"

$left$left_java_obj
[1] "Java-Object<null>"

$left$right_java_obj
[1] "Java-Object<null>"

$left$depth
[1] 1

$left$isLeaf
[1] TRUE

$left$sendMissingDataRight
[1] FALSE

$left$n_eta
[1] 202

$left$string_id
[1] "22927a81"

$left$is_stump
[1] FALSE

$left$string_location
[1] "L"

$left$splitAttributeM
[1] NA

$left$splitValue
[1] NA

$left$y_pred
[1] 0.3494151

$left$y_avg
[1] 0.2838487

$left$posterior_var
[1] 0.004459861

$left$posterior_mean
[1] 0.255717

$left$left
[1] NA

$left$right
[1] NA

$right
$right$java_obj
[1] "Java-Object{bartMachine.bartMachineTreeNode@78e03bb5}"

$right$parent
[1] "Java-Object{bartMachine.bartMachineTreeNode@32a1bec0}"

$right$left_java_obj
[1] "Java-Object<null>"

$right$right_java_obj
[1] "Java-Object<null>"

$right$depth
[1] 1

$right$isLeaf
[1] TRUE

$right$sendMissingDataRight
[1] TRUE

$right$n_eta
[1] 130

$right$string_id
[1] "78e03bb5"

$right$is_stump
[1] FALSE

$right$string_location
[1] "R"

$right$splitAttributeM
[1] NA

$right$splitValue
[1] NA

$right$y_pred
[1] 0.09574023

$right$y_avg
[1] 0.02535882

$right$posterior_var
[1] 0.006569343

$right$posterior_mean
[1] 0.02165681

$right$left
[1] NA

$right$right
[1] NA

bakaburg1 commented 3 years ago

Thanks!

IMHO, trees can be easily represented by dataframes with one node per row, including all the necessary information such as the number of observations in the node, depth, predicted values, various split statistics, etc. The split rules that identify the nodes can also be stored as an "evaluable" string (e.g. "Sex %in% 'Male' & Age >= 18 & Category %in% c('A', 'B', 'C')") or as a "list in a cell". I like this approach because it's compact, dplyr-friendly for edits, human-readable (e.g. you can order rules by predicted values), and makes it easy to use tree rules to subset data with eval(str2expression(rule)). It can easily be extended to Bayesian trees (adding a column that identifies the MCMC sample) and probably also to additive trees (a column to identify the tree ensemble?).

As I told you, I'm building a small package that parses trees from many methods (partykit and rpart at the moment, bartMachine will follow) into such a general format, and I hope it will become a standard.
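
To make the "evaluable string" idea concrete, here is a small sketch in base R. The data frame, the rule table, and the apply_rule() helper are all invented for illustration; only the rule syntax and the eval(str2expression(rule)) call come from the comment above.

```r
# Toy data; column names mirror the example rule in the comment above.
people <- data.frame(
  Sex = c("Male", "Female", "Male", "Male"),
  Age = c(25, 40, 16, 33),
  Category = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# One row per node, with the rule stored as an evaluable string.
# The predicted values are invented for illustration.
rules <- data.frame(
  node = c("L", "R"),
  rule = c("Sex %in% 'Male' & Age >= 18 & Category %in% c('A', 'B', 'C')",
           "!(Sex %in% 'Male' & Age >= 18 & Category %in% c('A', 'B', 'C'))"),
  y_pred = c(0.35, 0.10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: evaluate a rule string with the data frame's
# columns in scope, returning a logical mask with one entry per row.
apply_rule <- function(rule, data) {
  eval(str2expression(rule), envir = data, enclos = parent.frame())
}

in_left <- apply_rule(rules$rule[1], people)
in_right <- apply_rule(rules$rule[2], people)
print(people[in_left, ])
```

Evaluating strings runs arbitrary code, so this only makes sense for rule strings you generated yourself.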

kapelner commented 3 years ago

You have the raw information. You can now iterate over it and convert it to whatever format you wish.
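
One way such a conversion might look is sketched below: a recursive walk that emits one data frame row per node together with the rule that reaches it. It runs on a mock tree shaped like the output posted in this thread (values rounded). Two conventions are assumptions rather than anything confirmed here: that splitAttributeM is a zero-based column index (the split value 0.527 is on the scale of ped rather than bmi, which hints at this), and that rows with a value at or below splitValue go left. flatten_tree() is a hypothetical helper, and sendMissingDataRight is ignored for brevity.

```r
# Mock tree in the shape of the output shown in this thread (values rounded).
leaf <- function(loc, n, y_pred) {
  list(isLeaf = TRUE, depth = 1, string_location = loc, n_eta = n,
       splitAttributeM = NA, splitValue = NA, y_pred = y_pred,
       left = NA, right = NA)
}
mock_tree <- list(
  isLeaf = FALSE, depth = 0, string_location = "P", n_eta = 332,
  splitAttributeM = 5, splitValue = 0.527, y_pred = NA_real_,
  left = leaf("L", 202, 0.349), right = leaf("R", 130, 0.096)
)

# ASSUMPTIONS (not confirmed in this thread): splitAttributeM is a
# zero-based column index, and rows with x <= splitValue go left.
flatten_tree <- function(node, feature_names, rule = "TRUE") {
  row <- data.frame(
    location = node$string_location, depth = node$depth, n = node$n_eta,
    is_leaf = node$isLeaf, y_pred = node$y_pred, rule = rule,
    stringsAsFactors = FALSE
  )
  if (isTRUE(node$isLeaf)) {
    return(row)
  }
  feature <- feature_names[node$splitAttributeM + 1]
  left_rule <- sprintf("%s & %s <= %s", rule, feature, node$splitValue)
  right_rule <- sprintf("%s & %s > %s", rule, feature, node$splitValue)
  rbind(
    row,
    flatten_tree(node$left, feature_names, left_rule),
    flatten_tree(node$right, feature_names, right_rule)
  )
}

# Predictor names of MASS::Pima.te, in column order.
feature_names <- c("npreg", "glu", "bp", "skin", "bmi", "ped", "age")
nodes <- flatten_tree(mock_tree, feature_names)
print(nodes)
```

With a fitted model, the same function could presumably be applied to each element of raw_node_data, adding columns for the tree index and the Gibbs sample g.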

bakaburg1 commented 3 years ago

Yes, sure. I just wanted to share a general idea about tree structures. Thanks