martin-borkovec / ggparty

147 stars 14 forks source link

Terminal node displaying coefficients and t-ratios in table-like way #42

Closed alvarogutyerrez closed 3 years ago

alvarogutyerrez commented 3 years ago

Hi Martin, first of all, thanks for such an amazing visualization resource!

I am trying to plot a partykit object together with the parameters and t-ratios in a table-like way, and I was wondering if it is possible to use ggparty to do it.

I am working with something that looks like the following:

image

It was evident problems because the names are outside the boxes and they are not even aligned.

Here is the replication code:

library("partykit")

set.seed(1234L)
data("PimaIndiansDiabetes", package = "mlbench")
## a simple basic fitting function (of type 1) for a logistic regression
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)}

## Long name regressors
PimaIndiansDiabetes$looooong_name_1 <- rnorm(nrow(PimaIndiansDiabetes))
PimaIndiansDiabetes$looooong_name_2 <- rnorm(nrow(PimaIndiansDiabetes))
## Short name regressor
PimaIndiansDiabetes$short_name <- rnorm(nrow(PimaIndiansDiabetes))

## set up a logistic regression tree
pid_tree <- mob(diabetes ~ glucose        + 
                  looooong_name_1 +
                  looooong_name_2 +
                  short_name      | 
                  pregnant + pressure + triceps + insulin +
                  mass + pedigree + age, data = PimaIndiansDiabetes, fit = logit)

## Summary function from: https://stackoverflow.com/questions/65495322/partykit-modify-terminal-node-to-include-standard-deviation-and-significance-of/65500344#65500344
mysummary <- function(info, digits = 2) {
  n <- info$nobs
  na <- format(names(coef(info$object)))
  cf <- format(coef(info$object), digits = digits)
  se <- format(sqrt(diag(vcov(info$object))), digits = digits)
  t <- format(coef(info$object)/sqrt(diag(vcov(info$object))) ,digits = digits)

  c(paste("n =", n),
    paste("Regressor","beta" ,"[", "t-ratio" ,"]"),
    paste(na, cf, "[",t,"]")
  )
}

#plot tree
plot(pid_tree,
     terminal_panel = node_terminal,
     tp_args = list(FUN = mysummary,fill = c("white")),
     gp = gpar(fontsize = 10)) 

My desired output is something like:

image

Thank you in advance!

PS: This comes from a cross-posted question in Stackoverflow

martin-borkovec commented 3 years ago

sorry, the email notification for your post got spam-filtered...

sure. all the model info is accessible through the info_list (which is a list of all the nodes' models' info entries ;) ) so you just have to rewrite your summary function a bit, set the font to a monospaced one and you are good to go.

you'll have to play around a bit with the label font size and the plot limits to get a good result.

library("partykit")
#> Loading required package: grid
#> Loading required package: libcoin
#> Loading required package: mvtnorm

set.seed(1234L)
data("PimaIndiansDiabetes", package = "mlbench")
## a simple basic fitting function (of type 1) for a logistic regression
logit <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)}

## Long name regressors
PimaIndiansDiabetes$looooong_name_1 <- rnorm(nrow(PimaIndiansDiabetes))
PimaIndiansDiabetes$looooong_name_2 <- rnorm(nrow(PimaIndiansDiabetes))
## Short name regressor
PimaIndiansDiabetes$short_name <- rnorm(nrow(PimaIndiansDiabetes))

## set up a logistic regression tree
pid_tree <- mob(diabetes ~ glucose        + 
                  looooong_name_1 +
                  looooong_name_2 +
                  short_name      | 
                  pregnant + pressure + triceps + insulin +
                  mass + pedigree + age, data = PimaIndiansDiabetes, fit = logit)

myggsummary <- function(info, digits = 2) {
  summary_vector <- character(length(info))
  for(i in seq_along(info)) {
    n <- info[[i]]$nobs
    na <- format(names(coef(info[[i]]$object)))
    cf <- format(coef(info[[i]]$object), digits = digits)
    se <- format(sqrt(diag(vcov(info[[i]]$object))), digits = digits)
    t <- format(coef(info[[i]]$object)/sqrt(diag(vcov(info[[i]]$object))) ,digits = digits)
    summary_vector[i] <- paste(paste("n =", n),
                               paste("Regressor","beta" ,"[", "t-ratio" ,"]"),
                               paste0(na, cf, "[",t,"]", collapse = "\n"), sep = "\n"
    )
  }
  summary_vector
}

library(ggparty)
#> Loading required package: ggplot2
ggparty(pid_tree) +
  geom_edge() +
  geom_edge_label() +
  geom_node_label(line_list = list(aes(label = splitvar),
                                   aes(label = paste("p =", formatC(p.value, format = "e", digits = 2)))),
                  line_gpar = list(list(size = 20),
                                   list(size = 12)),
                  ids = "inner",
                  show.legend = FALSE) +
  geom_node_label(aes(label = myggsummary(info_list)),
                  ids = "terminal",
                  family = "mono",
                  size = 2) +
  coord_cartesian(ylim = c(0.45, 1.05), xlim = c(-0.1, 1.1))

Created on 2021-01-14 by the reprex package (v0.3.0)

alvarogutyerrez commented 3 years ago

Brilliant, Thank you a lot!