markmfredrickson / RItools

Randomization inference tools for R
GNU General Public License v2.0
17 stars 11 forks source link

switch plot.xbal to ggplot2 #84

Open josherrickson opened 7 years ago

josherrickson commented 7 years ago

Right now plot.xbal has a lot of issues (see eg #82 #80 and another unreported issue that vertical margins look wonky). Have we considered moving to ggplot2 instead of base R plotting? It would address a lot of the concerns immediately. We can get 90% of the way there very quickly:

library(optmatch)
library(RItools)
data(nuclearplants)

fm <- fullmatch(pr ~ ., data = nuclearplants)

bal <- xBalance(pr ~ . + strata(fm), data = nuclearplants)

x <- as.data.frame(RItools:::prepareXbalForPlot(bal))

library(tidyverse)
x <- rownames_to_column(x)
x <- gather(x, strata, values, -rowname)

# x<- mutate(values = abs(values)) # enable for `abs = TRUE`.

ggplot(x, aes(y = rowname, x = values, color = strata, shape = strata)) + 
  geom_vline(xintercept = 0) +
  geom_point() +
  theme(legend.position = "bottom")

rplot03

markmfredrickson commented 7 years ago

I would support this. I always wanted to like lattice graphics, but could never really get them to do what I wanted so I went with base on the last rewrite. As I recall, I think the big reason was I wanted to group dummies from factor variables. If we can do that, with ggplot, then I'm all for it.

josherrickson commented 7 years ago

ggplot, by default, is sorting the y-axis alphabetically, so as long as you don't have oddity like a variable named "XYc" and a factor variable named "XY with levels "a", "b" and "d", it will group all factors together.

jwbowers commented 7 years ago

I would also support this as long as the factor grouping could happen nicely. I agree with wanting to like but not really liking lattice graphics. I might advocate +theme_bw() just on the basis that it uses less ink (maybe there is a strong psycho-visual argument in favor of gray backgrounds?).

benthestatistician commented 7 years ago

@markmfredrickson @rstudley FYI I think we're going to need to address this issue in the process of RCT work on EE. (Everybody else, sorry for being cryptic.)

josherrickson commented 7 years ago

@benthestatistician To un-cryptic that a bit, are you implying this will be handled elsewhere and we can close this issue? Or that it would be beneficial to move ahead with this idea?

rstudley commented 7 years ago

Ben, I’m not sure why you included me here. Do you need my input?

benthestatistician commented 7 years ago

@josherrickson I wasn't intending to imply anything one way or another about where the issue will be handled, rather to inform @markmfredrickson and @rstudley that the work being discussed under this thread connects to another stream of upcoming work that they (and I) will need to plan for. I just (cryptically) linked this issue to another one, leaving open the question of who should grab this issue and run with it.

The two issues are

  1. Should RItools's plot.xbal start fresh with a ggplot2 implementation, or fix up the existing base graphics implementation?
  2. Should (Mark's, Roger's and my) EE project continue to maintain and extend the existing base R based plot.xbal? My own answers would be, for the time being, yes and yes, with the EE project seeing now things go with the ggplot2 based function before switching over from what it's got.

In support of my yes to (1), I think that switching over to ggplot2 will make the plotting function easier to maintain. I defer to Jake and others on aesthetics, so if he's happy (after we accept his +theme_bw() amendment, which I support), then I'm happy too. I understand Mark's main concern to have been (largely) addressed. So there's nothing in the way of work getting started on this. In support of my yes to (2), Mark is currently responsible for implementation of EE's plotting, and I'm not sure it makes sense to ask him immediately set aside the work he's already done; also he seems to be the least familiar with ggplot2 of all of us. Rather, it seems to make more sense for work to proceed on these two streams in parallel, at least for the time being.

@josherrickson would you be willing to start us off with a new branch implementing a ggplot2-based plot.xbal? There needn't be any promise of completeness, just enough to play with. I'd like to play with it and potentially contribute; perhaps @jwbowers could do the same.

When Mark turns his attention back toward EE implementation, with @rstudley's blessing might ask him to contribute a little (<=5hrs) to whatever may remain. Even if these contributions won't necessarily get incorporated into EE immediately, I think the potential savings in coder time for the EE are sufficiently large that having Mark devote a bit of his time on that project to this effort is likely to save money for the EE project, on balance. (After an hour or two of working on the new implementation, he may well find that with a couple more hours spent adapting this work to EE he could spare himself another 10 hours that would have otherwise had to be spent.)

rstudley commented 7 years ago

Sounds like you want me to agree to up to five hours of Mark’s time to invest in a potential graphical improvement for EE. I agree, on the very mild conditions that:

  1. Mark agrees that this is more likely than not to benefit EE, and

  2. Mark’s priority remains dealing with any issues that arise due to the EE hosting migration.

josherrickson commented 7 years ago

Threw together a quick function here: 387950c7.

data(nuclearplants)

library(optmatch)
fm <- fullmatch(pr ~ ct + cost + t1 + t2, data = nuclearplants)

bal <- balanceTest(pr ~ ct + cost + t1 + t2 + strata(fm), data = nuclearplants)

plot <- plotggxbal(bal, color = c('red', 'green'),
                   strata.labels = c("fm" = "Matched", 
                                     "Unstrat" = "Unmatched" ),
                   variable.labels = c("ct" = "Cooling Tower",
                                       "cost" = "Cost",
                                       "t1" = "Time 1",
                                       "t2" = "Time 2",
                                       "(element weight)" = "(element weight)"),
                   legend.position = "bottom",
                   legend.title = "Strata",
                   var.order = c("Time 2", "Cost", "Cooling Tower",
                                 "(element weight)", "Time 1"))

plot

rplot

Notes: 1) I wrote this as a standalone function for now to postpone needing to think about how this will interact with the current plot.xbal. 2) I used some other tidyverse packages (tidyr, dplyr, tibble) to simplify some code. They are not required if those dependencies are undesired; they can be replaced by base R data.frame hacking. 3) I'm not familiar with the redesigned balanceTest call; it was producing the (element weight) object that I couldn't immediately see how to remove. I left it in, I'm sure there's some easy solution to remove it. 4) The returned object is a gg object so it can be further tweaked, e.g. @jwbowers

library(ggplot2)
plot + theme_bw()

You can even override parts which were defined in the function.

plot + theme_bw() + theme(legend.position = "right")
benthestatistician commented 7 years ago

👍 :1st_place_medal: Thanks @josherrickson !

I don't think the dependencies should be a problem. (Even in the EE, Mark's done some nice work to make it easier to keep an R installation current.)

benthestatistician commented 7 years ago

Would there be a reasonably easy way to represent groupings of variables on these plots, by adding

  1. Vertical spacing between defined groups of variables, and/or
  2. Variable group header labels, along the y-axis?

It would be something along lines of what you see here:

balplot-withsense-large-n

(Ignore "Outcome sensitivity" stuff at the bottom, at least for now.)

(Aside: current plans for #85 call for renaming "(element weights)" and removing it most of the time but not all of the time.)

josherrickson commented 7 years ago

I can think of two ways to do that.

1) Create "Reading", "Math", etc, variables whose values are NA. Would have to do some hijinks with factors and strings because it doesn't look like geom_point supports categorical y-axis values.

2) Assign each variable a y-location (e.g. "Male" would be 1, "Female" 2, then skip to 6 for "Migrant, 2 yrs") and then manually create the y-axis labels, including label "Gender" at 4 even though it has no values.

I think 2 is more likely to be an easier implementation, but I'd be hesitant to do it - anyway I can think of actually implementing it would be fragile.

josherrickson commented 7 years ago

I rethought 2., and think it might be feasible. I added some preliminary work to the branch in f2caf76. What remains is associating each "variable" (in quotes because the variables now include the group labels) with a y-value. Perhaps something like adding additional spacing above each group label?

Edit: The reason I rethought it was that ggplot is less fickle about plotting region and margins than base R - it should handle things properly in general.

benthestatistician commented 7 years ago

@josherrickson in an offline conversation you mentioned that tests were failing on the master branch in a way that was blocking your progress. When I run the tests on the master branch, the only failure I'm seeing is

Failed -------------------------------------------------------------------------
1. Failure: preparing xbalance objects for plotting, includes groups (@test.plot.xbal.R#216) 
all(...) isn't true.

more specifically, here:

   xbp <- RItools:::prepareXbalForPlot(xb)
   grps <- attr(xbp, "groups")
   expect_true(all(grps[!is.na(grps)] %in% c("x2", "x3", "x1:x2", "x2:x3", "x1:x3", "x1:x2:x3")))

Can you confirm that this is the trouble you have in mind?

(It happens that I don't have an RSVGTips on this computer, so it's skipping those tests, with a warning. If you're seeing issues related to RSVGTips, I'd encourage you to set them aside if possible.)

josherrickson commented 7 years ago

I get that error, an additional RSVGTips error, and several warnings. The warnings may be specific to my system.

Warnings -----------------------------------------------------------------------
1. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#154) - unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found

2. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#170) - unable to load shared object '/Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so':
  dlopen(/Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so, 6): Library not loaded: /opt/X11/lib/libcairo.2.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/library/grDevices/libs//cairo.so
  Reason: image not found

3. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#170) - failed to load cairo DLL

4. Issue 21: Cairo/pango errors when running plot.xbal (@test.plot.xbal.R#173) - cannot remove file '/var/folders/k0/pq3644jx7v91201_5_sm73bh0000gn/T//RtmpOpdxyV/filee3c364f9abf3', reason 'No such file or directory'

Failed -------------------------------------------------------------------------
1. Failure: preparing xbalance objects for plotting, includes groups (@test.plot.xbal.R#216) 
all(...) isn't true.

2. Error: Plotting using RSVGTips (@test.plot.xbal.R#243) ----------------------
subscript out of bounds
1: .handleSimpleError(function (e) 
   {
       e$call <- sys.calls()[(frame + 11):(sys.nframe() - 2)]
       register_expectation(e, frame + 11, sys.nframe() - 2)
       signalCondition(e)
   }, "subscript out of bounds", quote(xb$results[, "std.diff", 2])) at /Users/josh/repositories/r/ritools/tests/testthat/test.plot.xbal.R:243
2: eval(code, test_env)
markmfredrickson commented 6 years ago

As per a conversation with Ben, we're going to use the ggplot function that Josh came up with for the results of balanceTest, rather than xBalance. I've been able to use ggplot's "facet" function to get the variable groups working: issue-84-example

On the API, I've copied the existing plot function signature, which had fewer options than Josh had in his function. I did this because with ggplot, the result of our plot function can be manipulated to have different text, colors, etc. I've added some documentation on this point. Here is an example that updates the previous plot with the black and white theme and different text.

plot(bt) + ggplot2::theme_bw() + ggplot2::scale_color_manual(values = c("black", "black")) + ggplot2::labs(color = "My label", shape = "My label"

issue-84-example-bw

I'm going to clean up the other failing tests, and I think we might be able to close this issue.

benthestatistician commented 6 years ago

Exciting progress!

If you give it a data set w/ partial missingness on, say, "var", then balanceTest() should return an array w/ a row "(var)", for missingness on "var".

  1. Does it?
  2. Can we make sure that that row appears adjacent to the "(non-null record)" row?
  3. I had imagined all of these missingness rows being at the bottom. But I can also envision an argument for collecting them at the top. Maybe with an example of what it looks like with multiple non-missingness rows we'll be in a good position to decide which of these alternatives are suitable.
markmfredrickson commented 6 years ago

@benthestatistician There is a blocking issue right now related to missing data in balanceTest (#92). I'll need to figure that out before I can show an example. I think the default behavior will be to create a facet for each variable with missingness and then group each variable with it's missing indicator, but I'll have to double check that.

markmfredrickson commented 6 years ago

Here is an updated version of the plot now that issue #92 was resolved. I fixed a small bug that was causing the missing indicator to appear in a different variable group. As you can see the (X1) column appears along with X1.

issue-84-with-nas

Comments?

benthestatistician commented 6 years ago

Interesting. As noted above my thinking had been to group all the missingness indicator vars together amongst themselves, perhaps as their own variable group. But perhaps we should discuss and consider a bit before going down that route. Here are my thoughts/reactions about the current rendering scheme:

  1. I'm guessing that using facets to create a dedicated group for missingness indicators sounds like a minor extension of what we've got now -- would you agree, @markmfredrickson?
  2. It can happen that multiple variables share a missingness pattern. Our current rule is to name that pattern with reference to the first term in the generating model formula that manifests it: if the model formula was Z ~ X1 + X2 +X3, with X1 and X2 sharing the same missingness pattern, that pattern is tagged as (X1), not (X2). I suspect that the current implementation would display an (X1) row once, beneath X1, and neither (X1) nor (X2) beneath X2. This seems a little misleading to the reader: If we're going to group missingness indicators with the variables they indicate missingness for, then we should be presenting an (X2) row, identical to the (X1) row, but positioned just below X2.
  3. If our practice is to present e.g. (X) just beneath X, it would be better to present the (_non-null record_) row at the bottom rather than at the top. Any ideas about how to neatly engineer this? (Tweaks to the character string identifying it being fair game if they help.)