Better treatment indicator handling, NA values in `match_on.formula`

markmfredrickson commented 7 years ago

Colin Hubbard sent in a bug report based on the following:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> pairmatch(dose ~ Pred, data = tmp)
Error in toZ(z) : Treatment indicator must have exactly 2 levels not 1

At least as a work around, it would make sense to be able to set up an indicator variable, but that fails with NA values in the distances:

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors
> match_on(dose2 ~ Pred)
Error in eval(expr, envir, enclos) : object 'dose2' not found
> match_on(dose2 ~ Pred, data = tmp)
An object of class "DenseMatrix"
         control
treatment  1  2  3  4
        5 NA NA NA NA
Slot "call":
match_on(x = dose2 ~ Pred, data = tmp)

I conjecture that the NA is due to the degenerate case of having only one treated unit, which makes the SD calculation fail when computing the SD of the treated group, but this is worth checking out in more depth.

markmfredrickson commented 7 years ago

I believe my conjecture was right. Including a second treated unit is sufficient to get a valid match. Then the workaround gets the right answer:

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3 # WORKAROUND
> pairmatch(dose2 ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2

benthestatistician commented 7 years ago

Interesting. Thanks to Colin Hubbard for the tip.

I did a recover() on the example, here's what I'm seeing inside of makedist(z, data, compute_mahalanobis, within), the immediate parent of the failing call to toZ():

Browse[1]> ls()
[1] "data"       "distancefn" "within"     "z"         
Browse[1]> z
   1    2    3    4    5 
TRUE TRUE TRUE TRUE TRUE

This seemed to me to suggests the error may be somewhere different from where @markmfredrickson was thinking. (But then Mark's last post popped up 15 seconds ago, so maybe I'm just not following...)

josherrickson commented 7 years ago

> as.logical(c(2, 3))
[1] TRUE TRUE

setMethod("toZ", "numeric", function(x) as.logical(x))

Edit: Oops just looked at Mark's commit which already solved it!

markmfredrickson commented 7 years ago

@josherrickson is correct. See 4955bb9 for my changes to that line of code.

josherrickson commented 7 years ago

@markmfredrickson Might it not be better to enforce that numeric treatment vectors be 0/1? Either that or make it clear in the documentation that with other values we assume the lower is control.

benthestatistician commented 7 years ago

Interesting point by @josherrickson . If Colin Hubbard is watching, I'd appreciate hearing about his usage scenario.

Ordinarily I'd think that if some level of dose other than 0 were to serve as a control condition it would be best to have the function insist that that level be explicitly specified. Otherwise you might find that adding or removing a single observation totally changes the matching structure in unexpected ways, if the addition or removal of that obs resulted in a change to the lowest observed value of the treatment variable.

Perhaps I'm missing another usage scenario. Absent new info to that effect, I'm inclined to think Josh has a good point.

benthestatistician commented 7 years ago

I agree with @josherrickson, with the amendments that:

I'd propose that the coercion mechanism applied to numeric Z's be function(x) {x>0}, rather than as.logical.
The condition that we enforce be that there has to be at least one <= 0 entry and at least one >= 0.

benthestatistician commented 7 years ago

Not clearly germane to the issue at hand, but if anyone understood why this errored out

> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors

but this did not --

> match_on(dose2 ~ Pred, data = tmp)

-- I'd be interested to know. (Also your thoughts about fixability, if any.)

josherrickson commented 7 years ago

I don't see that error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                   Pred = c(0.08494142, -0.02784262,
+                            -.05284574, -0.07428686,
+                            0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
   1    2    3    4    5 
 1.2 <NA>  1.1  1.1  1.2

josherrickson commented 7 years ago

@benthestatistician Are you implying that c(0, 1, 2) is a valid treatment vector indicating that observations 2 and 3 are treatment? Or are you saying that the treatment vector is limited to two unique non-NA values, one of which must be 0 and the other must be > 0?

benthestatistician commented 7 years ago

Weird, I'm certainly getting the error.

> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3), 
+                  Pred = c(0.08494142, -0.02784262,
+                           -.05284574, -0.07428686,
+                           0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2  ~ Pred, data = tmp)
Error in pairmatch(dose2 ~ Pred, data = tmp) : 
  could not find function "pairmatch"
> packageVersion("optmatch")
[1] ‘0.9.7’
> R.version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          4.0

@josherrickson are you using optmatch 0.9-7 or development optmatch? Maybe the problem already got fixed. (If so, good to note so that we don't re-break it.)

Getting back to the main issue, I think that c(0,1,2) should be a valid treatment vector indicating that observations 1 and 2 are treatment while observation 0 is a control. The use case I have in mind is that there are multiple flavors of treatment (perhaps different dosages) and a single type of control. The analysis stage might separate out the treatment flavors (dosages), but for matching we're willing to glom them all together.

josherrickson commented 7 years ago

I'm going to push back a bit on this. My primary view is that we should accept two (and only two) versions of the treatment vector: 0/1/NA and TRUE/FALSE/NA. (I'm willing to concede accepting "0"/"1"/"NA" as well, though I think we should not).

My rationale is that by only accepting a very strict set of treatment vectors, we force users to be explicit in their handling of treatment and hopefully avoid surprises. While c(0, 1, 2) may make sense as a treatment vector in certain contexts, I would hypothesize that in most contexts, it indicates an error in the data (or the user's understanding of the data)[1]. By forcing the user who wants to use c(0, 1, 2) to define treatment2 = treatment > 0, we add a trivial amount of work, but protect all other users.

[1] The most common example I see of this is from survey data, where 1 = "no", 2 = "yes", 3 = "did not respond". If a client does the common trick of treatment = qualtricsdata - 1, they'll end up with a vector with some 2's, and perhaps never realize it, thus bundling "yes" and "did not respond" into the treatment group.

benthestatistician commented 7 years ago

Excellent points, @josherrickson . I'm persuaded.

In effect, this is an API change, so let's be sure to provide those users who'll all of a sudden be getting new errors with a clear trail of crumbs toward a solution, either through the text of the error message, additional explanation in a help page that the error message points to, or both.

benthestatistician commented 7 years ago

It occurred to me that we should update makeOptmatch() so that the contrast.group attribute being set here isn't just TRUE's and FALSE's but also, where appropriate, NAs. @josherrickson do you agree?

Not a heavy lift, I think...

josherrickson commented 7 years ago

I agree. Having trouble compiling right now so I can't address it at the moment.

Something like this should work:

cg <- rep(NA, length(names(optmatch.obj)))
cg[names(optmatch.obj) %in% treated] <- 1
cg[names(optmatch.obj) %in% colnames(distance)] <- 0
attr(optmatch.obj, "contrast.group") <- as.logical(cg)

subproblem should be fine, but it should be double-checked as well.

I will take a look when I figure out why my computer is refusing to compile optmatch (I think it is an Rcpp issue:

Error in dyn.load(dllfile) :
  unable to load shared object '/Users/josh/repositories/r/optmatch/src/optmatch.so':
  dlopen(/Users/josh/repositories/r/optmatch/src/optmatch.so, 6): Symbol not found: _optmatch_ismOps
  Referenced from: /Users/josh/repositories/r/optmatch/src/optmatch.so
  Expected in: flat namespace
 in /Users/josh/repositories/r/optmatch/src/optmatch.so
Calls: <Anonymous> -> load_all -> load_dll -> library.dynam2 -> dyn.load

but I'm not sure. I feel like everytime I try and build a package these days, Rcpp modifies R/RCppExports.R and src/RcppExports.cpp without my say-so.), but feel free to make the changes sooner if you're trying to get a new version pushed out.

benthestatistician commented 7 years ago

I agree on each point, @josherrickson . Re compile problems, I think @markmfredrickson reported having similar trouble, then after some effort pushing through it. Mark, is this correct? Can you share some wisdom?

josherrickson commented 7 years ago

My laptop still builds fine. I'll have this updated shortly.

josherrickson commented 7 years ago

Updated in b4908b3. subproblem just needed a proper naming.

josherrickson commented 7 years ago

This has been completed I believe.

benthestatistician commented 7 years ago

Thanks @josherrickson!

markmfredrickson / optmatch

Better treatment indicator handling, NA values in `match_on.formula` #124