Closed markmfredrickson closed 7 years ago
I believe my conjecture was right. Including a second treated unit is sufficient to get a valid match. Then the workaround gets the right answer:
> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3),
+ Pred = c(0.08494142, -0.02784262,
+ -.05284574, -0.07428686,
+ 0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3 # WORKAROUND
> pairmatch(dose2 ~ Pred, data = tmp)
1 2 3 4 5
1.2 <NA> 1.1 1.1 1.2
Interesting. Thanks to Colin Hubbard for the tip.
I did a recover() on the example, here's what I'm seeing inside of makedist(z, data, compute_mahalanobis, within)
, the immediate parent of the failing call to toZ()
:
Browse[1]> ls()
[1] "data" "distancefn" "within" "z"
Browse[1]> z
1 2 3 4 5
TRUE TRUE TRUE TRUE TRUE
This seemed to me to suggests the error may be somewhere different from where @markmfredrickson was thinking. (But then Mark's last post popped up 15 seconds ago, so maybe I'm just not following...)
> as.logical(c(2, 3))
[1] TRUE TRUE
setMethod("toZ", "numeric", function(x) as.logical(x))
Edit: Oops just looked at Mark's commit which already solved it!
@josherrickson is correct. See 4955bb9 for my changes to that line of code.
@markmfredrickson Might it not be better to enforce that numeric treatment vectors be 0/1? Either that or make it clear in the documentation that with other values we assume the lower is control.
Interesting point by @josherrickson . If Colin Hubbard is watching, I'd appreciate hearing about his usage scenario.
Ordinarily I'd think that if some level of dose other than 0 were to serve as a control condition it would be best to have the function insist that that level be explicitly specified. Otherwise you might find that adding or removing a single observation totally changes the matching structure in unexpected ways, if the addition or removal of that obs resulted in a change to the lowest observed value of the treatment variable.
Perhaps I'm missing another usage scenario. Absent new info to that effect, I'm inclined to think Josh has a good point.
I agree with @josherrickson, with the amendments that:
function(x) {x>0}
, rather than as.logical
. Not clearly germane to the issue at hand, but if anyone understood why this errored out
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2 ~ Pred, data = tmp)
Error in s$solution : $ operator is invalid for atomic vectors
but this did not --
> match_on(dose2 ~ Pred, data = tmp)
-- I'd be interested to know. (Also your thoughts about fixability, if any.)
I don't see that error.
> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3),
+ Pred = c(0.08494142, -0.02784262,
+ -.05284574, -0.07428686,
+ 0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2 ~ Pred, data = tmp)
1 2 3 4 5
1.2 <NA> 1.1 1.1 1.2
@benthestatistician Are you implying that c(0, 1, 2)
is a valid treatment vector indicating that observations 2 and 3 are treatment? Or are you saying that the treatment vector is limited to two unique non-NA values, one of which must be 0 and the other must be > 0?
Weird, I'm certainly getting the error.
> tmp <- data.frame(id = rep(1,5), dose = c(rep(2,4),3),
+ Pred = c(0.08494142, -0.02784262,
+ -.05284574, -0.07428686,
+ 0.04335151))
> tmp$dose <- c(2,2,2,3,3)
> tmp$dose2 <- tmp$dose == 3
> pairmatch(dose2 ~ Pred, data = tmp)
Error in pairmatch(dose2 ~ Pred, data = tmp) :
could not find function "pairmatch"
> packageVersion("optmatch")
[1] ‘0.9.7’
> R.version
_
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
@josherrickson are you using optmatch 0.9-7 or development optmatch? Maybe the problem already got fixed. (If so, good to note so that we don't re-break it.)
Getting back to the main issue, I think that c(0,1,2)
should be a valid treatment vector indicating that observations 1 and 2 are treatment while observation 0 is a control. The use case I have in mind is that there are multiple flavors of treatment (perhaps different dosages) and a single type of control. The analysis stage might separate out the treatment flavors (dosages), but for matching we're willing to glom them all together.
I'm going to push back a bit on this. My primary view is that we should accept two (and only two) versions of the treatment vector: 0/1/NA and TRUE/FALSE/NA. (I'm willing to concede accepting "0"/"1"/"NA" as well, though I think we should not).
My rationale is that by only accepting a very strict set of treatment vectors, we force users to be explicit in their handling of treatment and hopefully avoid surprises. While c(0, 1, 2)
may make sense as a treatment vector in certain contexts, I would hypothesize that in most contexts, it indicates an error in the data (or the user's understanding of the data)[1]. By forcing the user who wants to use c(0, 1, 2)
to define treatment2 = treatment > 0
, we add a trivial amount of work, but protect all other users.
[1] The most common example I see of this is from survey data, where 1 = "no", 2 = "yes", 3 = "did not respond". If a client does the common trick of treatment = qualtricsdata - 1
, they'll end up with a vector with some 2's, and perhaps never realize it, thus bundling "yes" and "did not respond" into the treatment group.
Excellent points, @josherrickson . I'm persuaded.
In effect, this is an API change, so let's be sure to provide those users who'll all of a sudden be getting new errors with a clear trail of crumbs toward a solution, either through the text of the error message, additional explanation in a help page that the error message points to, or both.
It occurred to me that we should update makeOptmatch()
so that the contrast.group
attribute being set here isn't just TRUE
's and FALSE
's but also, where appropriate, NA
s. @josherrickson do you agree?
Not a heavy lift, I think...
I agree. Having trouble compiling right now so I can't address it at the moment.
Something like this should work:
cg <- rep(NA, length(names(optmatch.obj)))
cg[names(optmatch.obj) %in% treated] <- 1
cg[names(optmatch.obj) %in% colnames(distance)] <- 0
attr(optmatch.obj, "contrast.group") <- as.logical(cg)
subproblem
should be fine, but it should be double-checked as well.
I will take a look when I figure out why my computer is refusing to compile optmatch (I think it is an Rcpp issue:
Error in dyn.load(dllfile) :
unable to load shared object '/Users/josh/repositories/r/optmatch/src/optmatch.so':
dlopen(/Users/josh/repositories/r/optmatch/src/optmatch.so, 6): Symbol not found: _optmatch_ismOps
Referenced from: /Users/josh/repositories/r/optmatch/src/optmatch.so
Expected in: flat namespace
in /Users/josh/repositories/r/optmatch/src/optmatch.so
Calls: <Anonymous> -> load_all -> load_dll -> library.dynam2 -> dyn.load
but I'm not sure. I feel like everytime I try and build a package these days, Rcpp modifies R/RCppExports.R and src/RcppExports.cpp without my say-so.), but feel free to make the changes sooner if you're trying to get a new version pushed out.
I agree on each point, @josherrickson . Re compile problems, I think @markmfredrickson reported having similar trouble, then after some effort pushing through it. Mark, is this correct? Can you share some wisdom?
My laptop still builds fine. I'll have this updated shortly.
Updated in b4908b3. subproblem
just needed a proper naming.
This has been completed I believe.
Thanks @josherrickson!
Colin Hubbard sent in a bug report based on the following:
At least as a work around, it would make sense to be able to set up an indicator variable, but that fails with NA values in the distances:
I conjecture that the NA is due to the degenerate case of having only one treated unit, which makes the SD calculation fail when computing the SD of the treated group, but this is worth checking out in more depth.