Ah, but similar to the discussion of #66, don't we need to recenter the bootstrap distribution to reflect the null that the model is correctly specified?
e.g. (edited) adjust $\hat{Q}_{b}$ to be $\tilde{Q}_{b} = \max\{0, \hat{Q}_{b} - \hat{Q}\}$?
Sure, let's do that. This is a heuristic procedure at best...so no idea whether either will be well justified theoretically. But recentering sounds more sensible. Is it going to create computational problems relative to no recentering?
Oh not at all. So this is done!
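For reference, the recentering being described might look roughly like this in R (hatQ and bootQ are made-up names for the sample and bootstrap minimum criteria, and the numbers are purely illustrative):

# hatQ : minimum criterion from the original sample (a single number)
# bootQ: vector of minimum criteria from the bootstrap samples
hatQ  <- 0.05
bootQ <- c(0.02, 0.08, 0.11, 0.04)

# Recentering as suggested above: tilde(Q)_b = max{0, Q_b - hatQ}
bootQ.recentered <- pmax(0, bootQ - hatQ)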
@johnnybonney could you test this on a specification in which the minimum criterion is very large, and also on one for which it is very small?
Sure, I will let you know what happens.
I tested this for three specifications, using the Dupas data.
If the null hypothesis is correct specification, shouldn't these p-values be moving in the opposite direction? It seems like we may be currently testing the null of misspecification. (Though maybe I am just confused about what exactly we are testing.)
Oh no, you're right. This was my mistake, I had flipped the inequality. I'll correct this, and let you know once I push (in midst of implementing something else at the moment).
Updated! Let me know if things make more sense now.
Retesting with the same specifications:
For specifications 2 and 3, things make more sense. In the first specification, 0 does not seem right... I believe this stems from the fact that the minimum criteria across all bootstrap estimates are also 0. Then the original criterion is not lower than any of the bootstrap criteria (but it isn't any higher either).
So then mean(origEstimate$audit.minobseq < bootCriterion) will always be 0. Should it be mean(origEstimate$audit.minobseq <= bootCriterion)?
Also, should it actually be mean(0 <= bootCriterion)? I could just be confusing myself here, but right now, if the original minimum criterion was 0.1, and all other minimum criteria were greater than 0.1 but less than 0.2, all elements of bootCriterion (the bootstrap criterion minus the original criterion) would be less than the original criterion, so we would get a p-value of 0 (even though the original minimum criterion was the lowest observed).
Am I making sense? Or am I totally off base?
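To make the < versus <= point concrete, here is a small made-up illustration in R (origQ and bootQ are hypothetical names for the sample and bootstrap minimum criteria):

origQ <- 0                 # sample minimum criterion, as in specification 1
bootQ <- c(0, 0, 0, 0, 0)  # bootstrap minimum criteria, all zero as well

mean(origQ <  bootQ)       # strict comparison: 0, i.e. apparent strong rejection
mean(origQ <= bootQ)       # weak comparison:   1, i.e. no evidence against the model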
That's a likely cause -- should be rejecting only if the TS is strictly greater than the critical value
Sure, I can change it to a weak inequality.
Also, should it actually be mean(0 <= bootCriterion)? I could just be confusing myself here, but right now, if the original minimum criterion was 0.1, and all other minimum criteria were greater than 0.1 but less than 0.2, all elements of bootCriterion (the bootstrap criterion minus the original criterion) would be less than the original criterion, so we would get a p-value of 0 (even though the original minimum criterion was the lowest observed).
Good point, but I'm not sure what the right thing to do is, so I will have to defer to Alex.
I thought it was necessary to recenter the bootstrap distribution, otherwise we may have the problem of never rejecting.
But now that I think about it... does my suggestion invalidate the bootstrap?
i.e. does the bootstrap fail when the bootstrap distribution is generated by taking the max between 0 and original.criterion - bootstrap.criterion?
Just to clarify, I didn't work through how you were computing the p-value here -- I was talking about the rejection rule itself. The rejection rule should be: reject if TS > CV, with a strict inequality. With that rule, if TS = 0, then the p-value is 1, since CV is always >= 0. (TS = test statistic, CV = critical value)
That seems to be a completely different issue from whether we are recentering or not.
Hm, I'm afraid I'm confused now. In the original post, we only wanted to report the p-value. Wouldn't recentering the bootstrap distribution affect the p-value? And in turn, wouldn't that affect whether the user decides to reject the null of correct specification?
Maybe I'm misinterpreting the messages, and we haven't yet concluded on the right way to carry out this test.
The p-value is the smallest level of significance at which you would reject. I'm not sure how you're computing the p-value directly -- there are a few ways, and frankly I get a bit confused every time I have to derive it ;).
So what I was describing was just, for a given significance level alpha, how would you determine rejection? You would have the test statistic TS -- in this case our \hat{Q} from the sample -- and you would have a critical value at level alpha, CV(\alpha) -- in this case the 1 - alpha quantile of the bootstrapped criterions. And you would reject at level alpha if TS > CV(\alpha) -- with the strict inequality being important.
So then the p-value is just the smallest value of alpha for which you get rejection, noting that CV(\alpha) is decreasing in alpha. I think the simplest way to find the p-value is just to look for the largest bootstrapped criterion Q^{\star} for which TS > Q^{\star}. Then determine the smallest alpha such that CV(\alpha) is equal to that Q^{\star}. If TS = 0, then there's no such Q^{\star}, so the smallest alpha such that TS > CV(\alpha) is 1, using the convention that the 0th quantile of any random variable is -Infinity.
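In symbols, a rough summary of that logic (with $CV(\alpha)$ the empirical $1-\alpha$ quantile of the bootstrapped criterions $\hat{Q}_{b}$):
$$p = \inf\{\alpha : TS > CV(\alpha)\} \approx \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\{\hat{Q}_{b} \geq TS\},$$
so if $TS = 0$ and every $\hat{Q}_{b}$ is 0, the sum equals $B$ and the p-value is 1.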
If the effect of recentering is just to shift TS and CV(\alpha) by the same amount for any alpha, then it's not going to affect the rejection rule or the p-value. Is that what the re-centering is doing? Regardless, whether we recenter or not is a distinct issue from how to compute the p-value. So I was a bit confused with your previous post because it seemed to be conflating the two.
Ah yes, I'm sorry, I was unclear. I am following the procedure you outline, except for one key step: I shift the bootstrap distribution, effectively shifting the CV. But the TS remains unchanged! This must then affect the p-value, right? Alternatively, I could shift TS, while leaving CV unchanged.
But when carrying out hypothesis testing via bootstrap, mustn't there always be some 'recentering' somewhere? That is, isn't the goal to construct a distribution for the TS under the null hypothesis? Without recentering, it seems like the bootstrap distribution will reflect the DGP. And if the null isn't true, then the bootstrap distribution shouldn't reflect the null either.
For example, suppose I have a reasonably long sequence of random variables all drawn from $N(\mu, \sigma)$, where $\mu \neq 0$. I estimate $\mu$ by taking the sample average, and let's suppose it's not 0. I decide to use the non-parametric bootstrap to test if $\mu = 0$, and I take my initial estimate as the TS. If I generate my bootstrap distribution simply by resampling and re-estimating, then the bootstrap distribution should be centered around TS by construction, and not 0. I suspect I'd get p-values of around 0.5 if I performed the test this way. In contrast, if I shift the TS, or I shift CV by shifting the bootstrap distribution---but not both---then I'd get a different p-value.
Is this incorrect?
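A quick simulation in R along these lines seems to bear that out (purely illustrative numbers):

set.seed(1)
x  <- rnorm(500, mean = 2, sd = 1)      # true mu = 2, so the null mu = 0 is false
TS <- mean(x)                           # test statistic: the sample average
boot <- replicate(1000, mean(sample(x, replace = TRUE)))

# No recentering: the bootstrap distribution is centered at TS, not at 0,
# so the "p-value" hovers around 0.5 however false the null is.
mean(abs(boot) >= abs(TS))

# Recentering (shifting the bootstrap distribution toward the null):
# now the p-value is essentially 0 and the false null is rejected.
mean(abs(boot - TS) >= abs(TS))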
No I think that's correct and that's the right logic too
My comments today were completely about the simple matter of whether the rejection rule (and p-value calculation) have a > or a >=. This seems like the most likely/simplest explanation as to why John was getting a p-value of 0 when the criterion was 0 -- since if the model really is "very correctly specified" then the sample and all of the bootstrap criterions are 0. So if you were using a weak inequality, you would get a p-value of 0 instead of the correct p-value, which is 1.
Ah okay. Sorry for the confusion. I agree what John pointed out is likely to be the problem, and I've made the correction. John, let us know how it looks once you run it again. And if it still looks bad, then perhaps you can send me the example so I can debug more rigorously.
Regarding John's second comment:
if the original minimum criterion was 0.1, and all other minimum criteria were greater than 0.1 but less than 0.2, all elements of bootCriterion (the bootstrap criterion minus the original criterion) would be less than the original criterion, so we would get a p-value of 0 (even though the original minimum criterion was the lowest observed).
This seems like it could be a problem when using the bootstrap for testing in general... But isn't this just an example of a type I error?
That has to do with the recentering...which I'm not sure whether we should or should not here. I don't think the intuition from #66 necessarily carries over, since here we are "underidentified" not overidentified.
If we didn't recenter, then the p-value would be 1, which suggests perhaps we shouldn't recenter. What do you think? We are a bit in no-man's land here in that none of these procedures are known to be theoretically justified...just trying to think of something that is reasonable.
Ha, I am less certain than you are.
I was wondering, would we prefer to be conservative? Also, what would "conservative" mean in this setting? If a "conservative test" implies a test that is more likely to prevent the researcher from drawing conclusions, then would that mean we want to reject the model more frequently?
I suppose one way out would be to let the user decide on whether or not to recenter, or to present both.
Conservative would mean erring on the side of not rejecting the null. (That's the usual definition -- here the null is that the model is correctly specified.) As we discussed above, we are going to be less likely to reject if not recentering, since recentering lowers the critical value. So the conservative thing to do would be to not recenter. John's example of the potential paradoxical result of recentering is also concerning to me. Add to that the lack of a clear rationale for recentering in the "underidentified case" and I think the right thing to do is not recenter.
Sound reasonable? If so, let's go with not recentering.
To interrupt this re-centering discussion and update on how the p-value currently looks with the weak inequality --
Much better.
(Edit: by "done" I meant I had removed the recentering---the concerns above make sense)
Done! And pushed. John, if it's not too much trouble, may you please run that test one more time? I'm curious to see what happens to the p-value in the third example you have.
Just ran the test, with the updated code (that removed recentering):
This is indeed a very conservative test...
In principle we can't know how conservative it is unless we know the DGP...but it would be nice to know that we can actually reject when the specification gets sufficiently stripped down. @johnnybonney do you have an example of a model where we do reject at conventional levels?
Good point. I have only tested this for those three specifications. I can try out more to see if we ever reject at the default level.
should be just a matter of reducing the number of parameters in the specification and/or adding on strong shape constraints
I have tried a few different things with the Dupas data (including constant MTRs), but I have not yet found a model where we reject (the lowest p-value I have seen is still 0.58).
I also tried specifying the MTRs to be bounded below by 3 (we know that m1 is bounded above by 1, and m0 = 0, so this model is truly quite misspecified). In this case, I get a minimum criterion of 35, and the p-value is 0.72.
Hmm...did a bit of thinking on this.
Josh, if it's not too much trouble, could you try the following changes to the bootstrap criterion?
1) The current moment functions are the means of $g(X_{i}^{\star}, \theta)$, where $X_{i}^{\star}$ is a bootstrap draw of the data. Let's make these instead the means of $g(X_{i}^{\star}, \theta) - g(X_{i}, \theta)$, where $X_{i}$ is the original data. This is somewhat like the recentering in #33, except it is $g(X_{i}, \theta)$, so we are not putting in a $\hat{\theta}$. That comes in the next step...
2) Instead of minimizing the bootstrap criterion over the entire space, we want to minimize it subject to $Q(\theta) \leq \hat{Q}(1 + \tau)$, just as we do for estimation. (Maybe you were doing this already?)
The critical value is still the 1-\alpha quantile of the bootstrapped criterions, just that the criterion now incorporates the above two changes. The test statistic remains as before. The rejection rule also remains as before.
This procedure roughly corresponds to "Test RS" in [this paper](https://www.sciencedirect.com/science/article/pii/S0304407614002577) by Bugni, Canay and Shi. It seems to be roughly following the recentering intuition we discussed in #66, although I don't get the exact analogy (and the authors don't make an attempt to draw it).
Let me know if it makes sense. I can try to write it in TeX if that is easier.
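For concreteness, one way to write the modified bootstrap criterion implied by 1) and 2) above (only a sketch; the norm $\|\cdot\|$ is a placeholder for however the criterion actually aggregates the moment deviations):
$$\hat{Q}_{b}^{\star} = \min_{\theta \,:\, Q(\theta) \leq \hat{Q}(1+\tau)} \left\| \frac{1}{n}\sum_{i=1}^{n} \left[ g(X_{i}^{\star}, \theta) - g(X_{i}, \theta) \right] \right\|,$$
where the $X_{i}^{\star}$ are the observations in bootstrap sample $b$, $Q(\theta)$ and $\hat{Q}$ are the sample criterion and its minimum, and the critical value is the $1-\alpha$ quantile of $\{\hat{Q}_{b}^{\star}\}_{b=1}^{B}$.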
I may be getting the terminology mixed up, so just to clarify:
If the last two interpretations are correct, I'm unclear on what Step 2 does. That is, the unconstrained minimizer of $Q(\theta)$ for a given bootstrap sample will either satisfy the new constraint or not. If it does, the constraint seems unnecessary. If it doesn't, then the problem is infeasible. Am I incorrect in reading the constraint as an upper bound on a minimization problem?
Your points 1) and 2) are correct.
For your point 3), the $Q(\theta)$ should be the sample criterion, as a function of $\theta$. Thus, there is always a $\theta$ such that $Q(\theta) \leq \hat{Q}(1+\tau)$, just as in our estimation procedure. (This is in fact the same constraint as in our estimation procedure.)
In words, we are constraining the problem to minimizers (or $\tau$--approximate minimizers) in the sample, but we are minimizing an objective that is centered as described above.
Does that make more sense?
Ah, okay, I think I had confused myself again. I thought both items appeared when determining the criterion.
So let me know if this is right:
Item 1 is what addresses the problems we're having with our specification test.
And item 2---which we are indeed doing, controlled by obseq.tol---appears only when trying to actually estimate the bounds.
Of course, I will still have to make some changes here when implementing item 1, since the criterion will now be centered.
Not sure I understand what you mean. Both items 1 and 2 (in my post) are part of the specification test. In principle, we could run this test without estimating the bounds at all.
I'm sorry, I'm the one who is not understanding.
Here is where I'm getting confused:
Instead of minimizing the bootstrap criterion over the entire space, we want to minimize it subject to $Q(\theta) \leq \hat{Q}(1 + \tau)$,
We established that $\hat{Q}$ is the criterion from the original sample. So this value is fixed for each bootstrap. Since this is obtained from the original sample, it cannot be recentered.
$\tau$ is a tuning parameter, so is also fixed.
Which leaves $Q(\theta)$, which you called the "sample criterion". So this is the criterion function for the sample we have at hand? And for each bootstrap, we recenter those $g$ functions in order to construct this $Q(\theta)$ object. The reason I'm confused is because I thought $Q(\theta)$ is the objective we are trying to minimize, so it seems strange to place an upper bound on this.
It seems like my misunderstanding lies in the third bullet point. So which part(s) are wrong?
there are two Q's here
we are minimizing the bootstrap Q, which is recentered
subject to the sample Q being smaller than Qhat times (1 + tau)
sorry I am on my phone so I can't be more expressive with the notation. if it doesn't make sense I can Tex it up in a couple of days.
Ahh! That did it, I understand now. Thanks so much for walking me through that.
great! sorry for being unclear!
Almost done, but another question came to mind.
Regarding this:
there is always a $\theta$ such that $Q(\theta) \leq \hat{Q}(1+\tau)$, just as in our estimation procedure.
To ensure this, mustn't I also keep the audit grid identical across all bootstraps? If so, my concern is that the inference no longer accounts for the randomness of the audit procedure.
Yes good point, and that's correct. I don't think it's a first-order concern though. Presumably, most of the time our audit procedure will work correctly in that it will terminate with no violations over the large evaluation grid. So as long as that is the case, there is no randomness, I think. It's true that it may sometimes fail in the bootstraps...but I'm not sure what we can do about that.
Does this affect implementation? I think the key is just starting off the audit procedure with the same seed on each bootstrap run.
Nope, it doesn't affect implementation. Although I think you highlighted another potential mistake I made.
Currently, the fine grid is reconstructed for each bootstrap sample, i.e. the whole estimation procedure is performed for each bootstrap. I'm not sure what setting the same seed for the audits would do in this case, since the bootstrap samples are different, so the fine grid (and coarse grid) will also be different. So I was thinking of the scenario where the fine grid constructed from the original sample is (by chance) much easier to satisfy than the fine grid constructed from a bootstrap sample. And as a result, the $\tau$ threshold (which by default is 0?) may not be enough to guarantee the existence of a solution.
Or did you actually want me to construct the fine grid from the original sample, and use that for all the bootstrap samples?
Or did you actually want me to construct the fine grid from the original sample, and use that for all the bootstrap samples?
This seems like an easy solution that should address all of the above issues right? Any reason not to do that? (Is it difficult to implement? Maybe you have to adjust the loop so it's not completely repeating the same procedure as in the sample.)
Yep, I can use one audit grid for all bootstraps, it's no problem. But doing that is what raises my concern that the bootstrap would fail to account for the fact that the audit grid is randomly chosen, and thus the region where the shape constraints are definitely satisfied is also randomly chosen.
Then again, the audit grid is pretty big by default, and should do a good job approximating the full original sample, as well as the bootstrap samples. That is, maybe there isn't much difference between only using the audit grid from the original sample, versus reconstructing the audit grid in each bootstrap.
So having said all that, if the idea of a single audit grid continues to sound reasonable, then I'll go ahead and implement it.
Ok, let's just use a single audit grid based on the sample. I don't think that the audit grid is a big source of uncertainty now that we've redesigned it. Plus it's not clear that this type of procedure would properly account for that type of uncertainty anyway.
Done! John, could you please run that test one more time to see how things look? Also, in case you didn't see, I changed the repo name to "ivmte". So just use this new repo name when pulling.
Test results:
Below is the output in the third case. I should note that in cases two and three, the bootstrap bounds look quite odd (almost always very close to [0, 1]). I don't know if this is expected or not.
Obtaining propensity scores...
Generating target moments...
Integrating terms for control group...
Integrating terms for treated group...
Generating IV-like moments...
Moment 1...
Moment 2...
Moment 3...
Moment 4...
Moment 5...
Moment 6...
Moment 7...
Moment 8...
Moment 9...
Moment 10...
Moment 11...
Moment 12...
Moment 13...
Moment 14...
Moment 15...
Moment 16...
Moment 17...
Moment 18...
Moment 19...
Moment 20...
Moment 21...
Moment 22...
Moment 23...
Moment 24...
Moment 25...
Moment 26...
Moment 27...
Moment 28...
Moment 29...
Moment 30...
Moment 31...
Moment 32...
Moment 33...
Moment 34...
Performing audit procedure...
Audit count: 1
Generating initial grid...
Minimum criterion: 9.765
Obtaining bounds...
Generating audit grid...
Violations: 0
Audit finished.
Bounds on the target parameter: [0.4944356, 0.4944356]
Bootstrap iteration 1...
Generating initial grid...
Audit count: 1
Minimum criterion: 4.1784
Bounds: [-7.389922e-16, 1]
Bootstrap iteration 2...
Generating initial grid...
Audit count: 1
Minimum criterion: 5.6871
Bounds: [0.03693981, 1]
Bootstrap iteration 3...
Generating initial grid...
Audit count: 1
Minimum criterion: 3.4253
Bounds: [7.424616e-16, 0.998077]
Bootstrap iteration 4...
Generating initial grid...
Audit count: 1
Minimum criterion: 2.857
Bounds: [1.00614e-16, 0.9969051]
Bootstrap iteration 5...
Generating initial grid...
Audit count: 1
Minimum criterion: 4.6809
Bounds: [0.1209569, 1]
Bootstrap iteration 6...
Generating initial grid...
Audit count: 1
Minimum criterion: 2.9855
Bounds: [5.342948e-16, 0.9670301]
Bootstrap iteration 7...
Generating initial grid...
Audit count: 1
Minimum criterion: 9.1647
Bounds: [-1.016548e-15, 1]
Bootstrap iteration 8...
Generating initial grid...
Audit count: 1
Minimum criterion: 3.1889
Bounds: [-3.642919e-16, 0.9373623]
Bootstrap iteration 9...
Generating initial grid...
Error in (function (data, target, late.from, late.to, late.X, genlate.lb, :
Error:Error : Matrices must have same number of columns in rbind2(.Call(dense_to_Csparse, x), y)
If it would be more useful to you, Josh, I can clean up the code and send you the code and dataset I have been using for these tests. I don't mind running the tests, but it may simplify the debugging and testing on your end.
Hm, yes, debugging would be easier with the code, since I'm not sure why that error message is popping up all of a sudden. Thanks!
Here is a zip folder with a small dataset and the code that runs ivmte (with 50 bootstrap replicates) on the three specifications: bootstrap_testing.zip
Let me know if anything is unclear (or if you notice any mistakes in my code)!
It turns out the problem is being caused by collinearity. Some variables in the IV-like regressions may be dropped because of collinearity in the bootstrap sample.
I wanted to confirm: once we've obtained the minimum criterion, the constraint $Q(\theta) \leq \hat{Q}(1 + \tau)$ does not appear when estimating the bounds.
If so, then the third bootstrap test by @johnnybonney gets updated to:
How does that look?
On this observation:
I should note that in cases two and three, the bootstrap bounds look quite odd (almost always very close to [0, 1])
I'm currently not sure... What comes to mind is that your S-set includes relatively few components compared to the number of terms in your MTRs (i.e. each MTR has over 100 terms, but your S-set has ~30 elements). So the constraints in the LP problem are probably relatively loose, allowing your MTRs to satisfy all the constraints while obtaining their lower/upper bounds of 0 and 1.
The problem with that thought is that you don't observe this with Specification 1, which has only 4 elements in the S-set...
If the bootstrap sample turns out to be collinear, we can just skip that sample and draw a new one. Probably good to keep a record of how many bootstrap draws were skipped for this reason.
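A minimal sketch of that skip-and-redraw logic in R (the objects here -- data, B, ivlike.formula, and the helper bootstrapCriterion() -- are placeholders for illustration, not the package's actual internals):

n.skipped     <- 0
boot.criteria <- numeric(0)
while (length(boot.criteria) < B) {
    bdat <- data[sample(seq_len(nrow(data)), replace = TRUE), ]
    # Skip the draw if the IV-like design matrix built from the bootstrap
    # sample is rank deficient (i.e. collinear), and record the skip.
    X <- model.matrix(ivlike.formula, data = bdat)
    if (qr(X)$rank < ncol(X)) {
        n.skipped <- n.skipped + 1
        next
    }
    boot.criteria <- c(boot.criteria, bootstrapCriterion(bdat))
}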
Regarding this:
I wanted to confirm: once we've obtained the minimum criterion, the constraint $Q(\theta) \leq \hat{Q}(1 + \tau)$ does not appear when estimating the bounds.
Let's make sure we're on the same page here. There is a constraint like that when estimating bounds, just as there always has been. But it's treated differently in resampling from the one that appears in the specification test. When estimating the bounds, we change both Q (the function) and \hat{Q} with every redraw, since we are just repeating the same procedure. For the specification test, we keep the Q and \hat{Q} showing up in the constraint as in the sample, and in every redraw we change the criterion that appears in the objective function.
If the bootstrap sample turns out to be collinear, we can just skip that sample and draw a new one. Probably good to keep a record of how many bootstrap draws were skipped for this reason.
Done!
Let's make sure we're on the same page here.
Sorry, I was unclear with my question, although you answered it. That is, when estimating the bounds, there is only one constraint involving the Q. We do not additionally include the Q-constraint from the specification test.
Also, the new bootstrap results: Minimum criterion = 9.76 -- misspecification p-value of 0.08
(Edit: actually, the previous run used 250 bootstraps, whereas this result of 0.08 was from 100 bootstraps. Redoing everything with 250 bootstraps, the p-value was 0.104, with 16 bootstraps having to be redone due to collinearity)
In case anyone was wondering, the previous set of results simply carried on despite the presence of collinearity. That is, if one of the covariates was dropped because of collinearity, but its coefficient was in the original S-set, then that just meant one less constraint in the LP problem in the bootstrap sample. Nevertheless, the corresponding restriction from the original sample still appears in the specification test.
This would run whenever bootstraps > 0. Let \hat{Q} be the minimum criterion value from the sample. For each bootstrap run, we get a minimum criterion value, call it \hat{Q}_{b}.
Then reject the null of correct specification at level \alpha if \hat{Q} is larger than the 1-\alpha quantile of \{\hat{Q}_{b}\}_{b=1}^{B}.
This should be really easy to implement given our current confidence interval procedure, right? We could just report a p-value for the test.
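As a rough sketch of how that report could look in R (data and the helper minimumCriterion() are hypothetical stand-ins for whatever computes the minimum criterion on a given data set):

B     <- 250
Qhat  <- minimumCriterion(data)            # minimum criterion from the sample
bootQ <- replicate(B, {
    bdat <- data[sample(seq_len(nrow(data)), replace = TRUE), ]
    minimumCriterion(bdat)                 # minimum criterion on the bootstrap sample
})
# Reject at level alpha if Qhat exceeds the (1 - alpha) quantile of bootQ;
# report the p-value as the share of bootstrap criteria weakly exceeding Qhat.
pvalue <- mean(bootQ >= Qhat)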