jkcshea / ivmte

An R package for implementing the method in Mogstad, Santos, and Torgovitsky (2018, Econometrica).
GNU General Public License v3.0

Bootstrap standard errors #31

Closed jkcshea closed 5 years ago

jkcshea commented 5 years ago

We talked about trying several bootstrap methods, but now I realize that for this two-step estimation procedure (i.e. estimate the propensity score model, then estimate the treatment effects) I only know of the most basic bootstrap approach (i.e. sample with replacement, estimate, and repeat).

So what other bootstrap approaches did you have in mind? The names should suffice, I can go look them up.

a-torgovitsky commented 5 years ago

That's the method, but there are several ways you can use the bootstrapped estimates to construct a confidence interval.

For example, you can just look at their quantiles, or you can compute an estimate of the standard error, etc. Chapter 11 of Cameron and Trivedi, if you have it, has a catalog of different ways you can use the bootstrapped estimates.

jkcshea commented 5 years ago

Ah okay, I see. Thank you.

jkcshea commented 5 years ago

Consistency of the bounds under partial identification

Below are the results from a Monte Carlo simulation for the set-identified case. I carried out 1,000 simulations for various sample sizes. I could have cranked up the number of simulations and sample sizes, though, as this did not take as long to run as I thought it would.

In order to run this simulation, I had to misspecify the model for the estimator, as the correct specification always led to point identification.

The bounds using the population distribution:

(6.403721, 8.480000)

Averages of bounds, and standard deviations of the boundary end points, from MC simulation:

|       | N = 50   | N = 100   | N = 500    | N = 1000   | N = 2000   |
|-------|----------|-----------|------------|------------|------------|
| avgLB | 5.802583 | 6.3049527 | 6.46344639 | 6.51640792 | 6.48274640 |
| avgUB | 8.349920 | 8.4578101 | 8.47441450 | 8.47897536 | 8.48080727 |
| sdLB  | 2.515473 | 1.9791573 | 1.28018520 | 1.01966185 | 0.76065396 |
| sdUB  | 0.791612 | 0.1496357 | 0.06320851 | 0.04393442 | 0.03102736 |

Although this is a small simulation, the bounds do look like they are converging to the population bounds. The behavior of the lower bound is strange, though. I plan to run this again with larger samples and more simulations, just to make sure it keeps converging.

Consistency of various estimates under point identification

The simulation is still running; this one takes a while. But a couple of things are perhaps worth mentioning.

When the sample is small, there are potentially collinearity issues. Currently, the estimator is programmed such that if bootstrap iteration b fails for whatever reason, the function will resample again until iteration b succeeds.

Following Cameron and Trivedi, and looking around online, I was only able to implement two approaches to generating confidence intervals.

  1. The quantile method, i.e. choose the alpha/2 and (1 - alpha/2) quantiles of the bootstrap distribution

  2. The percentile method, i.e. obtain the standard deviation of the bootstrap distribution, then use that with the normal distribution critical values to construct the confidence intervals

There is another percentile method, but that requires us to have asymptotic standard error estimates. If I were to instead use the bootstrap standard error estimates, then the method is no different from the quantile method.
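For reference, here is a minimal R sketch of the two approaches, where `point.est` (the estimate from the original sample) and `boot.est` (the vector of bootstrap estimates) are placeholder names rather than anything in the package:

```r
## Placeholder inputs: in practice these come from the bootstrap loop.
set.seed(1)
point.est <- 0.5
boot.est  <- rnorm(500, mean = 0.5, sd = 0.1)
alpha     <- 0.05

## 1. Quantile method: alpha/2 and (1 - alpha/2) quantiles of the
##    bootstrap distribution.
ci.quantile <- quantile(boot.est, probs = c(alpha / 2, 1 - alpha / 2))

## 2. Percentile method as described above: bootstrap standard error
##    combined with normal critical values around the point estimate.
se.boot   <- sd(boot.est)
ci.normal <- point.est + qnorm(c(alpha / 2, 1 - alpha / 2)) * se.boot
```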

One other possibility was the bias-corrected and accelerated (BCa) bootstrap. But since that requires the jackknife, I did not implement it. Let me know if you'd like to include it as an option, though.

And as alluded to above, I did not try to implement methods like residual or wild bootstraps.

Finally, confidence intervals and standard errors are calculated for all of our estimates (the propensity score model, the coefficients on the MTR, and the treatment effects).

a-torgovitsky commented 5 years ago

Partial identification

Can you run one that is correctly specified as well?

There is a tuning parameter involved in the misspecified case (the one that blows up the estimated identified set). So that in principle needs to change as the sample size does. This could be why consistency isn't so clear for the lower bound.

Point identification

I don't understand why there are still collinearity issues. I thought we had solved this. This needs to be tracked down 100% to understand what is going on. Just dropping bootstrap samples where something fails is not a good solution because it introduces something akin to sample selection.

The quantile and percentile methods are the right ones to look at. The quantile method should be correct under weaker conditions, but I'd like to see how both of them perform just in case.

jkcshea commented 5 years ago

Partial identification

Can you run one that is correctly specified as well?

Sure, I'll come up with an example. Maybe the test cases I have right now are just too simple, and thus always result in point identification.

Point identification

Here are the results from the simulation:

monte carlo results.pdf

The estimator looks like it is consistent, but I have to blow up the sample size to be more certain.

However, the confidence intervals are quite off... maybe I did something wrong again, so I'll be investigating.

Something I did not do, but should have done, was include the confidence intervals in the results. That probably would have made things easier to diagnose.

And I'll investigate the collinearity.

jkcshea commented 5 years ago

Partial identification

You are right, the tuning parameter has a big impact on the bounds. But I'm unclear on why I'm getting the results I'm getting.

In the example I am using (where the model is now correctly specified), the bound derived from population should be (4.64, 11.36).

If I use the default value of the tuning parameter you mentioned (which is 1, since the tuning parameter enters multiplicatively; see #4), even with 5,000 observations, the average of the bounds across the simulations is (6.44, 8.47).

If I reduce the tuning parameter to 0.1, the average bound widens and improves to (4.91, 10.56).

Perhaps this should be expected, but I thought it was strange since reducing the tuning parameter seems akin to strengthening the constraints. I would have thought this would tighten the bounds rather than widen them.

Plotting the distributions of the lower and upper bounds estimates shows that when the tuning parameter is 1, their distributions are approximately normal but centered incorrectly. But when the tuning parameter is reduced to 0.1, then they become very skewed, but most of the mass is at the true population values. In the case of the lower bound, there is a long right tail. In the upper bound, there is a long left tail.

So let me know if this outcome is what you had expected.

Point identification

I realized I made a coding error from the initial runs, so I've been trying to re-run this. Repeated memory issues on the server have popped up (which I still don't understand), so I haven't been able to increase the size of the simulation as much as I would have liked. The version currently running has only increased the number of simulations and bootstraps by 50% each. I'll post the new results once they become available (hopefully by the morning).

I did look into the collinearity issues, though. All the cases I've singled out are due to sampling, and seem to only occur when the size of the sample is small (less than 100). The way I checked this was simply to record the samples that exhibited collinearity at some point of the estimation procedure. I then added a single observation to the bootstrap sample to see if I could resolve the problem. In all the cases I looked at, adding one observation was enough to resolve the issue.

In theory, this can still happen with larger samples. So I've updated the simulations to keep count of this and see how frequently this happens as we increase the sample size.

But how would you like the code to handle this in general? At the moment, there is a warning message that informs the user of the number of times that a bootstrap estimate failed, but the code will continue resampling whenever a collinearity issue pops up.

a-torgovitsky commented 5 years ago

Partial identification

I don't think we ever actually discussed a default value for the tuning parameter. The overall factor has the form 1 + tuning. So are you saying the default is tuning = 1? Or tuning = 0?

Point identification

Again, I don't understand what these collinearity issues are, or why they are still popping up. I thought we had resolved this. The root cause needs to be pinned down 100% before putting in some stop-gap procedure like dropping bootstrap draws.

jkcshea commented 5 years ago

Partial identification

The way I have it set up is such that the user declares the overall factor obseq.tol = 1 + tuning.

So if the user thinks the model is correctly specified, then they should set the argument obseq.tol = 0.

By default, I have obseq.tol = 1.

I can change it such that the user declares only the tuning parameter, i.e. the user will have to set obseq.tol = -1 if they think the model is correctly specified.

Point identification

Sorry for being unclear on the nature of these collinearity issues. I'm referring to two cases.

  1. The first is when the OLS estimate/first step of FGLS fails. This happens when the X'X matrix cannot be inverted because X is not of full column rank.

  2. The second case is when the emat matrix (the sum of the outer products of the residuals from the OLS estimate) cannot be inverted. This is because of a lack of variation in the residuals.

In these simple test cases I've been looking at, there are only 4 kinds of agents indexed by (D, Z), where D = 0, 1, and Z = 1, 2. The cases above will occur if the bootstrap sample does not include all 4 types of agents.

The smallest sample size I played with was N = 25, and I never had fewer than 3 types of agents in the bootstrap sample. In those cases, adding in the fourth type of agent resolved the issue. But if I bring it down to N = 10, then I sometimes do get bootstrap samples with only 2 kinds of agents. Adding one of each of the other two agent types usually resolves the issue.

There are some cases where the data does include all 4 types of agents, but emat is still not invertible. Again, adjusting the bootstrap distribution by adding in one agent of any type resolves the problem (at least in the cases I've looked at).

There are also cases where the emat matrix is nearly singular. If I calculate the rank using R's qr command, the 3 x 3 emat matrix reportedly has a rank of 2, but R is still able to invert it using the solve command without requiring me to change the tolerance levels. In these examples, the inverse of emat contains some giant numbers, much larger than when emat has rank 3. Again, adding a single observation to the bootstrap sample can bring those numbers down substantially.
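As a rough illustration of that near-singularity (illustrative only, not the package's actual check), one can compare the rank reported by qr against the magnitudes returned by solve:

```r
## A nearly singular 3 x 3 matrix standing in for emat (made-up numbers).
emat <- matrix(c(1, 0, 1,
                 0, 1, 1,
                 1, 1, 2 + 1e-10),
               nrow = 3, byrow = TRUE)

qr(emat)$rank   # reports rank 2 at the default tolerance
kappa(emat)     # condition number; huge values flag near-singularity

## solve() may still return an "inverse", but with enormous entries.
inv <- try(solve(emat), silent = TRUE)
if (!inherits(inv, "try-error")) max(abs(inv))
```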

a-torgovitsky commented 5 years ago

Partial identification

I vaguely remember talking about this previously, but now I don't remember why we did it like this.

When the model is incorrectly specified, setting obseq.tol = 1 is too small, i.e. there should be some tuning parameter added to make it larger than 1. When the model is correctly specified, it shouldn't matter what we set obseq.tol equal to, since the criterion it is multiplying will be 0.

Point identification

The cases where all 4 types are not included make sense and are expected behavior. These are small-sample issues, and shouldn't occur with larger samples or less saturated specifications. So for these, the strategy of recording the number of failures and telling the user should be fine.

The emat issue is the one I am wondering about. Didn't we already track down the cause of this issue? This is the part that needs to be completely sewn up before we can continue.

jkcshea commented 5 years ago

Partial identification

So there is no contradiction with the tuning parameter, actually. When I reduce the parameter, the bound does shrink. What was behaving in a contradictory way was the average of the bounds across the simulations.

So there are cases where Q-hat > 0, despite the model being correctly specified. In these cases, the target parameter ends up being point identified if we set the tuning parameter to 0 (i.e. the overall factor is set to 1). And if the tuning parameter is reduced below 0, the model becomes infeasible, as expected.

This seems to explain why the average bounds widened once I reduced the tuning parameter. The point identified cases that previously raised the average lower bound and reduced the average upper bound were removed, since the estimator now treats those cases as being "infeasible."

I'll look into why exactly we have cases where Q-hat > 0 despite the model being correctly specified.

a-torgovitsky commented 5 years ago

Q-Hat can be larger than 0 if the model is correctly specified due to sampling error in the data moments that are being matched.

Q-Hat can also be 0 if the model is incorrectly specified; think of, for example, a model with a lot of parameters, which will have no problem matching the moments.

I don't think it should necessarily be the case that the bounds collapse to a point when Q-Hat > 0 and you set the factor to 1 (tuning parameter to 0). That would usually require there to be a unique solution to the linear program that minimizes Q-Hat.

Can you restate what the problem is at this point? It's not clear to me...the tuning parameter should always be >= 0, and the user should not be allowed to input something < 0.

jkcshea commented 5 years ago

Hmm, the problem seems to be that I had misunderstood 2 things:

1. Issue #3

Back in June, we discussed allowing the user to set the tuning parameter to 0.

I had misinterpreted it as meaning that we wanted to allow the user to set the overall factor to 0.

Your instructions were clear, but my thinking was not.

2. The results from the simulations when adjusting the tuning parameters

Based on playing with the tuning parameter, I had misinterpreted the simulation results as suggesting that a reduction in the tuning parameter widened the bounds on average. As we discussed yesterday, this makes no sense.

But what I realized last night was that setting the tuning parameter to be negative widened the average of the bounds.

The reason why the average of the bounds widened was because a negative tuning parameter eliminates all the point-identified cases.

a-torgovitsky commented 5 years ago

1. I should have qualified this: cases where the researcher wants to assume that the model is correctly specified with respect to the distribution of the observed data.

2. Is that because, when the tuning parameter was negative, simulations were getting dropped for being infeasible in the bound estimation stage? I'm not sure I understand what you mean by "a negative tuning parameter eliminates all the point-identified cases."

jkcshea commented 5 years ago

Not sure I understand what you mean by "a negative tuning parameter eliminates all the point-identified cases"

If I understand things correctly, every time Q-hat > 0 and the tuning parameter is set to 0, we should end up being point identified in the bounding stage.

This is what happened in the simulations I was running. Mathematically, this should be because there is a unique set of coefficients on the MTRs that corresponds to Q-hat.

So when I reduced the tuning parameter to be negative, that unique set of coefficients corresponding to Q-hat was no longer feasible, and then I got what you described:

... when the tuning parameter was negative simulations were getting dropped for being infeasible in the bound estimation stage

a-torgovitsky commented 5 years ago

If I understand things correctly, every time Q-hat > 0 and the tuning parameter is set to 0, we should end up being point identified in the bounding stage.

This seems like a consequence of your simulation design, but it isn't necessarily true. There might be multiple optimal solutions at the minimum Q-hat, since this is a linear program. In this case the min/max of the bounds when the tuning parameter is 0 would be the min/max of the target parameter across those solutions. I expect that if you make the estimated model richer (while still keeping it misspecified) you will be able to find cases in which Q-hat > 0 but the bounds do not collapse into a point.


Anyway, how do the simulations for the partially identified case look now? Ok?

jkcshea commented 5 years ago

They're behaving as I think they should as I vary parameters in the simulation, but I'm also not quite sure what to expect. The bounds depend on the tuning parameter, and I'm not sure what the correct value to set it to is (or whether there is one) for a given sample size.

Attached are some results from small simulations (that enormous point-identification simulation is still running on the server, but it's made progress, so I thought I'd let it finish).

debugLog-pp5-9.pdf

The distribution of the estimates of the upper bound has a long left tail, and the distribution of the estimates of the lower bound has a long right tail. So the estimates of the bound are generally too tight.

Increasing the tuning parameter allows the bounds to widen, but they can then become too wide. I tried setting tau to 4, and the estimate of the lower bound became biased downward (previously, it was biased upwards). But despite having tau = 4, the estimate of the upper bound was still biased downwards.

I'm not sure if I'm approaching this the right way, though. Whether the estimates of the bounds are consistent and unbiased seems hard to gauge unless I know what the right tuning parameter is for a given sample size.

a-torgovitsky commented 5 years ago

This all makes sense based on my experience.

Generally, these things are going to be biased inwards (lower bound upward biased, upper bound downward biased), although this will change depending on how large the tuning parameter is.

The tuning parameter is literally a tuning parameter, just like a bandwidth in kernel regression. Currently we have no theory on how to choose it, but in principle it should be possible to develop one.

For the default, let's keep the tolerance at 0.05, so that the overall factor is 1.05.

jkcshea commented 5 years ago

Sure, I'll set it up that way, and then look into the collinearity issues from the point identified case.

jkcshea commented 5 years ago

Simulations for point identified case

Attached is the new set of simulation results.

monte carlo results.pdf

The parameter estimates for the MTR and propensity scores look to be consistent. The target parameter seems to be biased upward, but I'm still unclear on whether or not it's consistent. So I'll see how things look if I increase the sample size one more time.

The coverage probabilities of the confidence intervals are too low, though.

Collinearity issues

We discussed how, if we were missing one of the four possible groups in the data, we should expect collinearity in the first step of the FGLS estimation. It turns out we can still successfully carry out the first OLS estimate without all four groups, but which group is missing determines the kind of collinearity problem we face.

So again, let each group be determined by the values of (D, Z), where Z = 1, 2. If we are missing group (1, 1), then the propensity score estimate p-hat(Z) would give us p-hat(1) = 0. So when constructing the gamma terms, all the terms associated with D = 1 will be 0 (since the region of integration is from 0 to p-hat(1) = 0). All these 0s result in a design matrix that is not full rank. The same happens if we do not include group (1, 2).

However, if we do not include group (0, 1), the design matrix is still full rank. The reason is that there are fewer terms in the MTR m0. So although all the gamma terms associated with D = 0 will be 0 for group (1, 1), there are few enough 0s that the design matrix remains full rank. Moreover, the OLS estimate will perfectly predict all the Ys for group (1, 1). As a result, many of the error vectors are just 0s, and we get collinearity in emat. I'm still looking into why we end up perfectly predicting the dependent variable for group (1, 1). The same happens if I omit group (0, 2).
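Going back to the missing-group case, a quick way to see the mechanism in a given bootstrap sample (purely illustrative, not the package's internal check) is to tabulate the (D, Z) cells and the implied propensity score estimates:

```r
## Simulate a bootstrap sample in which group (D, Z) = (1, 1) never occurs.
set.seed(1)
z <- sample(1:2, 50, replace = TRUE)
d <- rbinom(50, 1, prob = ifelse(z == 1, 0, 0.6))

table(d, z)          # shows which (D, Z) cells are present
tapply(d, z, mean)   # empirical p-hat(z); here p-hat(1) = 0

## When p-hat(z) is 0 (or 1) for some z, the gamma terms for D = 1 (or D = 0)
## integrate over an empty interval, so the corresponding columns of the
## design matrix are identically zero and it loses full column rank.
```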

There is a third case where I get collinearity in emat despite including all four groups. Update: I had forgotten to mention that in this case all the errors are non-zero. I still have to look into that.

a-torgovitsky commented 5 years ago

Point identified case

A few things that pop out to me here:

  1. What is going on with the SD's for N = 100? Some of them are enormous. There must be some huge outliers here. Can you determine what is driving those?
  2. Coverage probabilities are not good. Could you describe exactly how you are doing the bootstrap?
  3. It's not clear to me from your PDF how you are drawing Y0, Y1 and then Y. Don't you need to add some error to the MTR specification? These are conditional means...to draw Y0 and Y1 you need to specify some residual as well, right?

Collinearity issues

I'm concerned that maybe in 3) above you are not adding error... If that's the case, then I think that could lead to these types of problems.

jkcshea commented 5 years ago

Point identified case

1. What is going on with the SD's for N = 100?

I agree, will be looking into those.

2. Coverage probabilities are not good. Could you describe exactly how you are doing the bootstrap?

So I have a true DGP, characterized by 10 rows in a data frame, dts1.

Simulation 1

  1. I draw with replacement N observations from dts1. Call this dts1-s1.

  2. I then estimate the treatment effect using dts1-s1.

  3. Then I sample with replacement B times from dts1-s1, estimating the treatment effect for each resample.

  4a. To calculate the alpha-level confidence interval under the quantile method, I take the alpha/2 and (1 - alpha/2) quantiles of the distribution from step 3.

  4b. To calculate the alpha-level confidence interval under the percentile method, I calculate the standard error of the treatment effect estimate from the distribution obtained in step 3. Then I take the alpha/2 and (1 - alpha/2) quantiles of the normal distribution, multiply them by that standard error, and add them to the point estimate from step 2.

Simulations 2, 3, 4, ...

Repeat above.
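In code, one simulation draw looks roughly like the sketch below, with estimate_te() standing in as a placeholder for the actual two-step estimation call (it is not a function in the package):

```r
## Sketch of one simulation draw from the procedure described above.
estimate_te <- function(dat) mean(dat$y)   # placeholder for the real estimator

run_one_sim <- function(dts1, N, B, alpha = 0.05) {
  ## Step 1: draw the "observed" sample from the DGP frame dts1.
  obs <- dts1[sample(nrow(dts1), N, replace = TRUE), , drop = FALSE]
  ## Step 2: point estimate on the observed sample.
  te.hat <- estimate_te(obs)
  ## Step 3: resample B times from the observed sample and re-estimate.
  te.boot <- replicate(B, {
    bs <- obs[sample(nrow(obs), N, replace = TRUE), , drop = FALSE]
    estimate_te(bs)
  })
  ## Step 4a: quantile-method confidence interval.
  ci.q <- quantile(te.boot, c(alpha / 2, 1 - alpha / 2))
  ## Step 4b: percentile-method (normal critical values) confidence interval.
  ci.n <- te.hat + qnorm(c(alpha / 2, 1 - alpha / 2)) * sd(te.boot)
  list(estimate = te.hat, ci.quantile = ci.q, ci.percentile = ci.n)
}

## Example call with a stand-in 10-row DGP frame.
dts1 <- data.frame(y = rnorm(10))
run_one_sim(dts1, N = 100, B = 200)
```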

Coverage probabilities

Each simulation gives me a point estimate (step 2) and confidence intervals (steps 4a and 4b).

For each type of confidence interval from each simulation, I count the fraction of point estimates from all simulations that fall inside of it. I then average this fraction over all confidence intervals across simulations.

Could it have to do with the fact that I have included no errors? (So your suspicion below was completely correct.) But this seems unlikely, as the distribution of the treatment effect estimates and the estimates of the confidence intervals would be affected by this noise in the same way.

3. Drawing Y0 and Y1

Hm, embarrassed that I did not realize this. I am indeed drawing conditional means, and am not adding errors. So I'll add those in.

Collinearity issues

You are likely right that this has to do with how I'm drawing the Y1 and Y0.

a-torgovitsky commented 5 years ago

Ok, excellent. Without the additional errors, who knows what could happen; this could be causing the collinearity, bootstrap, and standard deviation problems. So after you take care of that, let's look at the same set of results again.

jkcshea commented 5 years ago

Attached are the new simulations, where I added errors to the outcome variable ey.

monte carlo results.pdf

Confidence Intervals

The confidence intervals are still not looking good, and I haven't figured out why. As usual, I suspect I made an error somewhere, but have yet to spot it.

Let me know if the procedure I outlined above is wrong; I wouldn't be surprised if I made a mistake somewhere.

Collinearity errors

Once I added errors to the outcome variable, the rate at which bootstraps fail declined a bit (previously, for N = 100, we averaged 0.44%; now we have 0.27%, and no bootstraps failed at all for N >= 1000).

Despite all this testing for the case N = 100, I have yet to randomly come across a case where such an error pops up. But I can manually construct such cases, as discussed below.

Massive SDs for N = 100

They appear to be caused by this particular regression in the S set:

ey ~ 1 + d | factor(z)

The instrument Z is either 1 or 2. In small samples, it is possible that the fraction of agents with Z = 1 taking treatment is almost equal to the fraction of agents with Z = 2 taking treatment. As a result, the first stage fails, since Cov[D, Z] ~ 0. This results in a huge s-weight and a huge value for beta_IV, which then affects the matrices involved in the FGLS estimate.

If I deliberately construct a sample such that Cov[D, Z] = 0 (i.e. a sample such that the fraction of agents with Z = 1 taking up treatment is equal to the fraction of agents with Z = 2 taking up treatment), then the function stops working and spits out an error. So I suspect this is what happened in those simulations that failed.
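The mechanics are easy to reproduce with a hand-rolled Wald/IV calculation (illustrative only, not how the package computes beta_IV):

```r
## A nearly irrelevant instrument: Pr(D = 1 | Z = 1) is almost equal to
## Pr(D = 1 | Z = 2), so Cov(D, Z) is close to zero.
set.seed(2)
n <- 100
z <- sample(1:2, n, replace = TRUE)
d <- rbinom(n, 1, prob = ifelse(z == 1, 0.50, 0.51))
y <- d + rnorm(n)

cov(d, z)               # approximately 0
cov(y, z) / cov(d, z)   # Wald/IV estimate: can be enormous (or undefined)
```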

a-torgovitsky commented 5 years ago

Monte Carlo

The bootstrap confidence intervals not working doesn't make sense.

The part of your procedure two posts above that I don't understand is:

So I have a true DGP, characterized by 10 rows in a data frame, dts1.

Why can the true DGP be characterized by 10 rows in a data frame? Especially after adding in the error terms. (You didn't update the error terms part of the most recent pdf, right?)

A good way to debug this is to check whether you can correctly bootstrap (within a slightly modified version of the program) a simple quantity, like a sample mean. If not, then there is something wrong with the way you are redrawing the data and/or bootstrap samples. If so, then there is something wrong with the MTE part. So then try to bootstrap different MTE quantities to see if you can find a pattern of which ones work or don't work.

Collinearity

Regarding the collinearity part, having a failure when Cov(D, Z) = 0 (or nearly 0) should be expected. This would be a problem even if the estimator were simple IV.

jkcshea commented 5 years ago

So I have a true DGP, characterized by 10 rows in a data frame, dts1.

Hm, I did not think things through again.

I had simply drawn data from that error-free DGP, then added errors to the observed outcomes.

But that is wrong.

I should have drawn all those u's from a Uniform[0, 1] and plugged them into the MTRs.

I'll correct that, as well as see how the bootstrap performs when estimating simpler quantities like the mean.

a-torgovitsky commented 5 years ago

I usually just write down the DGP the way you would think of drawing it. So, draw Z and (independently) a triple (U, V0, V1) with some dependence. Then generate D = 1(U <= p(Z)), Y0 = m0(U) + V0, Y1 = m1(U) + V1, and finally Y = D*Y1 + (1 - D)*Y0.

Ideally one chooses the primitives so that it is easy to analytically compute true values of the data distribution and target parameters
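A minimal R sketch of such a DGP, with p(Z), m0, and m1 chosen arbitrarily for illustration:

```r
## Draw the primitives, then construct D and Y as described above.
set.seed(3)
n  <- 1000
z  <- sample(1:2, n, replace = TRUE)
pz <- ifelse(z == 1, 0.4, 0.6)     # propensity score p(Z) (arbitrary choice)
u  <- runif(n)                     # unobserved heterogeneity
v0 <- rnorm(n, sd = 0.5)           # residuals, independent of U
v1 <- rnorm(n, sd = 0.5)

d  <- as.numeric(u <= pz)          # D = 1(U <= p(Z))
m0 <- function(u) 0.5 + 0.2 * u    # illustrative MTR functions
m1 <- function(u) 0.8 - 0.3 * u
y0 <- m0(u) + v0                   # Y0 = m0(U) + V0
y1 <- m1(u) + v1                   # Y1 = m1(U) + V1
y  <- d * y1 + (1 - d) * y0        # observed outcome Y = D*Y1 + (1 - D)*Y0
```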

jkcshea commented 5 years ago

Attached are the new simulation results, after making the corrections we discussed.

monte-carlo-results.pdf

The coverage probabilities look much better now. Though I find it strange that the coverage probabilities for the quantile method were better for N = 1000 than for N = 2000... This could be due to sampling error, but I would think that would mostly disappear by the time we have 1,000 simulations.

Note that the CIs for some of the MTR coefficients are not great. The errors I added make it impossible to consistently estimate some of the MTR coefficients. However, we can still consistently estimate the target treatment effect.

Large standard errors and collinearity issues appear again when the sample size is small. Again, this is because Pr(D = 1 | Z = 2) - Pr(D = 1 | Z = 1) = 0.2. I can run this again, but this time generate the data so that Pr(D = 1 | Z = 2) - Pr(D = 1 | Z = 1) = 0.5 to confirm that the standard errors do go down when we strengthen the instrument, and that the collinearity issues do become less frequent.

a-torgovitsky commented 5 years ago

Looking better, but the simulation setup is still not right, at least the way you wrote it down.

m0 is a function of U only. V0 is the difference between Y0 and its conditional mean, i.e. Y0 = m0(U) + V0 (see above), so that by construction E[V0 | U] = 0.

I may have made things too confusing above by suggesting to draw (U, V0, V1) with dependence. That is not necessary; they can be drawn independently, and in particular V0 and V1 can be independent of U.

jkcshea commented 5 years ago

Ah, sorry for the confusion. I have it up and running again, and I'll post the results once they're ready.

jkcshea commented 5 years ago

And here are the results:

Updated: monte-carlo-results.pdf

Updated: In the original upload, I had forgotten to update the description of the simulation at the beginning of the documentation.

Hm, but I'm not sure why the confidence intervals for the MTR coefficients using the percentile method are so bad... I calculated them the same way as I did for the coefficients in the propensity score model, which look okay. I'll double-check that.

a-torgovitsky commented 5 years ago

Can you update the model specification part in the .pdf?

jkcshea commented 5 years ago

Sorry, I had forgotten to do that. Not sure if GitHub informs you that I edit comments, but the corrected version is up.

a-torgovitsky commented 5 years ago

Ok perfect, thanks. Except for the CI2 coverage for the MTR coefficients that you noted, the simulations look great now, right? It's worth checking into those just to make sure there's no obvious coding mistake. One check would be to plot the distribution (over the MC simulations) of the MTR coefficients and see whether it looks normal. It should be. But if it isn't, that would be one reason CI2 would fail while CI1 would be fine. Another clue is to compare CI1 vs. CI2 simulation by simulation, and just look at whether they are massively different or only slightly different.

jkcshea commented 5 years ago

I found the coding error when constructing the confidence intervals using the percentile method. Rerunning the simulations one more time just to confirm, and will post the results once they're ready.

Also, I checked the distributions of all the parameter estimates while debugging, and they all look quite normal, but with a slight skew.

In the meantime, I'll get on with cleaning up the remainder of the code and getting it ready for CRAN.

a-torgovitsky commented 5 years ago

Sounds good. There's one more feature I want to put in before we submit, but it should be easy to code. Will post it as an issue in a few days.

jkcshea commented 5 years ago

Sure thing. And the confidence intervals now all check out, so I'll go ahead and close this issue.

monte-carlo-results.pdf