ijyliu / ECMA-31330-Project

Econometrics and Machine Learning Group Project

Simulation Part #75

Closed nicomarto closed 3 years ago

nicomarto commented 3 years ago

I took a look at @marionoro's simulation and the work looks awesome. What I realized is that running the model with all the mismeasured variables included has the best performance, and to be honest that makes sense under traditional econometric theory, since n = 1000.

So I was wondering what the gain from using PCA is, and at first it seems there is none. I think that may be true as n → infinity (here n = 1000), but it may not hold when samples are small. Why? Because having p = 50 with a small sample may not give nice results due to high-dimensionality. What do you guys think?
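To make the comparison concrete, here is a minimal sketch of the kind of small-sample experiment I have in mind. This is not our actual simulation code: the DGP (a latent confounder observed through p noisy proxies), the treatment `d`, and all parameter values are placeholders.

```python
# Hypothetical sketch: "all measurements" OLS vs. PCR in a small sample.
# gamma is the coefficient of interest on d; x_star is a latent confounder
# observed only through p mismeasured proxies z_1, ..., z_p.
import numpy as np
from sklearn.decomposition import PCA
import statsmodels.api as sm

rng = np.random.default_rng(0)

def simulate_once(n=100, p=50, gamma=1.0, beta=1.0, noise_sd=1.0):
    x_star = rng.normal(size=n)
    d = 0.5 * x_star + rng.normal(size=n)            # treatment correlated with the latent variable
    y = gamma * d + beta * x_star + rng.normal(size=n)
    Z = x_star[:, None] + noise_sd * rng.normal(size=(n, p))  # p noisy measurements

    # (a) "All measurements": include every noisy proxy as a control
    X_all = sm.add_constant(np.column_stack([d, Z]))
    gamma_all = sm.OLS(y, X_all).fit().params[1]

    # (b) PCR: control for the first principal component of the proxies
    pc1 = PCA(n_components=1).fit_transform(Z)
    X_pcr = sm.add_constant(np.column_stack([d, pc1]))
    gamma_pcr = sm.OLS(y, X_pcr).fit().params[1]
    return gamma_all, gamma_pcr

draws = np.array([simulate_once(n=100) for _ in range(500)])
print("all measurements: bias", draws[:, 0].mean() - 1.0, "sd", draws[:, 0].std())
print("PCR:              bias", draws[:, 1].mean() - 1.0, "sd", draws[:, 1].std())
```

Shrinking `n` relative to `p` in this sketch is the high-dimensionality case I mean.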

ijyliu commented 3 years ago

Yes, that's the most likely case for when "All Measurements" will fail.

I think actually N = 2000, so the observations label at the bottom of the table is a little misleading... maybe rename that to "Simulations" or something (bc it's 1000 sims)? But of course the point still stands.

paul-opheim commented 3 years ago

I was wondering about what you said @nicomarto. I don't get why including all variables does so well, given that the errors are serially uncorrelated within each measurement (e.g. z1). It doesn't seem like adding additional measurements should help improve the regression at all, but it does seem to help. Anyone have thoughts on that?

I'll make the change observation -> simulation @ijyliu. I agree with what you said.

nicomarto commented 3 years ago

> Yes, that's the most likely case for when "All Measurements" will fail.
>
> I think actually N = 2000, so the observations label at the bottom of the table is a little misleading... maybe rename that to "Simulations" or something (bc it's 1000 sims)? But of course the point still stands.

Oh, ok, I thought the N was for observations, not for simulations. We could try this in a small sample with a lot of mismeasured variables, and we should get better estimates under PCR.

nicomarto commented 3 years ago

> I was wondering about what you said @nicomarto. I don't get why including all variables does so well, given that the errors are serially uncorrelated within each measurement (e.g. z1). It doesn't seem like adding additional measurements should help improve the regression at all, but it does seem to help. Anyone have thoughts on that?
>
> I'll make the change observation -> simulation @ijyliu. I agree with what you said.

I suspect that the coefficients on the mismeasured variables would have high standard errors, and that would be the problem with including all the mismeasured variables. But since we only care about gamma, that is not a problem for us.

If we generalize equation 6 to more observations, we can see that the bias depends on the covariance matrix of all those variables. As we add more observations, by the LLN the sample covariances converge to those values, so we also identify the parameter cleanly.
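One way to see the intuition is the standard textbook attenuation algebra under classical measurement error (a generic result, not necessarily our equation 6): with p independent measurements of x*, the error variance of their average falls like 1/p, and the sample covariances that enter the bias converge to their population counterparts by the LLN.

$$
\bar{z}_i = x_i^* + \bar{e}_i, \qquad
\operatorname{Var}(\bar{e}_i) = \frac{\sigma_e^2}{p}, \qquad
\operatorname{plim}_{n \to \infty} \hat{\beta}_{\bar{z}} = \beta \,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_e^2/p}.
$$

So the attenuation factor approaches one as p grows, even though each individual measurement stays noisy.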

paul-opheim commented 3 years ago

Ah okay, makes sense.

ijyliu commented 3 years ago

In the empirical application, "all measurements" also leads to the smallest coefficients. PCA and the mean method are very similar (which makes sense if you look at the loadings figure: the weights are close to equal). For some reason IV is barely different from a single measurement.

I guess it's unclear whether "all measurements" is really the most correct. Maybe being closest to zero makes it roughly right, because with fixed effects it's pretty clear the relationship is not very causal...

I'm not discussing this point in the text at the moment but we can develop it more if desired.

Also, I think we can use the standard errors when discussing whether methods produce "significantly different" results from each other, e.g., by comparing 95% confidence intervals.
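Something like the following would be enough for that comparison. The coefficient and standard-error values below are placeholders, not our actual empirical results, and checking CI overlap is only a rough heuristic (a formal test would need the covariance between estimators).

```python
# Hedged sketch: 95% CIs built from reported coefficients and standard errors.
# All numbers are placeholders, not the paper's estimates.
estimates = {
    # method: (coefficient, standard error)
    "single measurement": (0.30, 0.05),
    "all measurements":   (0.10, 0.04),
    "mean":               (0.22, 0.05),
    "PCA":                (0.21, 0.05),
    "IV":                 (0.29, 0.08),
}

z = 1.96  # normal critical value for a 95% interval
for method, (coef, se) in estimates.items():
    lo, hi = coef - z * se, coef + z * se
    print(f"{method:>20s}: {coef:.3f}  [{lo:.3f}, {hi:.3f}]")
```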

ijyliu commented 3 years ago

Also I wonder what would happen if we transformed different fractions of the covariates rather than just half.

I guess if you transformed all but one (all but gdp, for example) you would maybe do best with PCA. Averaging and IV would maybe not do so well. I don't know how you would beat direct inclusion, though.
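In the simulation code it would only take something like the sketch below. `transform_fraction` and the scaling factor are made up for illustration, not our actual function.

```python
# Hedged sketch: rescale a chosen fraction of the measurement columns so the
# proxies live on different scales.
import numpy as np

def transform_fraction(Z, frac=0.5, scale=100.0):
    """Multiply the first `frac` share of the columns of Z by `scale`."""
    Z = np.asarray(Z, dtype=float).copy()
    k = int(round(frac * Z.shape[1]))
    Z[:, :k] *= scale
    return Z

# frac = 1 - 1/p would transform all but one column, as discussed above.
```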

paul-opheim commented 3 years ago

This makes sense. Do you think we should change the paper because of this, or are you more just wondering about this question?

ijyliu commented 3 years ago

Oh, wait, IV is doing the best now? It seems to be doing way better than everything else. That doesn't fit well with the empirical case. What happened? One thing we can do to knock IV, though, is mention that it seems to have a very high standard deviation in the sims.

I guess it must be because we switched from the p table to the rho table.

Update: the reason for the change is that we switched from doing IV by hand to using the IV functionality built into statsmodels.
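For reference, here is a hedged sketch of the "by hand" two-step version with one measurement instrumenting another (variable names are illustrative, not our actual code). A packaged IV routine should reproduce the point estimate, but the two-step shortcut's second-stage standard errors are not valid, which is one reason to prefer the built-in routine.

```python
# Hedged sketch of 2SLS "by hand": instrument the mismeasured regressor z1
# with a second measurement z2. Names are placeholders.
import statsmodels.api as sm

def tsls_by_hand(y, z1, z2):
    # First stage: project z1 on the instrument z2
    first = sm.OLS(z1, sm.add_constant(z2)).fit()
    z1_hat = first.fittedvalues
    # Second stage: regress y on the first-stage fitted values
    second = sm.OLS(y, sm.add_constant(z1_hat)).fit()
    return second.params[1]  # point estimate on the instrumented regressor
```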

ijyliu commented 3 years ago

As for trying different amounts of transformation, was there a particular reason Bonhomme said to try half, @nicomarto?

paul-opheim commented 3 years ago

I picked half. I don't think Bonhomme told us to do that specifically; it just seemed like an easy way to ensure that we had measurements on two different scales.

ijyliu commented 3 years ago

Discussion continues in other issues. I don't think trying different amounts of transformation is important anymore because we seem to have found relatively ideal conditions for our estimator anyway.