latentIV: support for multiple endogenous instruments

hannesdatta commented 3 years ago

As part of a research project, I've extended the latentIV log-likelihood function to support multiple endogenous regressors with multiple levels. It's quite fast.

Is this something you'd be potentially interested in integrating?

I'm a bit wary of throwing out yet another package with incremental functionality, while you guys have already made tackling endogeneity more accessible.

Let me know about your interest, and then I can check how the function may fit in the existing structure of the package.

Cheers, Hannes

mmeierer commented 3 years ago

Thanks for reaching out.

Sounds very interesting. Great to learn about this.

There is a clear fit with the other routines implemented in REndo.

Feel free to fork the repo, include your algorithm in the current REndo version and submit a pull request. We will gladly review the proposed changes.

Is the paper on this model extension already published or is it still under review?

Obviously, we are happy to list you as an author on REndo's CRAN page (https://cran.r-project.org/web/packages/REndo/index.html).

hannesdatta commented 3 years ago

Hi Markus, amazing.

I think we should first converge on a few things:

FORMULA INTERFACE

The current formula interface isn't sufficient for specifying (a) which of the variables should be treated as endogenous, and (b) how many classes to estimate for each of these variables.

What's your vision on how you'd like such a formula interface to work? I see two options:

(option 1): latentIV(y ~ 1 + X1 + X2 + P1 + P2, data, start.params=c(), endogenous = ~ P1|2 + P2|3) to indicate that P1 and P2 are to be treated as endogenous variables, and estimated with two classes for P1 and three classes for P2.

(option 2): latentIV(y ~ 1 + X1 + X2 + P1|2 + P2|3)

What do you think about it?

USE OF OPTIMIZER

What motivated your choice for optimx? I can recover the parameters of a model (with a modified version of your likelihood function) much better using nlminb, see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nlminb.html.

SIMULATION/RECOVERY PROPERTIES

I quickly browsed the repos for routines to check whether parameter recovery works as intended. Does such a thing exist somewhere? In my application, I simulated data to verify I can recover the true values. Maybe a good testing thing to have eventually.

Finally, the paper is not published yet. It's actually a substantive paper - not a methodological one. LIVs are just part of the robustness checks I'm working on. Do you think a technical note is necessary?

Not sure how fast I can work on this... I may also ask some of the developers at tilburgsciencehub.com to chip in on this one, let's see. So don't be surprised if other people start forking, too. ;)

Rgui commented 3 years ago

Hi Hannes,

Thank you for your message and wanting to contribute. This would be a great addition to the package. Regarding your questions above:

I would go with the 2nd approach for the formula. In my opinion, it is closer to the interface of the other methods in the package. Markus, Patrik, what are your opinions on this?
I believe we used optimx because it is more verbose when it comes to errors. Right, Patrik?

pschil commented 3 years ago

Hi Hannes,

FORMULA INTERFACE

To be consistent with the rest of the package, I would definitely keep everything in the single formula and not specify it in a separate argument. Therefore, I would rather not go with (option 1).

(option 2) The Formula package splits the formula into parts indicated by |. The example y ~ 1 + X1 + X2 + P1|2 + P2|3 would result in the parts 1 + X1 + X2 + P1, 2 + P2, and 3, so that the number of clusters is actually not in the same sub-formula as the regressor. Also, reading the number out of a formula can be a bit tricky.
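The splitting described above follows from R's operator precedence: `|` binds more loosely than `+`. The following is a minimal base-R sketch of that effect. It does not call the Formula package, and `split_bars` is a hypothetical helper written only for this illustration.

```r
# Base-R illustration only; this does not use the Formula package.
f <- y ~ 1 + X1 + X2 + P1 | 2 + P2 | 3

# Hypothetical helper: recursively split an expression at every `|` call.
split_bars <- function(expr) {
  if (is.call(expr) && identical(expr[[1]], as.name("|"))) {
    c(split_bars(expr[[2]]), split_bars(expr[[3]]))
  } else {
    list(expr)
  }
}

# f[[3]] is the right-hand side of the formula.
parts <- split_bars(f[[3]])
part_strings <- vapply(parts,
                       function(p) paste(deparse(p), collapse = " "),
                       character(1))
print(part_strings)
```

The class counts 2 and 3 end up in different parts than P1 and P2, which is why option 2 cannot pair a regressor with its number of classes.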

(option 3) My suggestion would be to use the "special" functions in the resulting terms object. We use it for example for the IIV function in higherMomentsIV(y~X1+P|P|IIV(iiv=y2)+IIV(iiv=g,g=lnx,X1)) or in hetErrorsIV(y~X1+X2+P|P|IIV(X1,X2)).

I would therefore suggest something like this to make it clear which regressor and levels go together: latentIV(y ~ 1 + X1 + X2 + P1 + P2 | endogenous(P1, 2) + endogenous(P2, 3)). Or, with explicit args: latentIV(y ~ 1 + X1 + X2 + P1 + P2 | endogenous(reg=P1, cl=2) + endogenous(reg=P2, cl=3))

There likely should be a better name than endogenous(), and it should also still work with a single latent IV. Maybe actually keep it as IIV? Or singleIV and multiIV?

See the function formula_readout_special in f_formula_helpers.R, as well as f_heterrorsIV_IIV.R, line 9, and f_highermomentsIV.R, line 112, to see how the special functions are implemented for higherMomentsIV and hetErrorsIV. Also see higherMomentsIV_IIV and hetErrorsIV_IIV, which are the methods that are actually called for the IIV in the formula.
It's not quite straightforward but reasonably elegant.
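To make the "special" mechanism concrete, here is a minimal sketch using only base R's `terms()` with its `specials` argument. It is not REndo's `formula_readout_special`. The `endogenous(reg, cl)` signature and the `read_endogenous` helper are assumptions taken from the proposal above, and the sketch only handles the formula part to the right of the `|`.

```r
# Hypothetical special; only its argument names matter, it is never evaluated.
endogenous <- function(reg, cl) NULL

# Hypothetical helper: read regressor names and class counts out of a
# one-sided formula part such as ~ endogenous(P1, 2) + endogenous(P2, 3).
read_endogenous <- function(rhs_part) {
  tt   <- terms(rhs_part, specials = "endogenous")
  idx  <- attr(tt, "specials")$endogenous  # positions within "variables"
  vars <- attr(tt, "variables")            # a call of the form list(...)
  calls <- lapply(idx, function(i) vars[[i + 1L]])  # +1 skips `list`
  # match.call() resolves positional and named arguments alike.
  matched <- lapply(calls, function(e) match.call(endogenous, e))
  regs <- vapply(matched, function(m) as.character(m$reg), character(1))
  ncls <- vapply(matched, function(m) as.numeric(m$cl), numeric(1))
  setNames(ncls, regs)
}

res <- read_endogenous(~ endogenous(P1, 2) + endogenous(reg = P2, cl = 3))
print(res)
```

Because the regressor and its class count sit inside the same call, they stay paired no matter how the surrounding formula is split.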

USE OF OPTIMIZER

Verbosity is not really the reason ... ;)

We use optimx because it is not a single optimizer but a unified interface to multiple optimizers. It does not implement any procedure but provides a standardized way to use popular optimizers available throughout many packages. This way, everybody can choose the optimizer that fits the problem best.

To specify the optimization, the optimx.args parameter is used. To use nlminb instead of the standard Nelder-Mead, do: latentIV(y ~ P, data = dataLatentIV, optimx.args = list(method="nlminb")). See ?optimx for more information on what can be specified, and ?latentIV for more examples on how to use it.
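As a rough illustration of why the choice of optimizer can matter, the sketch below minimizes the same negative log-likelihood with base R's `optim()` (Nelder-Mead) and `nlminb()`. It deliberately uses neither optimx nor REndo. The toy normal likelihood and all variable names are assumptions made for this example, not the latentIV likelihood.

```r
set.seed(1)
x <- rnorm(500, mean = 2, sd = 1.5)

# Toy negative log-likelihood of a normal sample.
# sigma is parameterized on the log scale so it stays positive.
negll <- function(par) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

start  <- c(0, 0)
fit_nm <- optim(start, negll, method = "Nelder-Mead")
fit_nl <- nlminb(start, negll)

# Compare estimates and the attained objective values side by side.
comparison <- rbind(
  NelderMead = c(mu = fit_nm$par[1], sigma = exp(fit_nm$par[2]), negll = fit_nm$value),
  nlminb     = c(mu = fit_nl$par[1], sigma = exp(fit_nl$par[2]), negll = fit_nl$objective)
)
print(comparison)
```

On a well-behaved problem like this one both optimizers land close to the sample mean and standard deviation; differences tend to show up on harder likelihoods, which is where being able to switch methods through optimx.args helps.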

SIMULATION/RECOVERY PROPERTIES

There is currently an abundance of tests implemented in the tests/testthat/ folder (maybe too many, it makes the package very static...).

For every method, there are 4 different types of tests:

S3methods: calling all implemented s3 methods, for high test coverage and verifying they "work"
inputchecks: verifies input checks work as expected
runability: verifies method runs ("works") with different types of inputs
correctness: verifies correctness of method results (vs known results, same result after data transformations etc)

mmeierer commented 3 years ago

To add to @Rgui's and @pschil's feedback:

"I quickly browsed the repos for routines to check whether parameter recovery works as intended. Does such a thing exist somewhere? In my application, I simulated data to verify I can recover the true values. Maybe a good testing thing to have eventually."

--> For all implemented methods, we have included multiple simulated datasets (and documented the true parameters) to check their validity. Please see the REndo documentation.

--> Obviously, a nicer solution would be to go the extra mile and include a "simulation" function for each approach. The user would then be able to define the true parameter values themselves and thus create a custom dataset.
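A "simulation" function of the kind described here could look roughly like the base-R sketch below. It is not part of REndo. `simulate_liv_data`, its arguments, and the data-generating process (one endogenous regressor driven by a discrete latent instrument, with errors correlated through `rho`) are assumptions chosen for illustration.

```r
# Hypothetical simulator: P = class mean + v, y = b0 + b1 * P + e,
# where cor(e, v) = rho makes P endogenous.
simulate_liv_data <- function(n, b0, b1, class_means, class_probs, rho) {
  cls <- sample(seq_along(class_means), n, replace = TRUE, prob = class_probs)
  v <- rnorm(n)
  e <- rho * v + sqrt(1 - rho^2) * rnorm(n)  # unit variance, correlated with v
  P <- class_means[cls] + v
  y <- b0 + b1 * P + e
  data.frame(y = y, P = P, v = v, cls = cls)
}

set.seed(42)
truth <- list(b0 = 3, b1 = -1)
dat <- simulate_liv_data(5000, truth$b0, truth$b1,
                         class_means = c(-2, 2), class_probs = c(0.5, 0.5),
                         rho = 0.5)

# Naive OLS ignores the endogeneity, so its slope is biased away from b1.
ols <- coef(lm(y ~ P, data = dat))[["P"]]

# Infeasible "oracle" check: controlling for the true v recovers b1.
# v is unobserved in real data; this only validates the simulator.
oracle <- coef(lm(y ~ P + v, data = dat))[["P"]]

print(c(true = truth$b1, ols = ols, oracle = oracle))
```

A recovery test for a new estimator could then fit the model on `dat` and compare its slope against `truth$b1` within a tolerance, with the OLS slope serving as the biased baseline.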

"Finally, the paper is not published yet. It's actually a substantive paper - not a methodological one. LIVs are just part of the robustness checks I'm working on. Do you think a technical note is necessary?"

--> Having a technical note on your homepage that could be linked in the help file of your latentIV-function would be helpful for users. Eventually, you can then also include the published, substantive paper. (BTW, good luck with the review process!)

mmeierer / REndo

latentIV: support for multiple endogenous instruments #60