Open hannesdatta opened 3 years ago
Thanks for reaching out.
Sounds very interesting. Great to learn about this.
There is a clear fit with the other routines implemented in REndo.
Feel free to fork the repo, include your algorithm in the current REndo version, and submit a pull request. We will gladly review the proposed changes.
Is the paper on this model extension already published or is it still under review?
Obviously, we are happy to list you as an author on REndo's CRAN page (https://cran.r-project.org/web/packages/REndo/index.html).
Hi Markus, amazing.
I think we should first converge on a few things:
FORMULA INTERFACE
What's your vision on how you'd like such a formula interface to work? I see two options:
(option 1): latentIV(y ~ 1 + X1 + X2 + P1 + P2, data, start.params=c(), endogenous = ~ P1|2 + P2|3)
to indicate that P1 and P2 are to be treated as endogenous variables, and estimated with two classes for P1 and three classes for P2.
(option 2): latentIV(y ~ 1 + X1 + X2 + P1|2 + P2|3)
What do you think about it?
USE OF OPTIMIZER
What motivated your choice for optimx? I can recover the parameters of a model (with a modified version of your likelihood function) much better using nlminb, see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/nlminb.html.
SIMULATION/RECOVERY PROPERTIES
I quickly browsed the repos for routines to check whether parameter recovery works as intended. Does such a thing exist somewhere? In my application, I simulated data to verify I can recover the true values. Maybe a good testing thing to have eventually.
Finally, the paper is not published yet. It's actually a substantive paper - not a methodological one. LIVs are just part of the robustness checks I'm working on. Do you think a technical note is necessary?
Not sure how fast I can work on this... I may also ask some of the developers at tilburgsciencehub.com to chip in on this one, let's see. So don't be surprised if other people start forking, too. ;)
Hi Hannes,
Thank you for your message and wanting to contribute. This would be a great addition to the package. Regarding your questions above:
I would go with the 2nd approach for the formula. In my opinion, it is closer to the interface of the other methods in the package. Markus, Patrik, what are your opinions on this?
I believe we used optimx due to the fact that it is more verbose when it comes to errors. Right, Patrik?
Hi Hannes,
FORMULA INTERFACE
To be consistent with the rest of the package, I would definitely keep everything in a single formula rather than specifying it in a separate argument.
(option 1): Therefore, rather not.
(option 2): The Formula package splits the formula into parts separated by |. The example y ~ 1 + X1 + X2 + P1|2 + P2|3 would result in 1 + X1 + X2 + P1 and 2 + P2 and 3, so that the number of clusters is actually not in the same sub-formula as the regressor. Also, reading the number out of a formula can be a bit tricky.
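The split described in (option 2) follows from R's operator precedence (+ binds more tightly than |) and can be reproduced without the Formula package. The sketch below uses base R only. split_on_bar is a hypothetical helper written for this illustration. It is not part of REndo or Formula, and it only mimics the splitting on the right-hand side.

```r
# Hypothetical helper (not from REndo or Formula): split an expression on
# its top-level `|` operators and return each part as a string.
split_on_bar <- function(expr) {
  if (is.call(expr) && identical(expr[[1]], as.name("|"))) {
    # `|` is a binary operator, so recurse into both operands.
    c(split_on_bar(expr[[2]]), split_on_bar(expr[[3]]))
  } else {
    paste(deparse(expr), collapse = " ")
  }
}

f <- y ~ 1 + X1 + X2 + P1|2 + P2|3
parts <- split_on_bar(f[[3]])  # f[[3]] is the right-hand side of the formula
print(parts)
```

The class count 2 ends up in the same part as P2, and 3 stands alone, so neither count sits next to the regressor it was meant for.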
(option 3): My suggestion would be to use the "special" functions in the resulting terms object. We use it for example for the IIV function in higherMomentsIV(y~X1+P|P|IIV(iiv=y2)+IIV(iiv=g,g=lnx,X1)) or in hetErrorsIV(y~X1+X2+P|P|IIV(X1,X2)).
I would therefore suggest something like this to make it clear which regressor and levels go together:
latentIV(y ~ 1 + X1 + X2 + P1 + P2| endogenous(P1, 2)+endogenous(P2, 3))
with explicit args:
latentIV(y ~ 1 + X1 + X2 + P1 + P2| endogenous(reg=P1, cl=2)+endogenous(reg=P2, cl=3))
There should probably be a better name than endogenous(), and it should also still work with a single latent IV. Maybe actually keep it as IIV? Or singleIV and multiIV?
See function formula_readout_special in f_formula_helpers.R, and f_heterrorsIV_IIV.R, line 9, and f_highermomentsIV.R, line 112, to see how the special functions are implemented for higherMomentsIV and hetErrorsIV. Also see higherMomentsIV_IIV and hetErrorsIV_IIV, which are the methods that are actually called for the IIV in the formula.
It's not quite straightforward but reasonably elegant.
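To make the "special" function idea concrete, here is a minimal base R sketch of reading regressor/class pairs out of a terms object. It is not how REndo implements formula_readout_special. The names endogenous, reg, and cl are taken from the suggestion above, read_endogenous is a hypothetical helper, and for simplicity the sketch only handles the sub-formula after the |.

```r
# Hypothetical marker function: its only job here is to define the argument
# names (reg, cl) so that positional and named calls can both be matched.
endogenous <- function(reg, cl) NULL

# Hypothetical helper: number of classes per endogenous regressor.
read_endogenous <- function(f) {
  tt <- terms(f, specials = "endogenous")
  idx <- attr(tt, "specials")$endogenous             # positions of the specials
  calls <- as.list(attr(tt, "variables"))[-1][idx]   # drop the leading `list`
  matched <- lapply(calls, function(cl) match.call(endogenous, cl))
  regs <- vapply(matched, function(m) deparse(m[["reg"]]), character(1))
  n_cl <- vapply(matched, function(m) as.integer(eval(m[["cl"]])), integer(1))
  setNames(n_cl, regs)
}

res <- read_endogenous(~ endogenous(P1, 2) + endogenous(reg = P2, cl = 3))
print(res)
```

Because the arguments are matched with match.call, the positional form endogenous(P1, 2) and the explicit form endogenous(reg=P1, cl=2) are read the same way, and each class count stays attached to its regressor.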
USE OF OPTIMIZER
Verbosity is not really the reason ... ;)
We use optimx because it is not a single optimizer but a unified interface to multiple optimizers. It does not implement any procedure itself but provides a standardized way to use popular optimizers available throughout many packages. This way, everybody can choose the optimizer that fits the problem best.
To specify the optimization, the optimx.args parameter is used. To use nlminb instead of the standard Nelder-Mead, do: latentIV(y ~ P, data = dataLatentIV, optimx.args = list(method="nlminb"))
See ?optimx for more information on what can be specified, and ?latentIV for more examples of how to use it.
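To compare optimizers outside the package before settling on a method for optimx.args, a toy problem can help. The sketch below is base R only and uses neither optimx nor REndo: it minimizes the negative log-likelihood of a simulated normal sample with Nelder-Mead (via optim) and with nlminb. The data, the parameterization, and the name neg_loglik are assumptions made for this illustration, not the latentIV likelihood.

```r
set.seed(1)
y <- rnorm(2000, mean = 2, sd = 1.5)

# Negative log-likelihood; par = c(mu, log(sigma)) keeps sigma positive.
neg_loglik <- function(par, y) {
  -sum(dnorm(y, mean = par[1], sd = exp(par[2]), log = TRUE))
}

start <- c(0, 0)
fit_nm <- optim(start, neg_loglik, y = y, method = "Nelder-Mead")
fit_nlminb <- nlminb(start, neg_loglik, y = y)

# Collect the estimates on the original scale for a side-by-side look.
est <- rbind(
  NelderMead = c(mu = fit_nm$par[1], sigma = exp(fit_nm$par[2])),
  nlminb = c(mu = fit_nlminb$par[1], sigma = exp(fit_nlminb$par[2]))
)
print(est)
```

On a smooth two-parameter problem like this one both methods should agree closely. Differences are more likely to show up on harder likelihoods such as the multi-regressor one discussed here.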
SIMULATION/RECOVERY PROPERTIES
There is currently an abundance of tests implemented in the tests/testthat/ folder (maybe too many, it makes the package very static...).
For every method, there are 4 different types of tests:
To add to @Rgui's and @pschil's feedback:
"I quickly browsed the repos for routines to check whether parameter recovery works as intended. Does such a thing exist somewhere? In my application, I simulated data to verify I can recover the true values. Maybe a good testing thing to have eventually."
--> For all implemented methods, we have included multiple simulated datasets (and documented the true parameters) to check their validity. Please see the REndo documentation.
--> Obviously, a nicer solution would be to go the extra mile and include a "simulation" function for each approach. The user would then be able to define the true parameter values themselves and thus create a custom dataset.
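A "simulation" function of this kind could look roughly like the base R sketch below. Everything in it is an assumption made for illustration: the name simulate_latent_iv, its arguments, the single endogenous regressor P, and the chosen true values are not taken from REndo. The check at the end only confirms that the simulated data behave as intended (OLS is biased, while the normally unobserved group label works as an instrument). It does not run latentIV itself.

```r
# Hypothetical simulator: one endogenous regressor P driven by a latent
# discrete instrument z, with correlated errors e (outcome) and v (regressor).
simulate_latent_iv <- function(n, b0, a, group_means, group_probs, rho) {
  z <- sample(seq_along(group_means), size = n, replace = TRUE,
              prob = group_probs)
  v <- rnorm(n)
  e <- rho * v + sqrt(1 - rho^2) * rnorm(n)  # cor(e, v) = rho, var(e) = 1
  P <- group_means[z] + v
  y <- b0 + a * P + e
  data.frame(y = y, P = P, z = z)
}

set.seed(42)
true_a <- 2
sim <- simulate_latent_iv(n = 5000, b0 = 1, a = true_a,
                          group_means = c(0, 3), group_probs = c(0.5, 0.5),
                          rho = 0.5)

# OLS ignores the endogeneity; the Wald/IV estimate uses the latent label z.
ols_a <- unname(coef(lm(y ~ P, data = sim))["P"])
iv_a <- cov(sim$y, sim$z) / cov(sim$P, sim$z)
print(c(true = true_a, ols = ols_a, iv_with_true_z = iv_a))
```

In an actual recovery test, the z column would be dropped and the estimate from something like latentIV(y ~ P, data = sim) would be compared against true_a.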
"Finally, the paper is not published yet. It's actually a substantive paper - not a methodological one. LIVs are just part of the robustness checks I'm working on. Do you think a technical note is necessary?"
--> Having a technical note on your homepage that could be linked in the help file of your latentIV-function would be helpful for users. Eventually, you can then also include the published, substantive paper. (BTW, good luck with the review process!)
As part of a research project, I've extended the latentIV log-likelihood function to support multiple endogenous regressors with multiple levels. It's quite fast.
Is this something you'd be potentially interested in integrating?
I'm a bit wary of throwing out yet another package with incremental functionality, while you guys have already made tackling endogeneity more accessible.
Let me know about your interest, and then I can check how the function may fit in the existing structure of the package.
Cheers, Hannes