Thanks for the inquiry, @ddimmery! This is a very interesting piece of work! As you mention, this package may fit into "Workflow support", a category we don't yet have fully developed standards for, or possibly under "causal inference", a category that might be worth pursuing even though it isn't on our list. I'm curious about your own thoughts on where standards would be most helpful.
We have to decide which of these categories to develop based on the likely pipeline of packages and our ability to write useful guidelines. In either case, we won't be able to review this immediately, so I'm applying a "holding" tag for the moment. (I estimate it would take at least three months until we have a new category ready.) Do let us know if there's something we can help with in the meantime. Certainly, preparing with the autotest package and following the "General" standards will hasten review down the line and, we believe, help make high-quality software!
Closing this as we haven't expanded our standards categories. Will return and re-open if we do.
Submitting Author: Drew Dimmery (@ddimmery)
Repository: https://github.com/ddimmery/tidyhte/
Submission type: Pre-submission
Scope
Please indicate which category or categories from our package fit policies or statistical package categories this package falls under. (Please check an appropriate box below):
Data Lifecycle Packages
[ ] data retrieval
[ ] data extraction
[ ] database access
[ ] data munging
[ ] data deposition
[ ] workflow automation
[ ] version control
[ ] citation management and bibliometrics
[ ] scientific software wrappers
[ ] database software bindings
[ ] geospatial data
[ ] text data
Statistical Packages
[ ] Bayesian and Monte Carlo Routines
[ ] Dimensionality Reduction, Clustering, and Unsupervised Learning
[ ] Machine Learning
[ ] Regression and Supervised Learning
[ ] Exploratory Data Analysis (EDA) and Summary Statistics
[ ] Spatial Analyses
[ ] Time Series Analyses
[x] Workflow Support
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
Noam suggested by email that this might be the appropriate category for my package. I'm not implementing the supervised learning/regression algorithms themselves (for the moment, estimation is delegated to a combination of SuperLearner, nprobust, and vimp, depending on the specific task). Instead, I'm providing the tools to make sure that workflow concerns like cross-validation are handled correctly, and that the API is straightforward and easy to use. Suggestions about whether this fits in already-established categories would be very helpful!
Not yet; this package is not yet ready for a full review. I also don't yet know which categories' standards I should abide by.
This package is for practitioners of causal inference of all types. The idea is that an analysis of heterogeneous causal effects should be straightforward to obtain and shouldn't require hand-implementing boilerplate like cross-fitting (and the definition of how to do this should be transportable across different treatments and outcomes). The scientific applications span the many fields that have embraced causal inference: the social and behavioral sciences, epidemiology, etc.
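To make the cross-fitting boilerplate concrete, here is a minimal base-R sketch of what users currently have to hand-roll. This is illustrative only; `cross_fit` and its arguments are hypothetical names, not tidyhte's API:

```r
# Generic K-fold cross-fitting: fit each nuisance model on the off-fold
# data and predict on the held-out fold, so that no observation's
# prediction comes from a model trained on it.
cross_fit <- function(data, fit, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  preds <- numeric(nrow(data))
  for (fold in seq_len(k)) {
    held_out <- folds == fold
    model <- fit(data[!held_out, ])
    preds[held_out] <- predict(model, newdata = data[held_out, ])
  }
  preds
}

# e.g. a cross-fitted outcome regression with a linear model:
# mu_hat <- cross_fit(df, fit = function(d) lm(y ~ ., data = d))
```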
A good number of R packages do heterogeneous treatment effect estimation:
There are also many packages for estimating aggregated causal effects, e.g.:
This is only a small subset of the packages in this vein, but they are fairly representative.
Some of these methods and packages could eventually be integrated into tidyhte (e.g. balancing-weight methods could be used in lieu of the propensity score models, though it's not obvious whether this would be valuable).
To my knowledge, no packages provide full support for the entire "data to HTE inference" workflow, and none provide full support for the methods of Kennedy (2020), which tidyhte is built specifically to support. In some sense, the tidyhte workflow also nests methods for aggregated causal inference (in that case, it is equivalent to an AIPW estimator), but with built-in diagnostics and tests. I'm also seeking to support many more complicated cases that aren't commonly handled by packages like those listed above: population weights, non-iid (clustered) data, etc.
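To make the AIPW connection concrete, here is a minimal sketch (my own illustration, not tidyhte code; `dr_pseudo_outcome` is a hypothetical name) of the doubly robust pseudo-outcome from Kennedy (2020), computed from cross-fitted nuisance estimates:

```r
# Doubly robust (DR-learner) pseudo-outcome of Kennedy (2020).
#   y       : observed outcomes
#   a       : binary treatment indicator
#   pi_hat  : cross-fitted propensity scores P(A = 1 | X)
#   mu1_hat : cross-fitted outcome regression E[Y | A = 1, X]
#   mu0_hat : cross-fitted outcome regression E[Y | A = 0, X]
dr_pseudo_outcome <- function(y, a, pi_hat, mu1_hat, mu0_hat) {
  mu_a <- ifelse(a == 1, mu1_hat, mu0_hat)
  (a - pi_hat) / (pi_hat * (1 - pi_hat)) * (y - mu_a) + mu1_hat - mu0_hat
}

# Averaging the pseudo-outcome recovers the AIPW estimate of the ATE;
# regressing it on moderators estimates heterogeneous effects.
# ate_hat <- mean(dr_pseudo_outcome(y, a, pi_hat, mu1_hat, mu0_hat))
```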
I'm aiming to do this with an approach based on defining a "recipe" for how an analysis should be performed (constructed step by step using a tidy-style API), followed by a tidy-style API for actually running the estimation that recipe defines, with additional flexibility (e.g. running similar models across many different treatment-outcome pairs).
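As a rough sketch of the intended shape (the function and column names below are illustrative and may not match the eventual tidyhte API):

```r
library(magrittr)  # for %>%; all verbs below are hypothetical illustrations

# Step 1: define a "recipe" for the analysis, built up step by step.
cfg <- basic_config() %>%
  add_propensity_score_model("SL.glmnet") %>%
  add_outcome_model("SL.ranger") %>%
  add_moderator("Stratified", age_group)

# Step 2: apply the recipe, potentially across many
# treatment-outcome pairs, with the same tidy-style API.
results <- df %>%
  attach_config(cfg) %>%
  make_splits(user_id, .num_splits = 5) %>%
  produce_plugin_estimates(outcome, treatment, age_group) %>%
  construct_pseudo_outcomes(outcome, treatment) %>%
  estimate_QoI(age_group)
```

The design intent is that the recipe (step 1) is defined once and can then be reapplied to different treatment-outcome pairs in step 2 without re-specifying the cross-fitting machinery.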
Yes.
Not that I know of. I'm mainly interested in starting a dialogue around what kinds of standards I would need to aim for in an eventual submission.