tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Combination of rsample with Amelia for missing values #28

Closed jroberayalas closed 4 years ago

jroberayalas commented 6 years ago

Hi! I was just wondering if it'd be possible to use the rsample package with Amelia, which applies multiple imputation: it imputes m values for each missing cell in a data matrix and so creates m "completed" data sets. Given the usual uncertainty in the imputed values, Amelia offers some measure of confidence in whatever metric is computed, but I'm not sure whether this can be combined with or extended to the rsample and tidyposterior packages, which would be ideal. Any comments on this are highly appreciated!

Kind regards, Roberto

topepo commented 6 years ago

That is an interesting prospect: can we have a tidy implementation of multiple imputation methods? I don't think that it belongs in rsample, mostly because I want to keep the scope of the package small and focused.

I think that it might be good to have a conversion function (or maybe a tidy method) that can take an imputation object and make it workable with purrr::map and other tidyverse components. Before I left Pfizer, we had a non-trivial data analysis workflow for a clinical trial that required more than a simple function call (say, to lm) to do the analysis, and we wrestled with how to do the MI with existing packages. A tidy approach would enable those types of analysis.

At first glance, it looks like Amelia (and mice and others) couple the imputation and analysis. While that gives you a simple API, it does make things difficult if you want to control or modify the process. Perhaps they are decoupled in worker functions in those packages; I don't know enough about them. Perhaps the package authors would be interested in tidy approaches.
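A minimal sketch of what such a conversion could look like, in base R only. Everything here is an assumption made for illustration: `as_imputation_frame()`, `fit_one()`, and `pool_rubin()` are hypothetical helpers, not functions from rsample, Amelia, or mice, and the `completed` list is a toy stand-in for the m completed data sets an imputation package returns (for Amelia, the `imputations` element of the `amelia()` output). The pooling step uses the standard Rubin's rules formulas.

```r
# Toy data with missing values in x (simulated for illustration only).
set.seed(1)
n <- 100
raw <- data.frame(x = rnorm(n))
raw$y <- 2 * raw$x + rnorm(n)
raw$x[sample(n, 20)] <- NA

# Stand-in for an imputation engine: fill each missing x with a random draw
# from the observed x values. This is NOT Amelia's algorithm; it only produces
# a list of m completed data frames with the same shape as the input.
m <- 5
completed <- lapply(seq_len(m), function(i) {
  d <- raw
  miss <- is.na(d$x)
  d$x[miss] <- sample(d$x[!miss], sum(miss), replace = TRUE)
  d
})

# Hypothetical conversion helper: one row per imputation, with the completed
# data sets stored in a list-column (similar in spirit to an rset's splits).
as_imputation_frame <- function(imputations) {
  out <- data.frame(
    id = sprintf("Imputation%d", seq_along(imputations)),
    stringsAsFactors = FALSE
  )
  out$data <- imputations
  out
}

imps <- as_imputation_frame(completed)

# The analysis step is now decoupled from the imputation step: any function
# of a data frame can be mapped over the list-column.
fit_one <- function(d) {
  s <- summary(lm(y ~ x, data = d))$coefficients
  c(estimate = s["x", "Estimate"], std_error = s["x", "Std. Error"])
}
results <- t(vapply(imps$data, fit_one, numeric(2)))

# Rubin's rules: pooled estimate is the mean of the per-imputation estimates;
# total variance = within-imputation variance + (1 + 1/m) * between variance.
pool_rubin <- function(estimates, std_errors) {
  m <- length(estimates)
  within <- mean(std_errors^2)
  between <- var(estimates)
  c(estimate = mean(estimates),
    std_error = sqrt(within + (1 + 1 / m) * between))
}

pooled <- pool_rubin(results[, "estimate"], results[, "std_error"])
print(pooled)
```

With the data held in a list-column, `lapply()` or `vapply()` can be swapped for `purrr::map()` without changing the structure, and the model in `fit_one()` can be replaced by a multi-step workflow.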

I have some technical thoughts I could offer based on what I've learned in rsample, though.

(I must confess that I haven't used any multiple imputation methods (for inferential analysis) since graduate school; I'm usually concerned with prediction, so a single imputation is usually how that's done.)

Now that I've written this, I realize that I'm rambling. What do you think?

jroberayalas commented 6 years ago

Thank you very much for your reply. I find your ideas quite interesting. Currently, I'm comparing different indicators of cumulative blood pressure (BP) exposure, based on historical BP measures, to assess whether they can improve the performance of CVD predictive (Cox) models relative to commonly used models. So far, I'm mostly following your examples using the recipes and rsample packages for survival analysis, since this seems a nice way to assess the importance of the cumulative BP indicators. However, the dataset I was using has some lipid variables (cholesterol, HDL, LDL,...) with a high level of missingness (around 70%), which is why I was asking about the possibility of combining Amelia with rsample, as the two seem to share a lot of features. Nevertheless, I opted to simply omit the lipid variables, mainly because 70% missingness is too much and I do not think the models can benefit from them at all. Your examples with recipes and survival analysis are more appropriate for what I'm working on.

I do agree that a tidy approach with MI packages may be quite useful, since a lot of health research (at least here in Oxford) seems to use it a lot to overcome the uncertainty with missing values.

juliasilge commented 4 years ago

Thanks so much for your discussion! 🙌 I'm cleaning up older issues. Currently tidymodels handles imputation in the recipes package; check out recipe steps for imputation here.

zq2323 commented 4 years ago

Thanks a lot for your discussion! I'm very interested in the "examples with recipes and survival analysis" that jroberayalas mentioned in his reply above, but I can't find any link to or resource for those examples. Would you mind sharing them? I realize they may be hard to track down after so long.

jroberayalas commented 4 years ago

The example I'm talking about can be found here: https://rsample.tidymodels.org/articles/Applications/Survival_Analysis.html

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.