Open laderast opened 6 years ago
You can see the general approach here: https://laderast.github.io/cvdRiskData/
Could https://github.com/kkholst/lava help?
@maelle interesting...I don't know much about structural equation modeling, but it seems like a nice approach, especially the graphing
I haven't used it, but I heard a talk about it and remembered thinking I'd like to try it next time I needed to simulate a dataset for sample size calculation (in epidemiology). So not much experience with it 😁
Summary: generalize a synthetic data generation framework for building teaching datasets and for benchmarking ML algorithms.
I feel like I spend all of my time on data simulation! I think this is a great idea.
I'm currently working on a method with my supervisor for simulating meta-analysis data. So, I could link my doctoral work as an extension for the more general package we develop at the unconf.
Very cool, @softloud. I'd love to talk with you about how you're doing this for meta-analysis.
I know that @sckott has done the charlatan package, which does some basic generation of patient variables. When we teach learning with biomedical data, we often need to use synthetic datasets because we can't use datasets with protected health information (PHI). Hence, I've had to generate patient datasets, but they need to have realistic dependencies: for example, patients who don't have hypertension should never appear as being treated for hypertension.

I have been generating synthetic data with Bayesian networks, which let you encode dependencies and structural zeroes in the data by specifying the conditional probability tables (CPTs). The CPTs are how you control the degree of association between variables. I have found that tuning these CPTs is a bit of an art.
The strength of this approach is that you can encode the degree of association of each variable with the outcome.
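To make the CPT idea concrete, here is a minimal sketch in plain Python (standard library only). It is not the cvdRiskData code: the variable names, the three-node network (hypertension → treated → cvd), and every probability are made-up assumptions for illustration. It shows how a structural zero (nobody without hypertension is treated) and the strength of association with the outcome both fall out of the tables.

```python
import random

# All probabilities below are hypothetical, chosen only for illustration.
P_HTN = 0.30  # P(hypertension = True)

# P(treated = True | hypertension). The 0.0 entry is the structural zero:
# a patient without hypertension is never treated for it.
P_TREATED_GIVEN_HTN = {True: 0.60, False: 0.0}

# P(cvd = True | hypertension, treated). The (False, True) combination is
# omitted because the structural zero above makes it impossible.
P_CVD_GIVEN = {
    (True, True): 0.15,
    (True, False): 0.30,
    (False, False): 0.05,
}


def sample_patient(rng):
    """Draw one patient by sampling each node given its parents."""
    htn = rng.random() < P_HTN
    treated = rng.random() < P_TREATED_GIVEN_HTN[htn]
    cvd = rng.random() < P_CVD_GIVEN[(htn, treated)]
    return {"hypertension": htn, "treated": treated, "cvd": cvd}


def simulate(n, seed=0):
    """Return a list of n synthetic patients, reproducible via the seed."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


patients = simulate(10_000, seed=42)
```

Widening or narrowing the gap between the entries of `P_CVD_GIVEN` is what tunes how strongly hypertension and treatment are associated with the outcome in the generated data.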
Would generalizing this be of interest to people?