Synthetic Dataset Generation

laderast commented 6 years ago

I know that @sckott has written the charlatan package, which does some basic generation of patient variables. When we teach learning with biomedical data, we often need to use synthetic datasets because we can't use datasets with protected health information (PHI). Hence, I've had to generate patient datasets, but they need to have realistic dependencies: for example, patients who don't have hypertension should not show up as being treated for hypertension.

I have been generating synthetic data with Bayesian networks, which let you encode dependencies and structural zeroes in the data by specifying the conditional probability tables (CPTs). The CPTs are how you control the degree of association between variables. I have found that tuning these CPTs is a bit of an art.

The strength of this approach is that you can encode the degree of association of each variable with the outcome.
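To make the CPT idea concrete, here is a minimal sketch in plain Python. It is only an illustration, not the code behind cvdRiskData: the three variables, the network structure (hypertension as the parent of both treatment and outcome), and every probability are made up for the example. The `0.0` entry is the structural zero that keeps patients without hypertension from ever being marked as treated.

```python
import random

# Hypothetical CPTs: all names and probabilities are assumptions for illustration.
P_HTN = 0.3                            # P(hypertension = True)
P_TREATED = {True: 0.6, False: 0.0}    # P(treated | hypertension); 0.0 is a structural zero
P_CVD = {True: 0.25, False: 0.05}      # P(cvd | hypertension); the gap sets the association


def sample_patient(rng):
    """Draw one patient by sampling each node given its parent (ancestral sampling)."""
    htn = rng.random() < P_HTN
    treated = rng.random() < P_TREATED[htn]   # never True when htn is False
    cvd = rng.random() < P_CVD[htn]
    return {"htn": htn, "treated": treated, "cvd": cvd}


def sample_patients(n, seed=0):
    """Draw n patients with a seeded generator so the dataset is reproducible."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


patients = sample_patients(1000)
```

Moving the two `P_CVD` entries closer together or further apart is the "tuning" knob: it weakens or strengthens the association between hypertension and the outcome without touching the rest of the network.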

Would generalizing this be of interest to people?

laderast commented 6 years ago

You can see the general approach here: https://laderast.github.io/cvdRiskData/

maelle commented 6 years ago

Could https://github.com/kkholst/lava help?

laderast commented 6 years ago

@maelle interesting...I don't know much about structural equation modeling, but it seems like a nice approach, especially the graphing

maelle commented 6 years ago

I haven't used it, but I heard a talk about it and remembered thinking I'd like to try it next time I needed to simulate a dataset for sample size calculation (in epidemiology). So not much experience with it 😁

laderast commented 6 years ago

Summary: generalize a synthetic data generation framework for building teaching datasets and for benchmarking ML algorithms.

softloud commented 6 years ago

I feel like I spend all of my time on data simulation! I think this is a great idea.

I'm currently working on a method with my supervisor for simulating meta-analysis data. So, I could link my doctoral work as an extension for the more general package we develop at the unconf.

laderast commented 6 years ago

Very cool, @softloud. I'd love to talk with you about how you're doing this for meta-analysis.

ropensci / unconf18

Synthetic Dataset Generation #69