Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks

alxndrkalinin commented 7 years ago

Access to electronic health records (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic EHRs. Based on an input EHR dataset, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) via a combination of an autoencoder and generative adversarial networks. We also propose minibatch averaging to efficiently avoid mode collapse, and increase the learning efficiency with batch normalization and shortcut connections. To demonstrate feasibility, we showed that medGAN generates synthetic EHR datasets that achieve comparable performance to real data on many experiments including distribution statistics, predictive modeling tasks and medical expert review.
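For anyone unfamiliar with the minibatch averaging mentioned in the abstract, here is a minimal sketch. It assumes, as I read the paper, that the discriminator sees each record concatenated with the feature-wise mean of its minibatch. The function name `append_minibatch_average` and the toy data are hypothetical and not taken from the medGAN code.

```python
from typing import List


def append_minibatch_average(batch: List[List[float]]) -> List[List[float]]:
    """Concatenate each record with the feature-wise mean of its minibatch.

    Assumption: this mirrors the idea of minibatch averaging, not the
    authors' actual implementation.
    """
    if not batch:
        return []
    n_features = len(batch[0])
    if any(len(row) != n_features for row in batch):
        raise ValueError("all records must have the same number of features")
    # Feature-wise mean over the whole minibatch.
    mean = [sum(row[j] for row in batch) / len(batch) for j in range(n_features)]
    # Each discriminator input becomes [record, minibatch mean].
    return [list(row) + mean for row in batch]


# Toy minibatch of three identical binary records (hypothetical data),
# imitating a generator that has collapsed to a single mode.
batch = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1]]
augmented = append_minibatch_average(batch)
```

With a collapsed batch like this one, the appended mean is itself exactly binary, which would rarely be true for the mean of a batch of real records. That gives the discriminator a signal about sample diversity, which is presumably why the trick helps against mode collapse.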

Categorize->EHR?

agitter commented 7 years ago

I agree with your labeling. We may also want a Discussion subsection on synthetic training data, where this would be relevant.

alxndrkalinin commented 7 years ago

@agitter I saw you already had a few papers on GANs in the list. Would it be interesting to mention GANs separately as an emerging generative model, with examples?

agitter commented 7 years ago

It might be. At one point I had considered adding a Discussion section on emerging deep learning models that were being used in other domains but not biomedicine. Since then, GANs have been used in several areas we cover in the review. We have a Discussion section on data limitations (to be written) that could cover GANs and other more traditional models that train on synthetic data.

@cgreene any thoughts?

cgreene commented 7 years ago

I think the paper fits into Categorize. We have already covered it in journal club, and I can write about it. I have mixed feelings about this one: the title sounds good, but the paper is a bit weak. I can put it into context, though. It is a topic that I care about.

agitter commented 7 years ago

I'm now noticing that pull request #322 does cover some of these ideas on data limitations already.

greenelab / deep-review

Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks #324