mp2893 / medgan

Generative adversarial network for generating electronic health records.

BSD 3-Clause "New" or "Revised" License

A few questions #5

Closed 2g-XzenG closed 7 years ago

2g-XzenG commented 7 years ago

Hello Ed,

Thanks for sharing this great work with us!

After having trouble accessing the EHR dataset, I wondered whether we could generate synthetic data instead, and that led me to this paper.

I have a few questions though:

1. It seems to me that sequential patient data is more useful for many tasks. Have you tried to generate this kind of data (as you mentioned in future work)? For example, each patient could be treated as a matrix in which each row is a visit.
2. Have you tried any real-world tasks on the synthetic data? If so, can we trust the results we get from the synthetic data?
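
The per-patient matrix idea above can be sketched as follows. This is only an illustration of the data layout, not code from the medGAN repository: the function names, the code vocabulary `["A", "B", "C"]`, and the example patient are all made up. Each row is one visit and each column is one medical code; collapsing the rows gives a single flat patient-level vector with no visit order.

```python
def visits_to_matrix(visits, vocabulary):
    """Encode one patient as a visits-by-codes binary matrix (a list of rows)."""
    index = {code: i for i, code in enumerate(vocabulary)}
    matrix = []
    for visit in visits:
        row = [0] * len(vocabulary)
        for code in visit:
            row[index[code]] = 1  # mark the code as present in this visit
        matrix.append(row)
    return matrix


def collapse_visits(matrix):
    """Aggregate all visits into one binary vector (code seen in any visit)."""
    return [max(column) for column in zip(*matrix)]


# Hypothetical vocabulary and patient, for illustration only.
vocabulary = ["A", "B", "C"]
patient_visits = [["C"], ["B", "C"], ["A"]]

patient_matrix = visits_to_matrix(patient_visits, vocabulary)
patient_vector = collapse_visits(patient_matrix)
```

The matrix keeps the order of visits, which the collapsed vector discards; that ordering is what a sequential generator would have to model.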

Thanks! Xianlong

mp2893 commented 7 years ago

Hi Xianlong,

1. Actually, I'm currently working on it.
2. I've used synthetic data from medGAN to train a heart-failure prediction model (I supplemented the dataset with synthetic heart-failure case patients, as they are rarer than control patients), and I observed improved recall. But this was very preliminary work, and more rigorous evaluation is necessary.
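
A rough sketch of that kind of augmentation, under assumptions of my own: the helper names, the target case fraction, and the toy counts below are hypothetical and are not the setup used in the experiment described here. The idea is to work out how many synthetic case rows are needed to reach a chosen class balance, add them to the training set only, and measure recall on real held-out patients.

```python
import math


def synthetic_cases_needed(n_cases, n_controls, target_case_fraction):
    """Extra case rows k so that (n_cases + k) / total reaches the target."""
    total = n_cases + n_controls
    needed = (target_case_fraction * total - n_cases) / (1 - target_case_fraction)
    return max(0, math.ceil(needed))


def augment_training_set(rows, labels, synthetic_case_rows, n_extra):
    """Append up to n_extra synthetic case rows, all labeled 1."""
    extra = synthetic_case_rows[:n_extra]
    return rows + extra, labels + [1] * len(extra)


def recall(true_labels, predicted_labels):
    """Fraction of true cases that the model flagged as cases."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
    return true_positives / (true_positives + false_negatives)


# Toy numbers: 10 real cases, 90 real controls, aiming for a 50/50 split.
n_extra = synthetic_cases_needed(10, 90, 0.5)
```

Synthetic rows should go into the training split only; recall computed on a test set that contains synthetic patients would say little about performance on real ones.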

2g-XzenG commented 7 years ago

Hi Ed,

Thanks for the reply!

For 2: have you tried training the model entirely on synthetic data? If a model that performs well on the synthetic data also performs well on the real data (somewhat like training and validation sets), I think that would be a strong argument that the synthetic data is really good. Am I right?

Also, since you mentioned a heart-failure prediction model, I was wondering whether you are also generating the labels for the EHR data. For example, heart failure would be 1 and control would be 0. In other words, can this model be used to generate labeled data, for instance by adding the label as the last column of the data?

Thank you

mp2893 commented 7 years ago

Hi Xianlong,

Figures 3 and 7 in my paper are exactly what you described. I trained logistic regression classifiers with both real and synthetic data, then tested them on held-out real data. There are many details that cannot be covered here, so I recommend you read my paper.
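
The general shape of that comparison can be sketched as below. This is not the paper's code or data: the plain gradient-descent logistic regression, the function names, and the tiny two-feature toy rows are all stand-ins chosen so the example runs with the standard library alone. One classifier is fit on real rows, another on synthetic rows, and both are scored on the same held-out real rows.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, labels, lr=0.5, epochs=200):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(rows, labels):
            logit = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(logit) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def accuracy(model, rows, labels):
    """Share of rows whose thresholded prediction matches the label."""
    weights, bias = model
    correct = 0
    for row, label in zip(rows, labels):
        logit = sum(w * x for w, x in zip(weights, row)) + bias
        correct += int((1 if logit > 0 else 0) == label)
    return correct / len(rows)


# Toy stand-ins: feature 0 appears only in cases, feature 1 only in controls.
real_train, real_train_labels = [[1, 0], [1, 0], [0, 1], [0, 1]], [1, 1, 0, 0]
synthetic_train, synthetic_train_labels = [[1, 0], [0, 1], [1, 0], [0, 1]], [1, 0, 1, 0]
real_test, real_test_labels = [[1, 0], [0, 1]], [1, 0]

model_from_real = train_logistic_regression(real_train, real_train_labels)
model_from_synthetic = train_logistic_regression(synthetic_train, synthetic_train_labels)

score_real = accuracy(model_from_real, real_test, real_test_labels)
score_synthetic = accuracy(model_from_synthetic, real_test, real_test_labels)
```

If the two scores are close on held-out real data, the synthetic rows carried enough signal for this particular task; a large gap suggests they did not.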

You can generate a labeled dataset in many ways. You can add an additional column like you suggested. Or you can develop a conditional generator. In my case, I trained two separate medGANs, one for the case dataset and the other for the control dataset. But as I said, this experiment was not rigorously conducted, so I can't say that my method is optimal.
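
As a small illustration of the two-generator route, the sketch below merges the outputs of a case generator and a control generator into one labeled table, with the label as the last column. The function name and the sample rows are hypothetical; the two input lists simply stand in for whatever the separately trained generators produce.

```python
import random


def merge_with_labels(case_rows, control_rows, seed=0):
    """Append label 1 to case rows and 0 to control rows, then shuffle."""
    labeled = [row + [1] for row in case_rows]
    labeled += [row + [0] for row in control_rows]
    random.Random(seed).shuffle(labeled)  # fixed seed keeps the order reproducible
    return labeled


# Stand-ins for samples drawn from two separately trained generators.
generated_cases = [[1, 0, 1], [1, 1, 0]]
generated_controls = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]

labeled_rows = merge_with_labels(generated_cases, generated_controls)
features = [row[:-1] for row in labeled_rows]
labels = [row[-1] for row in labeled_rows]
```

With this route the label is known from which generator produced the row, whereas the extra-column route asks a single generator to learn the label jointly with the features.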

Thanks, Ed

2g-XzenG commented 7 years ago

Cool! I didn't see the connection between the two at first. I think it would be very useful if we could train models without accessing the real dataset. I will look into this direction.

Thanks!