mp2893 / medgan

Generative adversarial network for generating electronic health records.
BSD 3-Clause "New" or "Revised" License

Generating different features #13

Open Tomeu7 opened 5 years ago

Tomeu7 commented 5 years ago

Hello,

Suppose we generate data with two types of features, say age and diagnosis codes. Could we directly use data_type="count", so that we get both the number of times a person has been diagnosed with each condition and their age?

Otherwise, for binary data you mentioned that the activation functions need to be modified. For example, if we use age and diagnosis codes, with age as the last feature, is it enough to use ReLU for the count variable and sigmoid for the binary variables? What about the tanh activations, should I modify them too?

Thanks in advance.

mp2893 commented 5 years ago

Hi,

As for the first question, yes. You can use data_type="count". medGAN uses a technique called minibatch averaging, which tries to generate features that have a similar mean value as the true features.
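A minimal NumPy sketch of the minibatch averaging idea mentioned above, assuming it amounts to appending the per-feature batch mean to every sample before the discriminator sees it. The function name `add_minibatch_average` is hypothetical and this is not medGAN's actual TensorFlow code.

```python
import numpy as np

def add_minibatch_average(batch):
    """Append the per-feature batch mean to every row of the batch."""
    batch = np.asarray(batch, dtype=float)
    # Mean of each feature over the batch, shape (1, n_features).
    batch_mean = batch.mean(axis=0, keepdims=True)
    # Repeat the mean once per sample, shape (batch_size, n_features).
    tiled_mean = np.repeat(batch_mean, batch.shape[0], axis=0)
    # The discriminator input is then [sample, batch mean].
    return np.concatenate([batch, tiled_mean], axis=1)

batch = np.array([[0.0, 2.0],
                  [1.0, 4.0]])
augmented = add_minibatch_average(batch)
print(augmented)
```

Because every row carries the batch mean, a generator whose batches have the wrong per-feature average is easier for the discriminator to spot.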

As for the second question: the current medGAN implementation does not support using different activation functions for different features. You would have to modify the code yourself. Given that, yes, you should use the sigmoid function for binary features (it keeps outputs between 0 and 1) and the ReLU function to guarantee non-negative outputs for counts. Tanh is used only in the intermediate layers, so changing it to another activation function could improve performance for different tasks, but that is more of a tuning issue.
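A rough sketch of what per-feature output activations could look like, written in plain NumPy rather than as a patch to medGAN. The function name and the `binary_cols` / `count_cols` arguments are hypothetical; in practice the same split would have to be applied wherever the code currently picks one activation for the whole output.

```python
import numpy as np

def mixed_output_activation(logits, binary_cols, count_cols):
    """Sigmoid on binary columns, ReLU on count columns.

    Assumes binary_cols and count_cols together cover every column;
    any column left out stays at zero.
    """
    logits = np.asarray(logits, dtype=float)
    out = np.zeros_like(logits)
    # Sigmoid squashes binary features into the open interval (0, 1).
    out[:, binary_cols] = 1.0 / (1.0 + np.exp(-logits[:, binary_cols]))
    # ReLU keeps count-like features (e.g. age) non-negative.
    out[:, count_cols] = np.maximum(logits[:, count_cols], 0.0)
    return out

# Two diagnosis-code columns followed by an age column.
logits = np.array([[0.0, -3.0, 41.5],
                   [2.0, 1.0, -0.5]])
activated = mixed_output_activation(logits, binary_cols=[0, 1], count_cols=[2])
print(activated)
```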

Best, Ed

Tomeu7 commented 5 years ago

Hello, thanks for the answer.

I've tried using "count" on a dataset combining diagnosis codes and age, and it gives strange results. First of all, the mean of the original dataset is around 0.005-0.00005 per diagnosis and the mean of age is around 40 (with a maximum of around 85).

Despite this, the generated value for some diagnoses is over 1 (for some even 10 or 20), and the generated age can even be over 200. Do you know why that could be? Maybe there is not enough data.

Another option is to transform age into binary features, as described in the paper, and include them as additional features.

Thanks, Tomeu.

mp2893 commented 5 years ago

I don't know the full details of your experiments, so I can only guess at the source of the problem. For diagnosis codes, you wanted to generate counts, so "the generated result for some diagnosis is over 1 (for some even 10 or 20)" seems fine to me. As for age, there can be outliers, since medGAN has no idea what "age" really is. The generator just tries to trick the discriminator, and getting the right balance between the generator and the discriminator is the key (it takes some time to figure out which epoch to choose the model from). As you said, having a large dataset definitely helps. I've had a better experience with 260K patient records (proprietary data from Sutter Health) than with 50K patient records (MIMIC-III).

Best, Ed

Tomeu7 commented 5 years ago

Thanks again Ed,

I have been playing with modifying the neural network in generateData, buildDiscriminator and buildAutoencoder, the places where sigmoid/ReLU was used to distinguish between binary and count. I used binary cross-entropy for the binary features and MSE for the count features, with the corresponding activation functions (sigmoid and ReLU respectively).

The results are strange: before, with only binary features, the discriminator accuracy decreased steadily, but now it varies a lot (in a few epochs the accuracy is either 0.5 or 1).

Additionally, I set ReLU for the intermediate layers. Maybe I should change that?
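The mixed loss described above (binary cross-entropy on binary columns, MSE on count columns) can be sketched as follows. This is an illustrative NumPy version under assumed names (`mixed_reconstruction_loss`, `binary_cols`, `count_cols`), not the modified code from this thread.

```python
import numpy as np

def mixed_reconstruction_loss(x, x_hat, binary_cols, count_cols, eps=1e-12):
    """Binary cross-entropy on binary columns plus MSE on count columns."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)

    # Cross-entropy term: clip predictions so log() never sees exactly 0 or 1.
    x_bin = x[:, binary_cols]
    p_bin = np.clip(x_hat[:, binary_cols], eps, 1.0 - eps)
    bce = -np.mean(x_bin * np.log(p_bin) + (1.0 - x_bin) * np.log(1.0 - p_bin))

    # Squared-error term for the count (or age) columns.
    mse = np.mean((x[:, count_cols] - x_hat[:, count_cols]) ** 2)
    return bce + mse

# One patient: two binary diagnosis flags and an age of 40.
x = np.array([[1.0, 0.0, 40.0]])
good = mixed_reconstruction_loss(x, [[0.9, 0.1, 41.0]], [0, 1], [2])
bad = mixed_reconstruction_loss(x, [[0.9, 0.1, 60.0]], [0, 1], [2])
print(good, bad)
```

In this toy example the squared error on the unscaled age column is far larger than the cross-entropy term, so the two terms may need rescaling or weighting. That imbalance is one possible contributor to unstable training, though not necessarily the cause here.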

Thanks in advance for any additional comment!

mp2893 commented 5 years ago

Using ReLU in the intermediate layers should not be a huge factor. I wouldn't expect the accuracy of the discriminator to decrease monotonically, because a GAN is basically a game between the generator and the discriminator. I can't say much since I do not know the full extent of the changes you made to the code. However, it is strange that the accuracy is either 0.5 or 1. If you choose a model from some random epoch and generate data, do the synthetic data look real, or are they completely off? (And if the results are off, is that the case for models from all epochs?)

Tomeu7 commented 5 years ago

When I choose an epoch with 0.5 or 1 accuracy, the results are somewhat random. There are also some epochs that do not give exactly 0.5 or 1 accuracy, for example 0.6; there the results are somewhat similar to the original dataset, at least for the binary features, but much worse than when I used binary features only.

I know that the expected results should be worse but I am trying to see if I did something wrong or there could be any improvement available.

Thanks!

mp2893 commented 5 years ago

What medGAN can actually do depends on how large and complex your dataset is. Unless I'm sitting side by side with you, looking at your monitor, I'm just poking in the dark here. If it's only the age and the number of occurrences of some diagnosis codes, then you can try normalizing their values and then using data_type="count" with a sufficiently large batch size. (If you've already tried this, I really can't help you much.)

2g-XzenG commented 5 years ago

Hi @Tomeu7 @mp2893, thanks for your discussion, it is very helpful! I picked up some methods from you and ran the MedGAN model on two datasets, MIMIC-3 and NCH (500K sample size, proprietary data from Nationwide Children's). Here are the experiment results I got:

- MIMIC-3 data (diagnosis; data-type="count"): synthetic data looks good (sanity check with dimension-wise average count).
- Entire NCH data (diagnosis; data-type="count"): synthetic data looks good.
- Entire NCH data (diagnosis + age; data-type="count"): synthetic data looks good.
- Entire NCH data (diagnosis + age + paid; data-type="count"): synthetic data looks bad.
- Part of the NCH data (50K, diagnosis + age; data-type="count"): synthetic data looks ok, not as good as before.

Given the observation above, and the fact that the range of age is small and the range of paid amount is huge, my guess is: MedGAN is able to generate features with a reasonable range (such as age), as long as you have enough data points.
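The dimension-wise average check used as a sanity check above can be sketched like this, assuming the real and synthetic data are both available as patient-by-feature matrices. The helper name and the toy arrays are made up for illustration.

```python
import numpy as np

def dimension_wise_average(real, synthetic):
    """Return per-feature means of both matrices and their absolute gap."""
    real_mean = np.asarray(real, dtype=float).mean(axis=0)
    synth_mean = np.asarray(synthetic, dtype=float).mean(axis=0)
    return real_mean, synth_mean, np.abs(real_mean - synth_mean)

# Toy data: two diagnosis-code columns followed by an age column.
real = np.array([[1, 0, 30],
                 [0, 0, 50]])
synthetic = np.array([[1, 0, 35],
                      [1, 0, 55]])
real_mean, synth_mean, abs_gap = dimension_wise_average(real, synthetic)
print(real_mean, synth_mean, abs_gap)
```

Plotting `real_mean` against `synth_mean` per feature gives a quick visual check: points close to the diagonal mean the marginal averages match. It says nothing about correlations between features.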

Note that I didn't change the code. Hope this info helps! Thanks:)

mp2893 commented 5 years ago

Hi Xianlong,

Thanks for sharing your results. It's interesting that medGAN performs poorly when paid amount is used as a feature. Maybe this can be mitigated by providing standard deviation of paid amounts in a minibatch to the discriminator (in addition to providing the average of paid amounts in a minibatch). This way, unless the generator properly generates diverse paid amounts, it won't be able to fool the discriminator.
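A small sketch of the suggestion above, assuming the minibatch standard deviation is simply appended next to the minibatch average. The function name `add_minibatch_stats` is hypothetical and the code is plain NumPy, not a change to medGAN itself.

```python
import numpy as np

def add_minibatch_stats(batch):
    """Append per-feature batch mean and standard deviation to every row."""
    batch = np.asarray(batch, dtype=float)
    n = batch.shape[0]
    mean = np.tile(batch.mean(axis=0), (n, 1))  # shape (n, n_features)
    std = np.tile(batch.std(axis=0), (n, 1))    # population std, shape (n, n_features)
    return np.concatenate([batch, mean, std], axis=1)

# Two batches of paid amounts with the same mean but different spread.
collapsed = add_minibatch_stats([[100.0], [100.0]])
diverse = add_minibatch_stats([[0.0], [200.0]])
print(collapsed)
print(diverse)
```

Both toy batches have a mean of 100, so the average alone cannot tell them apart; the appended standard deviation (0 versus 100) exposes the batch that has collapsed to a single value.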

Best, Ed

2g-XzenG commented 5 years ago

Hi Ed,

Thanks for the suggestion! I had actually already tried adding the std and median together with the minibatch average. The performance is still not good.

I also tried normalizing the paid feature. Although this reduces the influence of the minibatch average, the quality of the generated data improved quite a bit: the dimension-wise average count for diagnosis codes looks good, but the generated paid amount (transformed back) still looks bad.

It seems that if I don't normalize the paid amount, the huge range of this feature (from 0 to 3 million) makes it much harder to train the discriminator and the autoencoder. If I do normalize this feature, the errors are amplified when the values are transformed back.

I am trying to train on the un-normalized data for a larger number of epochs, but other than that, do you have any suggestions?

Thanks! Xianlong

mp2893 commented 5 years ago

If you normalized the paid amounts by subtracting the mean and dividing by the standard deviation, medGAN won't work, because the normalized paid amounts will have negative values (note that medGAN only supports either binary or count values). How about dividing the paid amounts by some power of ten (e.g. 10000)?
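To illustrate the difference, here is a short sketch with made-up paid amounts. The values and the `SCALE` constant are assumptions for the example only.

```python
import numpy as np

# Hypothetical paid amounts spanning a very wide range.
paid = np.array([0.0, 1200.0, 45000.0, 3000000.0])

# Standardization produces negative values, which count-type outputs cannot represent.
z_scored = (paid - paid.mean()) / paid.std()

# Dividing by a power of ten shrinks the range but keeps everything non-negative.
SCALE = 10_000.0
scaled = paid / SCALE

# Generated values would be multiplied by the same constant to go back to currency units.
restored = scaled * SCALE

print(z_scored.min(), scaled.min(), scaled.max())
```

Note that multiplying back by `SCALE` also multiplies any generation error by the same factor, which matches the amplification effect mentioned earlier in the thread.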

Myshgithub commented 5 years ago

Hi all. As discussed in "Other fields in data generation" (#10: https://github.com/mp2893/medgan/issues/10), medGAN does not generate Patient IDs (unless we include them in the training data). So my question is: when we run medGAN several times, does it generate synthetic samples with the same Patient IDs, in the same order? If so, can we compare the different generated records for each patient? Thank you.

mp2893 commented 5 years ago

Hi Myshgithub,

The synthetic records generated by MedGAN will have no relationship whatsoever with the original Patient IDs. The generated records will be purely synthetic, and every time you generate a new batch, you get a fresh batch of synthetic records. Sometimes you might get duplicate records by chance, but that doesn't mean they are the same patient. MedGAN does not understand the concept of Patient ID, unless you modify it somehow.

Best,