greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Privacy-preserving generative deep neural networks support clinical data sharing #563

Open agitter opened 7 years ago

agitter commented 7 years ago

https://doi.org/10.1101/159756 (http://www.biorxiv.org/content/early/2017/07/05/159756.1)

Though it is widely recognized that data sharing enables faster scientific progress, the sensible need to protect participant privacy hampers this practice in medicine. We train deep neural networks that generate synthetic subjects closely resembling study participants. Using the SPRINT trial as an example, we show that machine-learning models built from simulated participants generalize to the original dataset. We incorporate differential privacy, which offers strong guarantees on the likelihood that a subject could be identified as a member of the trial. Investigators who have compiled a dataset can use our method to provide a freely accessible public version that enables other scientists to perform discovery-oriented analyses. Generated data can be released alongside analytical code to enable fully reproducible workflows, even when privacy is a concern. By addressing data sharing challenges, deep neural networks can facilitate the rigorous and reproducible investigation of clinical datasets.
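For readers skimming the thread, a minimal sketch of the core idea in the abstract may help: only the discriminator of a GAN ever touches real participant data, so clipping per-example discriminator gradients and adding Gaussian noise (DP-SGD style) before each update is one way to obtain a differentially private generator. Everything below (architecture, sizes, hyperparameters, the `dp_discriminator_step`/`generator_step` names) is an illustrative assumption, not the authors' implementation; see the preprint for the actual model and privacy accounting.

```python
import torch
import torch.nn as nn

N_FEATURES, NOISE_DIM = 10, 32               # e.g. per-subject clinical measurements
CLIP_NORM, NOISE_MULT, LR = 1.0, 1.1, 1e-3   # illustrative DP-SGD hyperparameters

generator = nn.Sequential(nn.Linear(NOISE_DIM, 64), nn.ReLU(), nn.Linear(64, N_FEATURES))
discriminator = nn.Sequential(nn.Linear(N_FEATURES, 64), nn.ReLU(), nn.Linear(64, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=LR)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=LR)
bce = nn.BCEWithLogitsLoss()

def dp_discriminator_step(real_batch):
    """One discriminator update with per-example gradient clipping and Gaussian
    noise. Only the discriminator sees real data, so only its gradients need
    the differential-privacy treatment."""
    clipped_sum = [torch.zeros_like(p) for p in discriminator.parameters()]
    for x in real_batch:                                  # microbatches of size 1
        fake = generator(torch.randn(1, NOISE_DIM)).detach()
        loss = (bce(discriminator(x.unsqueeze(0)), torch.ones(1, 1)) +
                bce(discriminator(fake), torch.zeros(1, 1)))
        grads = torch.autograd.grad(loss, list(discriminator.parameters()))
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(CLIP_NORM / (norm + 1e-12), max=1.0)  # clip each example
        for s, g in zip(clipped_sum, grads):
            s.add_(g * scale)
    for p, s in zip(discriminator.parameters(), clipped_sum):
        noise = torch.randn_like(s) * NOISE_MULT * CLIP_NORM      # Gaussian mechanism
        p.grad = (s + noise) / len(real_batch)
    d_opt.step()

def generator_step(batch_size):
    """Generator update; it never touches real data, so no extra noise is needed."""
    g_opt.zero_grad()
    fake = generator(torch.randn(batch_size, NOISE_DIM))
    loss = bce(discriminator(fake), torch.ones(batch_size, 1))
    loss.backward()
    g_opt.step()

# Hypothetical usage: real_data would be an (n_subjects, N_FEATURES) float tensor.
# for real_batch in real_data.split(64):
#     dp_discriminator_step(real_batch)
#     generator_step(64)
```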

Looking forward to this one @brettbj and @cgreene! Will one of you eventually add this to the review?

brettbj commented 7 years ago

I can make a pull request (over the weekend or early next week, unless someone gets to it first). I think there are two paragraphs in the categorize section this would fit well in; this was top of mind while we were writing them and doing the lit search the first time around.

evancofer commented 7 years ago

I have only glanced at this paper, but my initial thoughts are very positive. Differential privacy has recently seen some interesting applications in machine learning (e.g., 1, 2), motivated in part by some spectacular failures of naive anonymization, most famously the deanonymization of the Netflix Prize dataset (see: this paper, notes on it, and the relevant letter). Clearly, such deanonymization methods could be used by an agent with access to a database of identifiable patient data, and if that database were not HIPAA compliant (e.g., databases of rare SNPs at Ancestry.com, 23andMe, and so on), I imagine the legal barriers to use could be very high.

Fortunately, the paper from @brettbj & @cgreene addresses this deanonymization risk by training a GAN under differential privacy to generate synthetic data in place of the original records. This is a fresh approach, and it seems extremely applicable. My current opinion is that the method seems robust to deanonymization; since even the size of the dataset is hidden from the GAN's user, it could be very difficult to identify patients by rare attributes. That said, I have not made any attempts to compromise this approach, and I am not an expert in information security or privacy assurance.
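As an entirely toy illustration of the guarantee under discussion (not anything from the paper): under ε-differential privacy, a statistic released through, say, the Laplace mechanism changes only slightly in distribution when any single participant is added or removed, which is what bounds Netflix-style membership inference. The `private_count` helper and the tiny cohort below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_count(records, predicate, epsilon):
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    true_count = sum(predicate(r) for r in records)
    return true_count + rng.laplace(scale=1.0 / epsilon)

cohort = [{"sbp": v} for v in [118, 142, 151, 129, 160]]      # hypothetical subjects
with_subject = private_count(cohort, lambda r: r["sbp"] > 140, epsilon=0.5)
without_subject = private_count(cohort[:-1], lambda r: r["sbp"] > 140, epsilon=0.5)
print(with_subject, without_subject)  # the two noisy answers are hard to tell apart
```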

Note: if someone wants me to incorporate this text into the privacy and data sharing section, let me know where and I can put it in a PR.

cgreene commented 7 years ago

Should add a discussion of #539 as well when updating the review.

This section appears to be the one most in need of a light touch to add these new contributions: #### Data sharing is hampered by standardization and privacy considerations

I would suggest touching this sentence in the discussion to mention differential privacy: Sharing models for patient data requires great care because deep learning models can be attacked to identify examples used in training.
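To make the attack that sentence alludes to concrete, here is a hypothetical sketch of the simplest membership-inference heuristic, confidence thresholding on a deliberately overfit model; the data and model are made up, and real attacks are more elaborate, but the confidence gap it prints is the signal differential privacy is designed to suppress.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(50, 5)), rng.integers(0, 2, size=50)
X_out, y_out = rng.normal(size=(50, 5)), rng.integers(0, 2, size=50)

# A forest fit to random labels memorizes its training set.
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

def confidence_on_true_label(X, y):
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y]

# Training records get systematically higher confidence than unseen ones,
# so thresholding confidence recovers membership better than chance.
print(confidence_on_true_label(X_train, y_train).mean(),
      confidence_on_true_label(X_out, y_out).mean())
```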

Maybe also this sentence in the discussion: However, there are complex privacy and legal issues involved in sharing patient data that cannot be ignored.