Open agitter opened 7 years ago
I can make a pull request (over the weekend or early next week, unless someone gets to it first). I think there are two paragraphs in the categorize section this would fit well in. (This was top of mind while we were writing them and doing the lit search the first time around.)
I have only glanced at this paper, but my initial thoughts are very positive. Differential privacy has recently seen some interesting applications (e.g., 1, 2) in machine learning, but deanonymization has also produced some spectacular failures, most famously the re-identification of users in the Netflix Challenge dataset (see: this paper, notes on it, and the relevant letter). It is conceivable that such deanonymization methods could be used by an agent with access to a database of identifiable patient data. If such a database were not HIPAA compliant (e.g., databases of rare SNPs at Ancestry.com, 23andMe, and so on), I imagine the legal barriers to its use could be very high. Fortunately, the paper from @brettbj and @cgreene addresses the deanonymization issue by training a GAN to simulate the data while anonymizing it. This is a fresh application of differential privacy, and it seems broadly applicable. My current opinion is that this method seems robust to deanonymization; since even the size of the dataset is unknown to the GAN's user, it could be very difficult to identify patients by rare attributes. That said, I have not attempted to compromise this approach, and I am not an expert in information security or privacy assurance.
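For readers less familiar with what differential privacy guarantees, here is a minimal sketch of the classic Laplace mechanism (not the GAN-based method from the paper above; the function name and toy dataset are illustrative assumptions): noise calibrated to a query's sensitivity divided by a privacy budget ε is added before release, so the output distribution changes only slightly whether or not any single patient's record is in the dataset.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale sensitivity/epsilon,
    giving epsilon-differential privacy for this single query."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(0.0, sensitivity / epsilon)

# Toy example: privately release a count query over hypothetical patient ages.
# A count has sensitivity 1: adding or removing one patient changes it by at most 1.
ages = np.array([34, 51, 29, 62, 45])
true_count = int(np.sum(ages > 40))
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Smaller ε means more noise and stronger privacy; repeated queries consume the privacy budget additively, which is exactly the bookkeeping that privacy-preserving training procedures have to track.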
Note: if someone wants me to incorporate this text into the privacy and data sharing section, let me know where and I can put it in a PR.
Should add a discussion of #539 as well when updating the review.
This section appears to be the one most in need of a light touch to add these new contributions:
#### Data sharing is hampered by standardization and privacy considerations
I would suggest touching this sentence in the discussion to mention differential privacy:
> Sharing models for patient data requires great care because deep learning models can be attacked to identify examples used in training.
Maybe also this sentence in the discussion:
> However, there are complex privacy and legal issues involved in sharing patient data that cannot be ignored.
https://doi.org/10.1101/159756 (http://www.biorxiv.org/content/early/2017/07/05/159756.1)
Looking forward to this one, @brettbj and @cgreene! Will one of you eventually add it to the review?