greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Discrimination + Machine Learning #302

Open cgreene opened 7 years ago

cgreene commented 7 years ago

In the subsection of 03_categorize.md for #297 this came up. I think we need a bit more discussion around the topic to sufficiently resolve this. In the interests of time, I'm creating this issue for discussion before we go back to improve the section.

cgreene commented 7 years ago

@traversc : I created this for discussion. I think there's quite a bit of literature on differences in prescription practices by doctors based on racial and ethnic groups. I don't think we need an example of this being embedded into an ML model for this section. I think showing that it would exist in training data is sufficient.

cgreene commented 7 years ago

@agitter : I don't think that the potential for discrimination needs to be deep-learning specific (referring to https://github.com/greenelab/deep-review/pull/297#discussion_r110517161 ). I am hoping we can provide selected examples for readers to consider. I think @davharris also raised this discussion on Twitter not too long ago.

davharris commented 7 years ago

I don't know a whole lot about the medical/genetics side of things, but here's a list of things that come to mind. This is from memory, so I might get some details of the anecdotes wrong, but I can look the details up if they'd be helpful.

I have lots more to say about this sort of thing, but I think I'll stop here for now. Hope some of it's useful. Let me know if you have any questions about my examples or if you're looking for something else.

traversc commented 7 years ago

@cgreene @davharris:

I found an article discussing differences in opioid prescription rates by ethnicity. If you think it's a good example, I can write a short summary for the introduction section.

http://ajph.aphapublications.org/doi/abs/10.2105/AJPH.93.12.2067

PS: apologies for the deleted comment... we frequently upload things to our lab group github, so I'm often logged into the wrong account.

agitter commented 7 years ago

@davharris covered some of the examples that I had in mind when I left the comment in #297 (and more). Kate Crawford has written about these and related examples, and her NYT article could serve as a reference.

agitter commented 7 years ago

A new relevant paper is "Semantics derived automatically from language corpora contain human-like biases", with an accompanying discussion.
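For anyone following along, the core measurement in that paper (the Word Embedding Association Test, WEAT) is just a differential cosine-similarity score between word sets. A minimal sketch, using hand-made 2-d toy vectors in place of real trained embeddings (the word sets and vectors here are invented for illustration):

```python
import numpy as np

def cos_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    # s(w, A, B): how much more similar word vector w is to attribute set A than to B
    return np.mean([cos_sim(w, a) for a in A]) - np.mean([cos_sim(w, b) for b in B])

def weat_statistic(X, Y, A, B):
    # WEAT test statistic: differential association of target sets X and Y
    # with attribute sets A and B
    return sum(assoc(x, A, B) for x in X) - sum(assoc(y, A, B) for y in Y)

# Toy 2-d "embeddings" (made up for illustration): the first axis stands
# for one attribute pole, the second for the other.
A = [np.array([1.0, 0.0])]   # e.g. pleasant-attribute words
B = [np.array([0.0, 1.0])]   # e.g. unpleasant-attribute words
X = [np.array([0.9, 0.1])]   # target words drawn toward A
Y = [np.array([0.1, 0.9])]   # target words drawn toward B

print(weat_statistic(X, Y, A, B))  # positive: X associates with A, Y with B
```

On real corpora-trained embeddings, the same statistic recovers human-like biases; the point here is only that the test itself is simple enough to audit.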

cgreene commented 7 years ago

@davharris : This seems to be a topic that you're knowledgeable on. Thank you for contributing your knowledge thus far. Do you want to write a paragraph touching on this? If so, we'd love to have you contribute as a coauthor. If not, I can write based on the materials that you compiled.

davharris commented 7 years ago

Here's what I have. I'd open a PR for it myself, but it looks like the repository has a lot of structure (especially involving references) that I don't want to break.


Research samples are frequently non-representative of the general population of interest; they tend to be sicker [@doi:10.1086/512821], more male [@doi:10.1016/j.neubiorev.2010.07.002], and more European in ancestry [@doi:10.1371/journal.pbio.1001661]. One well-known consequence of these biases in genomics is that penetrance is consistently lower in the general population than case-control data would imply, as reviewed in @doi:10.1086/512821. Moreover, genetic associations that hold in one population may not hold in other populations with different patterns of linkage disequilibrium [even when population stratification is explicitly controlled for; @doi:10.1038/nrg2813]. As a result, many genomic findings are of limited value for people of non-European ancestry [@doi:10.1371/journal.pbio.1001661]. Methods have been developed for mitigating some of these problems in genomic studies [@doi:10.1086/512821; @doi:10.1038/nrg2813], but it is not clear how easily they can be adapted for deep models that are designed specifically to extract subtle effects from high-dimensional data. For example, differences in the equipment that tended to be used for cases versus controls have led to spurious genetic findings [e.g. @10.1126/science.333.6041.404-a]; in some contexts, it may not be possible to correct for all of these differences to the degree that a deep network is unable to exploit them. The availability of such nominally-irrelevant but highly-predictive features, or of features whose value would ordinarily be known only after the machine learning task is complete, is called "leakage" [@doi:10.1145/2382577.2382579]. When leakage is severe, our models may say more about the way the data were collected than about anything of scientific or predictive value, with potentially disastrous policy consequences [@doi:10.1111/j.1740-9713.2016.00960.x]. Kaufman et al. [@doi:10.1145/2382577.2382579] discuss some ways in which leakage and its effects can be controlled, but the problem is far from solved.
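As a side note for reviewers of this draft: the equipment-leakage failure mode above is easy to reproduce in a toy simulation. Everything below is invented for illustration (the features, the noise levels, and a simple nearest-centroid classifier standing in for a deep model): a "batch" feature records which machine a sample was run on, and because cases and controls were hypothetically run on different machines, it almost perfectly encodes the label.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 800, 800
n = n_train + n_test

# Hypothetical cohort: a weak biological signal plus a "batch" feature.
y = rng.integers(0, 2, n)
signal = y + rng.normal(0.0, 3.0, n)   # weak true effect, heavy noise
batch = y + rng.normal(0.0, 0.1, n)    # nominally irrelevant, highly predictive

def nearest_centroid_accuracy(X, y, n_train):
    """Standardize features, fit class centroids on the training split,
    and report accuracy on the held-out split."""
    mu, sd = X[:n_train].mean(axis=0), X[:n_train].std(axis=0)
    Z = (X - mu) / sd
    c0 = Z[:n_train][y[:n_train] == 0].mean(axis=0)
    c1 = Z[:n_train][y[:n_train] == 1].mean(axis=0)
    Z_test, y_test = Z[n_train:], y[n_train:]
    pred = (np.linalg.norm(Z_test - c1, axis=1)
            < np.linalg.norm(Z_test - c0, axis=1)).astype(int)
    return float((pred == y_test).mean())

acc_leaky = nearest_centroid_accuracy(np.column_stack([signal, batch]), y, n_train)
acc_clean = nearest_centroid_accuracy(signal[:, None], y, n_train)
print(f"with leaky batch feature: {acc_leaky:.2f}")  # near-perfect
print(f"signal only:              {acc_clean:.2f}")  # barely better than chance
```

Dropping the leaky feature collapses performance to roughly the strength of the real signal, which is one quick diagnostic when a model looks suspiciously accurate.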

@article{doi:10.1086/512821, title={Overcoming the winner’s curse: estimating penetrance parameters from case-control data}, author={Z{\"o}llner, Sebastian and Pritchard, Jonathan K}, journal={The American Journal of Human Genetics}, volume={80}, number={4}, pages={605--615}, year={2007}, publisher={Elsevier} }

@article{doi:10.1038/nrg2813, title={New approaches to population stratification in genome-wide association studies}, author={Price, Alkes L and Zaitlen, Noah A and Reich, David and Patterson, Nick}, journal={Nature Reviews Genetics}, volume={11}, number={7}, pages={459--463}, year={2010}, publisher={Nature Publishing Group} }

@misc{10.1126/science.333.6041.404-a, title={Retraction}, author={Sebastiani, Paola and Solovieff, Nadia and Puca, Annibale and Hartley, Stephen W and Melista, Efthymia and Andersen, Stacy and Dworkis, Daniel A and Wilk, Jemma B and Myers, Richard H and Steinberg, Martin H and others}, year={2011}, publisher={American Association for the Advancement of Science} }

@article{doi:10.1145/2382577.2382579, title={Leakage in data mining: Formulation, detection, and avoidance}, author={Kaufman, Shachar and Rosset, Saharon and Perlich, Claudia and Stitelman, Ori}, journal={ACM Transactions on Knowledge Discovery from Data (TKDD)}, volume={6}, number={4}, pages={15}, year={2012}, publisher={ACM} }

@article{doi:10.1016/j.neubiorev.2010.07.002, title={Sex bias in neuroscience and biomedical research}, author={Beery, Annaliese K and Zucker, Irving}, journal={Neuroscience \& Biobehavioral Reviews}, volume={35}, number={3}, pages={565--572}, year={2011}, publisher={Elsevier} }

@article{doi:10.1371/journal.pbio.1001661, title={Generalization and dilution of association results from European GWAS in populations of non-European ancestry: the PAGE study}, author={Carlson, Christopher S and Matise, Tara C and North, Kari E and Haiman, Christopher A and Fesinmeyer, Megan D and Buyske, Steven and Schumacher, Fredrick R and Peters, Ulrike and Franceschini, Nora and Ritchie, Marylyn D and others}, journal={PLoS Biol}, volume={11}, number={9}, pages={e1001661}, year={2013}, publisher={Public Library of Science} }

@article{doi:10.1111/j.1740-9713.2016.00960.x, title={To predict and serve?}, author={Lum, Kristian and Isaac, William}, journal={Significance}, volume={13}, number={5}, pages={14--19}, year={2016}, publisher={Wiley Online Library} }

davharris commented 7 years ago

uh, with apologies to the users named @article, @doi, and @misc for pinging them.

aaronsheldon commented 7 years ago

...and an example of automated discrimination in practice: "Automated Inference on Criminality using Face Images". It may be worth a quick sentence on the problems with the cited research?

cgreene commented 7 years ago

@aaronsheldon : can you file another PR to add a sentence?

davharris commented 7 years ago

I thought about this paper, but I was disinclined to reward those folks with a citation.

akundaje commented 7 years ago

I would agree with not citing this paper.

On May 9, 2017 11:51 AM, "David J. Harris" notifications@github.com wrote:

I thought about this paper, but I was disinclined to reward those folks with a citation.


cgreene commented 7 years ago

I think that the mention should clearly indicate the problems with the work. Including it gives us the chance to take a strong stand. The role of a citation is to say that the work exists. A critical citation won't register as such with citation counters, but anyone who reads our paper will see how the work is viewed.

Edit: This comment was constructed with feedback from @ctb and @strasser

davharris commented 7 years ago

I'd still rather cite one of the articles criticizing the analysis (e.g. 1 2 3) than the analysis itself.

cgreene commented 7 years ago

Article 3 in the list is really nicely done. This in particular is key:

[screenshot of the key passage from article 3, 2017-05-09]

I'm strongly supportive of citing that.