gregpoore / tcga_rebuttal

Re-analysis of data provided by Gihawi et al. 2023 bioRxiv

Environmental contaminants #1

Open mw55309 opened 1 year ago

mw55309 commented 1 year ago

Thanks for posting this public rebuttal! Good science is open science.

There's been some suggestion that removing environmental contaminants, as done in the original paper, removes the cancer sub-type signal. See attached and the tweet below.

Whilst that analysis is sadly not open, it would be good to respond.

[attached image: anon]

mw55309 commented 1 year ago

Tweet is here

https://twitter.com/StevenSalzberg1/status/1686710102458335233?t=V2p-rEmdCWUpf1H2qBSWSw&s=19

gregpoore commented 1 year ago

Thanks for the question. I'm happy to respond, clarify a few things, and provide some reassuring data:

  1. When we worked on the original paper, there was no 'gold standard' list of microbes derived from tumors, and TCGA lacked experimental contamination controls. This forced us to use tools like decontam (Davis et al. 2018 Microbiome) and 'black lists' of genera to infer putative contaminants. However, these approaches have limitations, producing both false negatives and false positives, which led us to state in the original paper: "We stress that these in silico decontamination methods are not substitutes for implementing gold-standard microbiology practices on cancer samples, including sterile processing, sterile-certified reagents, negative blanks of reagents processed from start to finish..." For reference, these issues have been well described by others (e.g., Austin et al. 2023 Nature Biotech).
  2. Fortunately, a study from the Weizmann Institute of Science (WIS) that implemented those "gold-standard microbiology practices on cancer samples" appeared in Science just a few months after our original paper (Nejman et al. 2020). That list of decontaminated bacteria was expanded during our collaboration with them on fungi (Narunsky-Haziza et al. 2022 Cell), collectively providing a 'gold standard' list of bacteria and fungi found in tumors. I note that between these two studies, >1100 experimental contamination controls were employed in parallel alongside the tumors.
  3. With this background in mind, our re-analyses of TCGA in the bioRxiv rebuttal and Narunsky-Haziza et al. 2022 Cell took the more conservative approach of focusing on WIS-overlapping taxa, followed by repeating all analyses. In other words, we intersected TCGA microbial features with highly-decontaminated taxa from an independent cohort of WIS tumors. Moreover, these taxa have much better supporting data than the contaminant vs. non-contaminant calls listed in Table S6 from our original paper. (Important note: Table S6 of the original paper was satisfactory for March 2020, but there are better approaches now).
  4. Given the above, I repeated the same machine learning analyses in this GitHub repo after subsetting the Gihawi et al. raw data to only the WIS-overlapping genera (n=149 genera). This is saved in the new R script, tcga_gihawhi_rebuttal_WIS_subset_3Aug23.R, and the results are approximately the same:

[image] (A) After subsetting to WIS-overlapping genera (n=149 genera), we evaluated whether multiclass machine learning could discriminate between cancer types using the raw data from all HMS PT samples. Gradient boosting machines were applied with 10-fold cross-validation such that every sample was left out once, and their predictions were used to generate a confusion matrix. The mean balanced accuracy was 93.62% in comparison to the no information rate (NIR) of 54.84% (p<2.2e-16). (B) After subsetting to WIS-overlapping genera (n=149 genera), 10-fold cross-validation using gradient boosting machines was applied on HMS BDN samples. The balanced accuracy was 88.82% in comparison to the NIR of 80% (p=4.4e-5).
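
As a rough illustration of how the two quantities in the caption relate, the sketch below computes a no information rate and a mean balanced accuracy from a confusion matrix. It assumes the common definitions (NIR as the proportion of the largest true class, per-class balanced accuracy as the average of sensitivity and specificity in a one-vs-rest sense); the thread does not state which function produced the reported numbers, and the matrix here is made up for illustration, not taken from the repository.

```python
from typing import List

Matrix = List[List[int]]  # cm[i][j] = samples with true class i predicted as class j


def no_information_rate(cm: Matrix) -> float:
    """Accuracy obtained by always predicting the most frequent true class."""
    total = sum(sum(row) for row in cm)
    return max(sum(row) for row in cm) / total


def mean_balanced_accuracy(cm: Matrix) -> float:
    """Average over classes of (sensitivity + specificity) / 2, one-vs-rest.

    Assumes every class has at least one true sample and at least one
    sample belonging to some other class.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # true k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp    # true other, predicted k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        scores.append((sensitivity + specificity) / 2)
    return sum(scores) / n


# Hypothetical 3-class confusion matrix (illustrative values only).
cm = [
    [50, 3, 2],
    [4, 30, 1],
    [2, 1, 7],
]
nir = no_information_rate(cm)
mba = mean_balanced_accuracy(cm)
print(f"NIR = {nir:.4f}, mean balanced accuracy = {mba:.4f}")
```

With imbalanced classes the NIR can be high on its own (80% in panel B), which is why the comparison against it matters more than the raw accuracy figure.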

travisgibson commented 1 year ago

Thanks for making this all open source and posting on GitHub!! Might I suggest running a variable importance analysis after training your ML models?

For the 2-class model in "tcga_gihawhi_rebuttal_31July23.R", the top taxa are soil-associated or known to be hospital-acquired: [image]

For the 2-class model in "tcga_gihawhi_rebuttal_WIS_subset_3Aug23.R", the top taxa are soil-associated or could be hospital-acquired: [image]

For the second analysis, Rhizobium averages about 2 reads per sample. It would be cool to have some uncertainty quantification with such a low number of reads, or to see what a simple model like DESeq2 does to try to discriminate the classes.
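
One simple way to approach the uncertainty question is a percentile bootstrap on the difference in mean read counts between the two classes. The sketch below is only an illustration of that idea: the per-sample counts are hypothetical (the real values would come from the Gihawi et al. genus table), and the function name and parameters are made up here, not part of the repository.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_mean_diff_ci(
    a: List[int],
    b: List[int],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each class is resampled with replacement, independently of the other.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_a) - mean(resampled_b))
    diffs.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = int((1 - alpha / 2) * n_boot) - 1
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical Rhizobium read counts per sample in two classes (not real data).
class_a = [0, 1, 2, 3, 2, 4, 1, 0, 3, 5, 2, 2]
class_b = [0, 0, 1, 2, 1, 0, 3, 1, 0, 2, 1, 1]

observed = mean(class_a) - mean(class_b)
low, high = bootstrap_mean_diff_ci(class_a, class_b)
print(f"observed difference = {observed:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

An interval that spans or sits close to zero would suggest that a feature with so few reads is a fragile basis for discrimination; a count-based model such as a negative binomial GLM would be a more principled alternative.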

gregpoore commented 1 year ago

@travisgibson Happy to do this and provide some clarifications:

| Feature | Gain | Cover | Frequency |
|---|---|---|---|
| Prevotella | 0.2572682836 | 0.1117143829 | 0.075675676 |
| Staphylococcus | 0.1569590199 | 0.1529085973 | 0.124324324 |
| Rhizobium | 0.1264024071 | 0.0797954770 | 0.048648649 |
| Methylobacterium | 0.0981973793 | 0.0358081653 | 0.027027027 |
| Haemophilus | 0.0505654742 | 0.0307605472 | 0.021621622 |

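
To put the Gain column in context, the short sketch below accumulates it over the five rows shown. It assumes Gain is normalized to sum to 1 across all features in the model, as in xgboost-style importance tables; the thread does not say which function produced these numbers, so treat that as an assumption.

```python
# Gain values copied from the five rows of the importance table above.
importance = [
    ("Prevotella", 0.2572682836),
    ("Staphylococcus", 0.1569590199),
    ("Rhizobium", 0.1264024071),
    ("Methylobacterium", 0.0981973793),
    ("Haemophilus", 0.0505654742),
]

cumulative = 0.0
for feature, gain in importance:
    cumulative += gain
    print(f"{feature:<18}{gain:.4f}  cumulative {cumulative:.4f}")

# Share of total gain carried by the listed features, if Gain sums to 1 overall.
top5_gain = sum(gain for _, gain in importance)
```

Under that assumption, the five listed genera account for roughly 69% of the model's total gain, with Rhizobium and Methylobacterium together contributing about 22%.
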
clozupone commented 1 year ago

It is interesting that, even using the overlap with the WIS dataset that had all of these experimental controls, Rhizobium, typically regarded as a soil microbe, is coming up. Certain species within the Rhizobium genus cause plant tumors, and of these Rhizobium radiobacter can also be found in human infections, including case reports in cancer patients (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=abf50af543d7357152c1e9c8da9a7097eeef881f). In this study, strains of R. radiobacter cultured out of human samples could not cause plant disease and were not found in environmental contaminant controls, which is suggestive of human adaptation. Is it possible to redo analyses here at the species level to see if the Rhizobium being identified is R. radiobacter? Similarly, a full Bradyrhizobium genome was assembled from the biopsy of a cancer patient who got colitis following a cord blood transplant in this paper (https://www.nejm.org/doi/10.1056/NEJMoa1211115). They found that this organism was highly related to the soil microbe B. japonicum, but was different and named it B. enterica, and again suggested that this isolate might be human adapted. Would it be possible to test if the Bradyrhizobium reads in this analysis are mapping closer to B. enterica than other Bradyrhizobium? This might shed light on whether the "soil bacteria" being identified here are actually these relatives that may be adapted to humans. That both Rhizobium and Bradyrhizobium closely interact with plant hosts to form symbiotic nodules, and that these relationships can "go awry" and form tumors in plants, makes it potentially interesting to explore mechanistically any potential pathway overlap between the mechanisms that these microbes exploit during tumorigenesis in plants and pathways of importance in human tumor formation. This may be hard to do, but I am just thinking of ways to dig a little more into mechanistic leads using sequence data.

gregpoore commented 1 year ago

@clozupone I really like your questions and suggestions. However, it's unfortunately not possible to answer them with the Gihawi et al. data, which were fixed at the genus level and for which the reads were not shared. The main goal of this repository, by re-analyzing their data, was to show that alternative bioinformatic pipelines and reduced feature sets still yield the conclusion that microbiomes are cancer type specific, even when limiting the analyses to 9 'well known' genera.

I think there are ways to get to the species/read level and do what you're suggesting/asking about. I'll reach out via email to discuss further.

mw55309 commented 1 year ago

> It is interesting that, even using the overlap with the WIS dataset that had all of these experimental controls, Rhizobium, typically regarded as a soil microbe, is coming up. Certain species within the Rhizobium genus cause plant tumors, and of these Rhizobium radiobacter can also be found in human infections, including case reports in cancer patients (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=abf50af543d7357152c1e9c8da9a7097eeef881f). In this study, strains of R. radiobacter cultured out of human samples could not cause plant disease and were not found in environmental contaminant controls, which is suggestive of human adaptation. Is it possible to redo analyses here at the species level to see if the Rhizobium being identified is R. radiobacter? Similarly, a full Bradyrhizobium genome was assembled from the biopsy of a cancer patient who got colitis following a cord blood transplant in this paper (https://www.nejm.org/doi/10.1056/NEJMoa1211115). They found that this organism was highly related to the soil microbe B. japonicum, but was different and named it B. enterica, and again suggested that this isolate might be human adapted. Would it be possible to test if the Bradyrhizobium reads in this analysis are mapping closer to B. enterica than other Bradyrhizobium? This might shed light on whether the "soil bacteria" being identified here are actually these relatives that may be adapted to humans. That both Rhizobium and Bradyrhizobium closely interact with plant hosts to form symbiotic nodules, and that these relationships can "go awry" and form tumors in plants, makes it potentially interesting to explore mechanistically any potential pathway overlap between the mechanisms that these microbes exploit during tumorigenesis in plants and pathways of importance in human tumor formation. This may be hard to do, but I am just thinking of ways to dig a little more into mechanistic leads using sequence data.

I appreciate the attempt, but neither of the studies quoted ruled out contamination.

gregpoore commented 1 year ago

@mw55309 I have no involvement in those papers and suggest that you reach out to the original authors if you have concerns about contamination. However, I kindly note that the following text in Bhatt et al. 2013 NEJM directly addresses this topic:

> Paired-end 76-bp or 101-bp massively parallel sequencing was performed at separate sequencing centers for each patient in order to control for possible contamination (see the Supplementary Appendix for a detailed description of the contamination analysis).

In their Supplementary Appendix, they have 2.5 pages (pp. 5-7) specifically detailing how they mitigated contamination. It thus seems difficult to conclude that they did not make good-faith attempts to rule it out, but I again encourage you to reach out to those authors if it remains a concern for you.