gregpoore / tcga_rebuttal

Re-analysis of data provided by Gihawi et al. 2023 bioRxiv
25 stars 7 forks source link

Full dataset for 32 TCGA cancers from Oncogene 2024 new pipeline #3

Open hermidalc opened 5 months ago

hermidalc commented 5 months ago

Dear Greg - thanks for your continued work on this and for this follow up paper with updated pipeline. I'm the lead author of Hermida et al. Nat Commun 2022 which used your original Poore et al. Nature 2020 dataset, specifically the "Kraken-TCGA-Voom-SNM-Plate-Center-Filtering-Data.csv” and “Metadata-TCGA-Kraken-17625-Samples.csv” from ftp://ftp.microbio.me/pub/cancer_microbiome_analysis.

Do you have the microbial abundances from this updated Oncogene 2024 pipeline for all 32 TCGA cancers in a format similar to the two files mentioned above?

gregpoore commented 5 months ago

Hi @hermidalc, apologies for the delayed reply. I'm happy to help provide you with files but first have some comments:

To summarize, the tables we currently recommend would be (i) raw or ConQuR-corrected, (ii) separate for WGS and RNA-Seq, (iii) comprise human-associated taxa with high coverage, and (iv) derive from direct alignments against RS210-clean. Table S13 of the paper provides the raw (WGS and RNA-Seq) abundances of RS210-clean taxa with ≥50% aggregate coverage (note: Table S6 contains the taxonomic names for the genome IDs in Table S13). However, I can find and share the ConQuR-corrected WGS table here if that would be helpful. Is that what you would prefer?

Edit: There are creative ways to get ConQuR to correct for >1 batch variable at a time, such as concatenating multiple factors into a single vector in R, but we did not publish this data (in part because of limited time to reassure ourselves the correction acted appropriately). In theory, doing this would help create the kind of single abundance table you're asking about. If this is something you want, I can point you in the right direction.

hermidalc commented 4 months ago

Thanks very much Greg for the detailed breakdown and explanation. After giving it some thought I believe the last option would be ideal, a joint ConQuR abundance table for all the samples. I would definitely appreciate any advice or help to create this matrix by as you said batch correcting for >1 factor with ConQuR

gregpoore commented 4 months ago

Hi @hermidalc, sure, I'm happy to help with that. (Apologies again for the delay.) The basic idea is to concatenate the individual batch variables into one string per sample, followed by factorizing the options. ConQuR can then accept those factors as pseudo-batches to correct (assuming sufficient representation among the pseudo-batches); note that your reference batch will have to reflect one of the concatenated batch names in the ConQuR function. For example, to simultaneously correct over sequencing center, sequencing platform, and experimental strategy in TCGA:

## Note 1: To simplify the batch names, I've used gsub() to remove punctuation and spaces
## Note 2: I've concatenated the batch variables using "__"
## Note 3: Assume that metadata and count data are held in _metadataSamplesAllQC_ and _countDataQC_ respectively

metadataSamplesAllQC2 <- metadataSamplesAllQC
metadataSamplesAllQC2$batchVar <- factor(paste0(gsub('([[:punct:]])|\\s+','',metadataSamplesAllQC2$data_submitting_center_label),
                                         "__",
                                         gsub('([[:punct:]])|\\s+','',metadataSamplesAllQC2$platform),
                                         "__",
                                         gsub('([[:punct:]])|\\s+','',metadataSamplesAllQC2$experimental_strategy)))

batchid_TCGA_All <- metadataSamplesAllQC2[,"batchVar"]
covar_TCGA_All <- metadataSamplesAllQC2[,"sample_type",drop = FALSE]
covar_TCGA_All$sample_type <- factor(covar_TCGA_All$sample_type)
countDataQC_CQ <- as.data.frame(ConQuR(tax_tab = countDataQC,
                                      batchid=batchid_TCGA_All,
                                      covariates=covar_TCGA_All, 
                                      num_core = 32,
                                      batch_ref = "HarvardMedicalSchool__IlluminaHiSeq__WGS"))

The above could take a long time to run across thousands of samples without multiplexing across cores (n=32 above), and the vanilla ConQuR version proposed its authors (https://github.com/wdl2459/ConQuR) did not implement the doParallel multithreading properly (at least when applying it in Fall 2023). After looking through my notes, I think there was a branch of their ConQuR repo that fixed the multithreading (https://github.com/ivartb/ConQuR_par) and would recommend simple testing (eg, using tic() and toc() to measure run time of their tutorial) to verify that the multithreading works as-expected before applying it to all of TCGA.

I hope that helps! Let me know if you have any other questions

kennyyeo13 commented 3 months ago

Hi Greg,

I was looking through the RS210 data (table S13) from your paper. There were 1189 OGUs from Table S13, however, taxonomic information was only found 823 of these in TableS6. Just wanted to check, how do I get the taxonomic information for the other 366 OGU? Hope you can help

Thanks, Kenny

gregpoore commented 3 months ago

Hi @kennyyeo13, happy to help and apologies for any confusion. I'm attaching here the full list of OGUs and their respective taxonomic lineages for RS210 (total # is 29,648 OGUs). I have limited time the rest of this upcoming week but hope this addresses your question. RS210_lineages.csv

hermidalc commented 3 months ago

Hi @hermidalc, apologies for the delayed reply. I'm happy to help provide you with files but first have some comments:

* Newer, microbiome-specific batch-correction tools have been introduced since the 2020 paper, including [ConQuR](https://www.nature.com/articles/s41467-022-33071-9), which we recommended using in the Oncogene 2024 work for addressing the TCGA batch effects. Technical details aside, a key advantage of ConQuR is that it provides corrected outputs in counts, permitting usage of downstream microbiome analyses (rather than the log-adjusted values from Voom-SNM). However, its main limitation is the usage of only one batch variable at a time, so we had to subset TCGA data to Illumina HiSeq-only and either WGS or RNA-Seq to use it. This means there are separate ConQuR-corrected tables for WGS and RNA-Seq data, and the total number of samples is less due to the Illumina HiSeq subsetting.

* The additional rounds of stringent host depletion (GRCh38, T2T, HPRC, GENCODE) most affected RNA-Seq samples from UNC, which were sequenced less deeply than other samples in TCGA. This was complicated by how many RNA-Seq samples were generated with 50-bp sequencing, which can make host filtration overly sensitive (ie, removal of putative microbial reads with any homology to human regions). Since we aimed to be very conservative in the manuscript (ie, preferring high specificity of microbial read calls despite lower sensitivity), these processes cause the UNC samples to have poor quality, and we did not perform further analysis of them. Related to ConQuR, since the only other RNA-Seq center beyond UNC was Canada's Michael Smith Genome Sciences Centre (CMS), the CMS RNA-Seq data was not batch corrected.

* As another conservative measure, we focused only on human-associated species, and further restricted to those with high aggregate coverages (≥50%). These considerations reduce the read counts but lend greater confidence in their biological plausibility.

* As another conservative measure, we implemented direct alignments of host-depleted data to cleaned microbial genomes in RefSeq release 210 (ie, "RS210-clean"). These alignments should be more specific than Kraken's k-mer-based approach, although they may be less sensitive.

To summarize, the tables we currently recommend would be (i) raw or ConQuR-corrected, (ii) separate for WGS and RNA-Seq, (iii) comprise human-associated taxa with high coverage, and (iv) derive from direct alignments against RS210-clean. Table S13 of the paper provides the raw (WGS and RNA-Seq) abundances of RS210-clean taxa with ≥50% aggregate coverage (note: Table S6 contains the taxonomic names for the genome IDs in Table S13). However, I can find and share the ConQuR-corrected WGS table here if that would be helpful. Is that what you would prefer?

Dear Greg - sorry to bother again, could you provide the ConQuR-corrected WGS data matrix you referred to above? Thanks very much!

gregpoore commented 3 months ago

@hermidalc Apologies for the delayed reply. I'm happy to provide the ConQuR-corrected WGS data matrix. I believe the following is what you're looking for?

ConQuR-corrected WGS data matrix: rs210PanFinal5050_Nonzero_HiSeq_WGS_CQ_15Oct23.csv

Please let me know if not. Additionally, code of how this correction was performed is here.

hermidalc commented 3 months ago

@hermidalc Apologies for the delayed reply. I'm happy to provide the ConQuR-corrected WGS data matrix. I believe the following is what you're looking for?

ConQuR-corrected WGS data matrix: rs210PanFinal5050_Nonzero_HiSeq_WGS_CQ_15Oct23.csv

Please let me know if not. Additionally, code of how this correction was performed is here.

Thanks Greg, sorry if I'm missing it somewhere, but where do I find how the sampleids map to GDC UUIDs or your older "s" sample identifiers?

gregpoore commented 3 months ago

@hermidalc Quick reply for now: you should be able to use Table S14 of the Sepich-Poore et al. 2024 Oncogene manuscript (it will be the WGS subset of that metadata)

hermidalc commented 3 months ago

@hermidalc Quick reply for now: you should be able to use Table S14 of the Sepich-Poore et al. 2024 Oncogene manuscript (it will be the WGS subset of that metadata)

Sorry I just see here that the raw counts are likely Table S13 in the Oncogene supplemental? Definitely looks right as it's not the same as the CQ corrected counts.

gregpoore commented 3 months ago

Correct, Table S13 of the Oncogene supplement contains the raw data. Just note those are the human-associated hits that had >= 50% cumulative genome coverage

hermidalc commented 3 months ago

Correct, Table S13 of the Oncogene supplement contains the raw data. Just note those are the human-associated hits that had >= 50% cumulative genome coverage

Sorry for all the bother, as you can imagine we are very closely examining, comparing, and vetting data sources. We built our own pipeline (https://github.com/hermidalc/tcga-wgs-kraken-microbial-quant) that is well suited to the specific needs of our Nat Commun study, the biggest difference in our pipeline is using Kraken2 + Bracken instead of KrakenUniq, since we are using the output data to do microbiome analysis between samples/groups and want more accurate microbial abundance estimates in each sample. I don't know if you have any feedback on that given all your experience.

I compared our pipeline WGS raw counts against Ge et al. bioRxiv 2024 (the Salzberg lab continuation of Gihawi et al. https://github.com/yge15/TCGA_Microbial_Content) and we get similar data, coverage, and sparsity when looking at the overlapping samples at both species and genus levels. I would like to do a comparison against your Oncogene 2024 paper raw data. I think it would be your KrakenUniq RS210 raw counts that went through the additional host filtering steps? Is that available somewhere?

gregpoore commented 3 months ago

Hi @hermidalc, some quick comments:

For the discovery of pathogens in human patients, discussed in the next section, a read count threshold of 10 and unique k-mer count threshold of 1000 eliminated many background identifications while preserving all true positives, which were discovered from as few as 15 reads.

Note that there are issues with the above, as the cutoff vary directly with sequencing depth of a sample (see Fig. 4 in that paper). Also, for reasons unknown to me, neither Gihawi et al. or Ge et al. apply this filter—which is what made KrakenUniq useful—to their data, or at least they make no mention of which parameters they chose. The closest it is mentioned from what I've found is when they state the following on their Github:

  1. 16 August 2023: we have updated some of the values in Table S8 and removed more false positives; the changes are very minor.
hermidalc commented 3 months ago

Hi @hermidalc, some quick comments:

* I understand the desire to vet various data sources and hope we've been helpful making ours public to date.

* Per Kraken2 vs KrakenUniq: I'll be honest and say that I'm not quite sure why one would use k-mer based methods when the amount of data left after host depletion is palatable for nearly any direct genome aligner. The main advantage of using the Kraken approach was speed with some cons in sensitivity/specificity. But if speed is no longer needed, why sacrifice valuable information like direct genomic coverage (breath and depth), SNPs, etc.? For this reason, I prefer direct genome alignments. But to more directly answer your question: KrakenUniq aggregates meta-metrics (eg, unique number of k-mers found) to assist with filtering out "likely contaminants," or as quoted in its [paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1568-0):

Thank you for the detailed and very thoughtful responses! FYI they added the KrakenUniq algorithm to Kraken2 some time ago and it can be run alongside Kraken2, i.e. "Kraken2Uniq", with the option --report-minimizer-data. This will add two minimizer data result columns to the regular Kraken2 report, which is useful, as you said, for distinguishing real/spurious signals. See https://github.com/DerrickWood/kraken2/wiki/Manual#distinct-minimizer-count-information and for more background https://www.nature.com/articles/s41596-022-00738-y

You commentary on why use k-mer based approaches over direct metagenomic alignments when there aren't many reads to align makes a good point, definitely something worth looking into further.