gregpoore / tcga_rebuttal

Re-analysis of data provided by Gihawi et al. 2023 bioRxiv

Full dataset for 32 TCGA cancers from Oncogene 2024 new pipeline #3

Open hermidalc opened 1 week ago

hermidalc commented 1 week ago

Dear Greg - thanks for your continued work on this and for this follow-up paper with the updated pipeline. I'm the lead author of Hermida et al. Nat Commun 2022, which used your original Poore et al. Nature 2020 dataset, specifically "Kraken-TCGA-Voom-SNM-Plate-Center-Filtering-Data.csv" and "Metadata-TCGA-Kraken-17625-Samples.csv" from ftp://ftp.microbio.me/pub/cancer_microbiome_analysis.

Do you have the microbial abundances from this updated Oncogene 2024 pipeline for all 32 TCGA cancers in a format similar to the two files mentioned above?

gregpoore commented 1 week ago

Hi @hermidalc, apologies for the delayed reply. I'm happy to help provide you with files but first have some comments:

To summarize, the tables we currently recommend would (i) be raw or ConQuR-corrected, (ii) be separate for WGS and RNA-Seq, (iii) comprise human-associated taxa with high coverage, and (iv) derive from direct alignments against RS210-clean. Table S13 of the paper provides the raw (WGS and RNA-Seq) abundances of RS210-clean taxa with ≥50% aggregate coverage (note: Table S6 contains the taxonomic names for the genome IDs in Table S13). However, I can find and share the ConQuR-corrected WGS table here if that would be helpful. Is that what you would prefer?

Edit: There are creative ways to get ConQuR to correct for >1 batch variable at a time, such as concatenating multiple factors into a single vector in R, but we did not publish this data (in part because we had limited time to reassure ourselves that the correction acted appropriately). In theory, doing this would help create the kind of single abundance table you're asking about. If this is something you want, I can point you in the right direction.
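
To make the "concatenate multiple factors into a single vector" idea concrete, here is a minimal sketch in plain Python rather than R. It only builds the composite batch label and counts how many samples fall into each combined level; it does not call ConQuR, and it is not the pipeline used in the paper. The metadata fields (`sample_id`, `center`, `platform`), their values, and the helper name `combine_batch_factors` are all hypothetical.

```python
from collections import Counter


def combine_batch_factors(samples, factors, sep="|"):
    """Return one composite batch label per sample, e.g. 'A|HiSeq'.

    `samples` is a list of dicts (one per sample) and `factors` is the
    list of metadata keys to concatenate. Both are illustrative only.
    """
    labels = []
    for sample in samples:
        # Refuse to build a label from incomplete metadata.
        missing = [f for f in factors if sample.get(f) in (None, "")]
        if missing:
            raise ValueError(
                f"sample {sample.get('sample_id')} is missing {missing}"
            )
        labels.append(sep.join(str(sample[f]) for f in factors))
    return labels


# Hypothetical metadata; real field names and values will differ.
samples = [
    {"sample_id": "s1", "center": "A", "platform": "HiSeq"},
    {"sample_id": "s2", "center": "A", "platform": "HiSeq"},
    {"sample_id": "s3", "center": "B", "platform": "HiSeq"},
    {"sample_id": "s4", "center": "B", "platform": "GA"},
]

batch = combine_batch_factors(samples, ["center", "platform"])

# Count samples per combined level to spot very small batches.
level_sizes = Counter(batch)
print(batch)
print(level_sizes)
```

The combined vector would then serve as the single batch variable for the correction step. Concatenating factors multiplies the number of levels, so some combined batches may contain only a handful of samples; checking the level sizes first is one way to see whether the correction has enough data per batch to behave sensibly.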

hermidalc commented 4 days ago

Thanks very much, Greg, for the detailed breakdown and explanation. After giving it some thought, I believe the last option would be ideal: a joint ConQuR abundance table for all the samples. I would definitely appreciate any advice or help in creating this matrix by, as you said, batch correcting for >1 factor with ConQuR.