gregpoore opened this issue 11 months ago
Hi @gregpoore, thanks for considering my analysis and apologies if I have misrepresented any of the methods in the original paper. I will go through your comments carefully and update the blog as necessary.
Hi,
Really nice to see these types of transparent analyses and discussions, and I love the GitHub format for this. Just wanted to share some (hopefully useful!) comments:

1. This classic reference feels very relevant here. Following from your logistic model, I think it might be a good idea to downsample the full dataset using propensity score matching or similar approaches (e.g., using something like MatchIt) to re-balance and remove the association between the batch-correction covariates and sample type (or even cancer type!), and then repeat the analysis.
2. "Batch/centre/hospital" effects in clinical research are never really about the literal "hospital". In other words, within a given hospital/geographical location, what matters is the distinct clinical services, wards, teams, or even the surgical theatres used. "Hospital/source_site" is mainly an aggregate of all that plus population effects, which are likely not relevant here. We really need specific negative/blank controls for all locations where collection happens. The closest thing to that are the matched "normal" samples, but even then it would be important to track down the exact protocols used for their collection and evaluate their suitability on a per-centre/tissue-type basis. Overall, I also don't think the TCGA dataset has the correct design to explore this question.
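To sketch what I mean by the re-balancing in point 1: the suggestion above is MatchIt in R, but the idea translates to a rough Python analogue using greedy 1:1 nearest-neighbor matching on a logistic-regression propensity score. Column names here (`sample_type`, `center`, `platform`) are illustrative, not the repository's actual metadata fields.

```python
# Sketch of propensity-score matching to re-balance a dataset so the
# treatment/outcome variable is no longer associated with batch covariates.
# This is a simplified analogue of what MatchIt does in R, for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_match(meta, treatment_col, covariate_cols, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    `treatment_col` must be binary (0/1); `covariate_cols` are the batch
    covariates whose association with treatment we want to break.
    """
    # Propensity score: P(treatment | batch covariates)
    X = pd.get_dummies(meta[covariate_cols], drop_first=True).to_numpy(dtype=float)
    y = meta[treatment_col].to_numpy()
    ps = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

    treated = np.where(y == 1)[0]
    control = np.where(y == 0)[0]
    available = set(control)
    pairs = []
    for t in treated:
        if not available:
            break
        cands = np.array(sorted(available))
        j = cands[np.argmin(np.abs(ps[cands] - ps[t]))]
        # Only accept matches within the caliper (max allowed score gap).
        if abs(ps[j] - ps[t]) <= caliper:
            pairs.append((t, j))
            available.remove(j)
    keep = [i for pair in pairs for i in pair]
    return meta.iloc[keep]
```

The matched subset has equal numbers of treated and control samples with similar propensity scores, so a downstream model can no longer exploit the covariate imbalance.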
Hi, thank you for putting all this together. I have an additional comment. Like you, I was concerned that if disease type had been used in the SNM normalization, that might create false signals. I looked into their code for the voom-SNM normalization; specifically, how they created their design matrices on their GitHub.
I think 'disease_type_consol' means "disease type consolidated". I searched their metadata for examples of 'disease_type_consol'. I didn't check every metadata file, but I found two with 'disease_type_consol' in the TCGA/Kraken folder. When I checked these labels against 'disease_type', they lined up almost exactly.
This suggests to me that disease type actually was used in the normalization process, although it may not have been intentional. The 'disease_type_consol' column does not exist in every metadata file, so it is also unclear how frequently this error may have occurred.
Hi @gtonkinhill, thanks for the really detailed analyses. This clearly took a lot of time to put together, and I genuinely appreciate the public release on GitHub. I have two concerns, followed by some questions.
Concerns
1. Your simulations use cancer type (i.e., `disease_type` or `investigation` in TCGA metadata) as the biological variable during SNM to conclude that "it will artificially increase its accuracy and incorrectly identify a signal associated with the fake variable of interest." However, the paper in question (Poore and Kopylova et al.) did not use cancer type as the biological variable, nor did it use cancer type elsewhere in the normalization. Rather, it used "sample type" (e.g., "Primary Tumor", "Blood Derived Normal") as the biological variable. This was stated in the paper's Methods and in the released GitHub code (here, in cells 10-11). As an aside, the "hospital" source of variation (i.e., `tissue_source_site_label`) was also included as a technical factor to be removed in the original normalization, and it contributed much less noise than sequencing center or sequencing platform did (see Fig 1e in the original paper). Additionally, "sample type" was approximately evenly distributed across the sequencing centers (i.e., virtually all centers that sequenced tumors also sequenced their adjacent normal tissues). In general, I like how the simulations show necessary caution about information leak during pre-processing stages, including batch correction, but they are distinct from how the TCGA data were analyzed.
2. If it helps, the NATs in TCGA or Nejman et al. are not true normals and likely have distinct microbial compositions from true normals (perhaps due to field cancerization, though this remains unclear to date). Given the tumor~NAT similarity, there are potentially other ways to apply SCRuB, but its application in this way is likely removing true signal.
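To make the design-matrix distinction in concern 1 concrete, here is a simplified sketch of the structure being described: `sample_type` as the biological variable and center/platform/source site as technical variables to remove, with cancer type deliberately absent. This mirrors the shape of a voom-SNM setup, not the paper's exact code, and the column names are approximations of the TCGA metadata fields.

```python
# Simplified sketch of the voom-SNM design-matrix structure discussed above:
# the biological variable is sample type, NOT cancer/disease type, and the
# adjustment (technical) variables are center, platform, and source site.
import pandas as pd

def build_design_matrices(meta: pd.DataFrame):
    # Biological variable: "Primary Tumor", "Blood Derived Normal", etc.
    bio = pd.get_dummies(meta["sample_type"], drop_first=True, prefix="sample_type")
    bio.insert(0, "intercept", 1.0)
    # Technical variables to remove; disease type is deliberately excluded.
    adj = pd.get_dummies(
        meta[["data_submitting_center_label", "platform", "tissue_source_site_label"]],
        drop_first=True,
    )
    adj.insert(0, "intercept", 1.0)
    return bio, adj
```

If disease type (or a near-duplicate like `disease_type_consol`) appeared in either matrix, the normalization would be supervised on the very label being predicted downstream, which is the information-leak scenario the simulations warn about.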
Questions that I'm interested in:
1. In your figure that shows the hospital distribution for HMS BLCA, BRCA, and HNSC, you correctly note that many of the hospitals do not overlap between those 3 cancer types and suggest "other potential batch effects could be driving the signal including the hospital the samples originated from." However, if "hospital" (i.e., `tissue_source_site_label`) is contributing such a large effect, why do samples within each cancer type from different hospitals look so similar to each other? Or, by way of example, why do the BRCA samples cluster so tightly to each other in the MDS plot if they came from 6 distinct hospitals spread throughout the country (e.g., Duke, Mayo, MSKCC), if hospital is the primary confounder? Shouldn't hospital-specific variation within cancer types make it harder, not easier, to distinguish them?
2. In my limited use of `removeBatchEffect()` as an unsupervised normalizer, I have often found that it removes biological signal in addition to the batch effect (though Wang & Lê Cao 2020 may suggest otherwise). It is also intended for RNA-Seq data rather than microbiome data. Is it worth trying to apply a microbiome-specific batch correction method (e.g., ConQuR or equivalent)?
3. For the MDS plots (or equivalent), is it possible to include quantitative effect size estimates of the clustering?
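On the last question, one way to attach a number to the clustering would be the mean silhouette width of samples grouped by label on the ordination coordinates (a PERMANOVA R² would be a common alternative). The sketch below uses synthetic data purely for illustration; it is not derived from the TCGA tables.

```python
# One possible quantitative summary of clustering in an MDS plot: the mean
# silhouette width of samples grouped by label on the 2-D ordination
# coordinates. Synthetic two-group data stand in for real samples here.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two synthetic groups with a modest mean shift in a 20-feature space.
X = np.vstack([rng.normal(0.0, 1.0, (30, 20)),
               rng.normal(1.5, 1.0, (30, 20))])
labels = np.array([0] * 30 + [1] * 30)

# Classical-style MDS embedding of the Euclidean dissimilarities.
coords = MDS(n_components=2, random_state=0).fit_transform(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated groups.
score = silhouette_score(coords, labels)
print(round(score, 3))
```

Reporting a statistic like this alongside each ordination plot would let readers compare the strength of the cancer-type clustering against the hospital/center clustering directly, rather than judging separation by eye.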