gtonkinhill / TCGA_analysis


Batch correction concerns and questions #2

Open gregpoore opened 11 months ago

gregpoore commented 11 months ago

Hi @gtonkinhill, thanks for the really detailed analyses. This looks like it took a lot of time to put together, and I genuinely appreciate the public release on GitHub. I have two concerns, followed by some questions.

Concerns

  1. The first section uses simulations with "cancer type" (i.e., disease_type or investigation in the TCGA metadata) as the biological variable during SNM to conclude that "it will artificially increase its accuracy and incorrectly identify a signal associated with the fake variable of interest." However, the paper in question (Poore and Kopylova et al.) did not use cancer type as the biological variable, nor did it use cancer type elsewhere in the normalization. Rather, it used "sample type" (e.g., "Primary Tumor", "Blood Derived Normal") as the biological variable. This was stated in the paper's Methods and in the released GitHub code (here in cells 10-11):

    [Quoted from the Methods] The Voom and SNM model matrices were equivalent and built using sample type as the target biological variable (n = 7; for example, primary tumour tissue) owing to expected biological differences between them, for which signal should be preserved during the SNM; conversely, the following were modelled as technical covariates to be mitigated during SNM: sequencing centre (n = 8), sequencing platform (n = 6), experimental strategy (n = 2), tissue source site (n = 191), and FFPE status (n = 2; ‘yes’ or ‘no’). It was not possible to model disease type as the target biological variable owing to complete confounding between certain types of cancer and sequencing centres (that is, some types of cancer were only sequenced at one TCGA site).
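The modelling setup described in this quote can be illustrated as two model matrices: one encoding the biological variable whose signal SNM preserves, and one encoding the technical covariates it removes. A minimal sketch in Python/pandas (the original pipeline used voom and SNM in R; the column names and values below are toy stand-ins, not the paper's exact metadata):

```python
import pandas as pd

# Toy metadata in the spirit of the TCGA fields quoted above
# (values and column names are illustrative only)
meta = pd.DataFrame({
    "sample_type": ["Primary Tumor", "Blood Derived Normal",
                    "Primary Tumor", "Solid Tissue Normal"],
    "data_submitting_center_label": ["UNC", "UNC", "Broad", "Broad"],
    "platform": ["Illumina HiSeq", "Illumina HiSeq",
                 "Illumina GA", "Illumina HiSeq"],
})

# Biological model matrix: the signal to PRESERVE during SNM
bio_mat = pd.get_dummies(meta["sample_type"], drop_first=False)

# Adjustment (technical) model matrix: the variation to REMOVE
adj_mat = pd.get_dummies(
    meta[["data_submitting_center_label", "platform"]], drop_first=True
)

print(bio_mat.shape, adj_mat.shape)
```

In the actual pipeline the biological matrix would carry all seven sample types, and the adjustment matrix all of the technical covariates quoted above (centre, platform, experimental strategy, tissue source site, FFPE status).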

As an aside, the "hospital" source of variation (i.e., tissue_source_site_label) was also included as a technical factor to be removed in the original normalization, and it contributed much less noise than sequencing center or sequencing platform did (see Fig 1e in the original paper). Additionally, "sample type" was approximately evenly distributed across the sequencing centers (i.e., virtually all centers that sequenced tumors also sequenced their adjacent normal tissues). In general, I like how the simulations urge necessary caution about information leakage during pre-processing stages, including batch correction, but they are distinct from how the TCGA data were actually analyzed.
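The pre-processing leakage the simulations warn about can be demonstrated with a deliberately simple analogue: selecting features against the labels on the full dataset before cross-validation. This sketch does not reproduce the voom/SNM pipeline; it is a generic illustration of how any supervised step fitted on all samples, including future test samples, manufactures signal from pure noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5000
X = rng.standard_normal((n, p))   # pure noise: no real class signal
y = rng.integers(0, 2, n)         # random "fake" labels

# LEAKY step: rank features by association with y using ALL samples
Xc = X - X.mean(axis=0)
score = np.abs(Xc.T @ (y - y.mean()))
top = np.argsort(score)[-20:]     # keep the 20 most "associated" features

# Leave-one-out CV with a nearest-centroid classifier on the leaked features
correct = 0
for i in range(n):
    train = np.arange(n) != i
    m0 = X[train & (y == 0)][:, top].mean(axis=0)
    m1 = X[train & (y == 1)][:, top].mean(axis=0)
    d0 = np.sum((X[i, top] - m0) ** 2)
    d1 = np.sum((X[i, top] - m1) ** 2)
    correct += int(d1 < d0) == y[i]

acc = correct / n
print(acc)  # far above the ~0.5 expected for noise, purely from leakage
```

Repeating the selection inside each CV fold instead drives the accuracy back to chance, which is the same remedy as keeping any batch-correction step blind to the held-out samples.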

  2. I really like the idea of applying SCRuB to TCGA and appreciate your transparent note that "[TCGA's] ‘normal’ samples are not an ideal control. If cancer associated bacteria are also present in the normal tissue this would remove such a signal." Your concern is indeed my concern, as the largest study on this topic by number of experimental contamination controls (Nejman et al. 2020 Science) found very strong compositional similarities between tumor and normal adjacent tissues (NATs); see, for example, their Figure 4F, which is based on Jaccard similarity.

If it helps, these NATs in TCGA or Nejman et al. are not true normals and likely have microbial compositions distinct from those of truly normal tissue (perhaps due to field cancerization, though this remains unclear to date). Given the tumor-NAT similarity, there are potentially other ways to apply SCRuB, but its application in this way is likely removing true signal.
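For reference, the Jaccard similarity underlying that comparison is just the overlap of two presence/absence taxon sets (the taxon names below are made up for illustration):

```python
def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| on presence/absence sets."""
    return len(a & b) / len(a | b)

# Toy tumor and normal-adjacent-tissue (NAT) taxon profiles
tumor = {"Fusobacterium", "Prevotella", "Streptococcus", "Bacteroides"}
nat = {"Fusobacterium", "Prevotella", "Streptococcus", "Lactobacillus"}

print(jaccard(tumor, nat))  # 3 shared of 5 total taxa -> 0.6
```

A high tumor-NAT Jaccard similarity is exactly the situation in which using NATs as SCRuB controls would subtract shared, and potentially genuine, signal.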


Questions that I'm interested in:

gtonkinhill commented 11 months ago

Hi @gregpoore, thanks for considering my analysis and apologies if I have misrepresented any of the methods in the original paper. I will go through your comments carefully and update the blog as necessary.

Gscorreia89 commented 10 months ago

Hi,

Really nice to see this type of transparent analysis and discussion, and I love the GitHub format for it. Just wanted to share some (hopefully useful!) comments:

1. This classic reference feels very relevant here. Following on from your logistic model, I think it might be a good idea to downsample the full dataset using propensity score matching or a similar approach (e.g. using something like MatchIt) to re-balance and remove the association between the batch-correction covariates and sample type (or even cancer type!), and then repeat the analysis.

2. "Batch/centre/hospital" effects in clinical research are never really about the literal "hospital". In other words, within a given hospital/geographical location, it is the distinct clinical services, wards, teams, or even surgical theatres used that matter. "Hospital/source_site" is mainly an aggregate of all that plus population effects, which are likely not relevant here. We really need specific negative/blank controls for every location where collection happens. The closest thing to that are the matched "normal" samples, but even then it would be important to track down the exact protocols used for their collection and evaluate their suitability on a per-centre/tissue-type basis.

Overall, I also don't think the TCGA dataset has the correct design to explore this question.
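The propensity score matching suggested above can be sketched without MatchIt. Below is a toy Python version under stated assumptions: a hand-rolled logistic fit for the propensity score and greedy 1:1 nearest-neighbour matching with a caliper; all variable names, sample sizes, and the caliper value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy imbalance: a covariate (stand-in for sample type) shifted by "centre"
n = 200
centre = rng.integers(0, 2, n)                  # 0/1 sequencing-centre label
covar = rng.normal(loc=centre.astype(float), scale=1.0)

# Propensity score P(centre = 1 | covar) via simple gradient descent
w = b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * covar + b)))
    grad = p - centre
    w -= 0.05 * np.mean(grad * covar)
    b -= 0.05 * np.mean(grad)
ps = 1.0 / (1.0 + np.exp(-(w * covar + b)))

# Greedy 1:1 nearest-neighbour matching on the propensity score
caliper = 0.05
treated = np.where(centre == 1)[0]
controls = list(np.where(centre == 0)[0])
pairs = []
for i in treated:
    if not controls:
        break
    j = min(controls, key=lambda k: abs(ps[i] - ps[k]))
    if abs(ps[i] - ps[j]) < caliper:
        pairs.append((i, j))
        controls.remove(j)

# Covariate imbalance before vs after matching
unmatched_diff = abs(covar[centre == 1].mean() - covar[centre == 0].mean())
t_idx = [i for i, _ in pairs]
c_idx = [j for _, j in pairs]
matched_diff = abs(covar[t_idx].mean() - covar[c_idx].mean())
print(len(pairs), unmatched_diff, matched_diff)  # matched gap shrinks
```

In practice MatchIt (or similar) adds important diagnostics, such as standardized mean differences before and after matching, that this sketch omits.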

clee700 commented 10 months ago

Hi, thank you for putting all this together. I have an additional comment. Like you, I was concerned that if disease type had been used in the SNM normalization, that might create false signals. I looked into their code for voom-SNM normalization. Specifically, I looked at how they created their design matrices on their GitHub.


I think 'disease_type_consol' means "disease type, consolidated". I searched their metadata for examples of 'disease_type_consol'. I didn't check every metadata file, but I found two files with this column in the TCGA/Kraken folder. When I checked these labels against 'disease_type', they lined up almost exactly.
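That consistency check can be made explicit with pandas: if the two columns encode the same grouping, every disease_type value should map to exactly one disease_type_consol value. The data below are toy stand-ins, not the actual labels in the TCGA/Kraken metadata files:

```python
import pandas as pd

# Illustrative metadata; the real files live in the repo's TCGA/Kraken folder
meta = pd.DataFrame({
    "disease_type": ["Lung Adenocarcinoma", "Lung Adenocarcinoma",
                     "Breast Invasive Carcinoma", "Colon Adenocarcinoma"],
    "disease_type_consol": ["LungAdenocarcinoma", "LungAdenocarcinoma",
                            "BreastInvasiveCarcinoma", "ColonAdenocarcinoma"],
})

# Count distinct consolidated labels per disease type; all 1s means each
# disease_type maps to a single consolidated label (i.e., the columns agree)
per_type = meta.groupby("disease_type")["disease_type_consol"].nunique()
print(bool((per_type == 1).all()))  # True -> the labels line up
```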


This suggests to me that disease type actually was used in the normalization process, although it may not have been intentional. The 'disease_type_consol' column does not exist in every metadata file, so it is also unclear how often this error may have occurred.