shuxiaoc / maxfuse

Batch correction, use of labels, and pre-processing to align populations #13

Open dr-michael-haley opened 1 month ago

dr-michael-haley commented 1 month ago

Hi,

I am adapting your second tutorial to integrate IMC data with scRNAseq.

Do you have any advice for incorporating batch correction, particularly when working with protein data (specifically, IMC)? We commonly use BBKNN or Harmony through Scanpy, but since these correct at the neighbour-graph or PCA stage, they do not affect the raw values that are provided to MaxFuse. However, any labels provided will have been generated on the batch-corrected data, so I assume that provides some degree of batch correction during the smoothing steps? Theoretically, we could use a batch correction method that adjusts the raw data (e.g. ComBat)...
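For concreteness, here is a minimal Scanpy sketch of the two routes mentioned above: correcting the values themselves with ComBat, or correcting only an embedding with Harmony and deriving the labels from it. The file name, batch key, and parameter choices are placeholders for illustration, not anything prescribed by MaxFuse.

```python
import scanpy as sc

# Hypothetical AnnData of transformed IMC protein intensities,
# with a per-sample batch annotation in .obs['batch'].
imc = sc.read_h5ad("imc_protein.h5ad")

# Option A: correct the expression values themselves (ComBat),
# so the matrix handed to MaxFuse is already batch-adjusted.
sc.pp.combat(imc, key="batch")

# Option B: leave .X untouched and correct only an embedding (Harmony),
# then cluster/label on that embedding (and optionally pass it to the
# refined-matching step as extra features).
sc.pp.pca(imc, n_comps=30)
sc.external.pp.harmony_integrate(imc, key="batch")  # writes .obsm['X_pca_harmony']
sc.pp.neighbors(imc, use_rep="X_pca_harmony")
sc.tl.leiden(imc, key_added="cluster")  # labels now reflect the batch correction
```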

Secondly, does it matter if the labels do not match between the modalities? I realise that if the names are the same it will be easier to interpret the accuracy of the matching (e.g. using a confusion matrix), but does it otherwise impact the matching?

Should the labels be matched as closely as possible between modalities in terms of how specific the labels are? For example, 'T cells' in one dataset, and 'CD4 T-cells', 'CD8 T-cells' in another.

How do you handle populations that will be undetectable in one of the datasets? For example, we may not have markers in our panel to identify some populations in the tissue, even though they will appear in the scRNAseq dataset.

Thanks for creating such an amazing tool, it's really impressive and a huge step forward from previous attempts!

BokaiZhu commented 2 weeks ago

Hi,

Thanks for your interest in the method!

Question 1, about batch correction: yes, this is indeed a very important point. Correction at the neighbour-graph or PCA stage leaves the raw counts untouched; only the clustering labels are affected. As you described, if the labels are generated from batch-corrected embeddings, they provide some level of batch smoothing. In addition, the batch-corrected embeddings can be used as 'features' in the refined-matching step, which provides further 'batch smoothing' of the matching results. For example, in our HuBMAP tri-modality integration, we used Harmony-corrected PCs (e.g., scATAC-LSI) as input for refined matching. As you mentioned, the raw values could also be 'corrected' and used during initial matching. This could work in theory, but we have not tested it comprehensively.
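A rough sketch of how the corrected embeddings and labels could be wired in. The `Fusor` constructor path and argument names below are recalled from the MaxFuse tutorials and should be checked against the version you have installed; `rna`, `imc`, `shared_genes`, and `shared_proteins` are placeholder objects (AnnData with corrected embeddings in `.obsm` and cluster labels in `.obs`, plus your lists of corresponding RNA/protein features).

```python
import maxfuse as mf

fusor = mf.model.Fusor(
    shared_arr1=rna[:, shared_genes].to_df().to_numpy(),      # RNA counterparts of the IMC panel
    shared_arr2=imc[:, shared_proteins].to_df().to_numpy(),   # protein markers shared with RNA
    active_arr1=rna.obsm["X_pca_harmony"],                    # batch-corrected features, RNA side
    active_arr2=imc.obsm["X_pca_harmony"],                    # batch-corrected features, IMC side
    labels1=rna.obs["cluster"].to_numpy(),                    # labels derived from corrected embedding
    labels2=imc.obs["cluster"].to_numpy(),
)
# ...then proceed with the usual pipeline (graph construction, initial pivots,
# refined matching, filtering) as in the tutorials.
```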

Question 2, do the labels need to be matched? No, this is not required. The matched label names in our manuscript are only for benchmarking purposes (e.g., to calculate confusion matrices). For an actual run, the labels do not need to be name-matched.

Question 3, how do we handle missing cell types? If the missing cell types are not known ahead of time, the method itself handles this to some degree (e.g., by filtering out low-quality matches, which typically arise when a cell has no counterpart in the other modality). However, if the missing cell types are prior knowledge, best practice is to remove them before matching: when a good match is simply not possible for these cells, it is natural to exclude them from the matching task.
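For instance, a hypothetical subsetting step on the scRNA-seq AnnData (the label column and cell-type names here are placeholders):

```python
# Cell types annotated in the scRNA-seq data but not resolvable with the IMC panel.
undetectable_in_imc = ["pDC", "ILC"]  # placeholder names

# Drop them from the RNA side before constructing the Fusor / running the matching.
rna = rna[~rna.obs["cell_type"].isin(undetectable_in_imc)].copy()
```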

Please let me know if this clarification makes sense and if you have any additional questions!

Best, Bokai