Would like to use harmonized PC as input to minimize batch effect

hurleyLi commented 2 years ago

Hi, we recently tried MC2 on a dataset with a known batch effect, and each resulting metacell is clearly composed of cells from a single sample. We noticed the same behavior when analyzing a few public datasets, e.g. from Smillie et al., Cell: cells from subject N661 only formed metacells among themselves, i.e. the metacells were sample-specific.

I'm wondering whether it's possible to use harmonized PCs as input for calculating similarity instead of raw counts (while the resulting "counts" would still be aggregated from the raw counts). I've been digging through your code and think this could perhaps be implemented somewhere in _compute_elements_similarity(). However, I can't simply change the "what" parameter to a _harmonized_PCA layer, since we only want to use 50 PCs, which has a different dimension than the cell x gene matrix.

Could you please add a feature such as calculating similarity based on adata.obsm['X_pca'], or provide some guidance on how to implement it ourselves?

Thanks a lot! Hurley

orenbenkiki commented 2 years ago

In general, we believe using PCA coefficients as input for MC is "the wrong thing to do". MCs allow us to group together cells while taking into account the full richness of the cells' gene expression; each MC gives us a robust estimation of the expression of all the genes. Once MCs have been computed, one can use various techniques to group them into cell types, which naturally discards a lot of this information.

If you are dead set on computing MCs on PCA coefficients, you can create a new AnnData object that contains the same number of observations where the variables are the PCA coefficients, compute MCs on this, and then just copy the metacell indices per-observation annotation to the original AnnData (where the variables are genes).

hurleyLi commented 2 years ago

Thanks for the explanation!

> If you are dead set on computing MCs on PCA coefficients, you can create a new AnnData object that contains the same number of observations where the variables are the PCA coefficients, compute MCs on this, and then just copy the metacell indices per-observation annotation to the original AnnData (where the variables are genes).

Since PC values are not integers and some of them are negative, do you think we can use the PC values directly as input to MC2, or is it necessary to make them look like "raw counts"? Thanks!

hurleyLi commented 2 years ago

Still a little confused here...

> MCs allow us to group together cells while taking into account the full richness of the cells' gene expression

One of the steps in MC is selecting feature genes, so I'm not sure what you mean by "taking into account the full richness". PCs also "take into account the full richness" in that sense, and they capture it while getting rid of the noise in single-cell data.

> each MC gives us a robust estimation of the expression of all the genes. Once MCs have been computed, one can use various techniques to group them into cell types, which naturally discards a lot of this information.

This still doesn't really solve the problem of sample-specific effects. Are you suggesting directly merging MCs from different samples, followed by some sort of batch correction?

orenbenkiki commented 2 years ago

It is true that we pick feature genes for computing the KNN graph, but outlier detection looks at all the genes. Also, due to the divide-and-conquer algorithm, feature selection is adaptive (so in a pile of only T-cells, feature selection would pick genes that distinguish between sub-types of T-cells and ignore the genes that separate T-cells from other types). These second-level-of-detail genes might very well be discarded as "noise" by PCA. This is more of an issue for rare cell types (that is, when working on piles that contain outliers-of-outliers-of-... etc.).

Sample-specific effects - yes, one should correct for batch effects before computing MCs.

PCs being fractional and negative - good point. But since we compute correlations, the result is not sensitive to linear transformations, so you could convert the PCs to something that looks "like" counts. You might want to set cells_similarity_log_data=False here.
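A quick numerical check of the invariance argument, as a minimal sketch on synthetic data (the matrix below is random noise standing in for 6 cells x 50 PCs, and the shift and scale constants are arbitrary assumptions). It applies one global shift and scale so that all values become non-negative, then compares the cell-to-cell Pearson correlations before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
pcs = rng.normal(size=(6, 50))  # synthetic stand-in: 6 cells x 50 PCs

# One global shift and scale, so every value becomes non-negative.
scale = 100.0  # arbitrary positive constant
count_like = (pcs - pcs.min()) * scale

# Pearson correlation between cells (rows), before and after the transform.
before = np.corrcoef(pcs)
after = np.corrcoef(count_like)
unchanged = np.allclose(before, after)
```

Rounding the shifted values to integers would break exact equality (only slightly for a large scale). A log transform is not linear and would change the correlations, which is presumably why disabling it is suggested above.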

hurleyLi commented 2 years ago

Thanks for the suggestion!

tanaylab / metacells

Would like to use harmonized PC as input to minimize batch effect #22