immunogenomics / harmony

Fast, sensitive and accurate integration of single-cell data with Harmony
https://portals.broadinstitute.org/harmony/

Can I obtain my original data shape? #247

Open leeleavitt opened 6 months ago

leeleavitt commented 6 months ago

I would like to use Harmony to normalize my data, but I need the data in its original shape for other parts of my analysis pipeline.

Harmony takes as input principal components ($PC$), and outputs corrected principal components ($PC'$).

Every application of Harmony I've seen takes only the top $k$ principal components. Since I need the original data structure, I would instead input all principal components, provided the assumptions below hold.

The general approach I am considering is to compute the principal components using singular value decomposition (SVD):

$$A = U S V^T$$

Where $U$ and $V$ are orthogonal matrices, and $S$ is a diagonal matrix containing the singular values.

I assume $U S$ can be treated as the full set of principal components $PC$. Harmony normalization then transforms $PC$ into $PC'$.

Harmony normalizes all principal components:

$PC \rightarrow Harmony \rightarrow PC'$

I then assume

$PC' \equiv U' S' \equiv (U S)'$

I then reconstruct my data in its original shape, now normalized:

$(U S)' V^T = A'$

Is this valid?
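To make the question concrete, here is a minimal NumPy sketch of the round trip above. Several things are assumptions for illustration only: the data are synthetic, `A` is laid out as cells by genes, column means are subtracted before the SVD (as PCA conventionally does) and added back at the end, and `batch_center` is a hypothetical placeholder that merely subtracts per-batch means. It is not Harmony; it only stands in for the $PC \rightarrow PC'$ step.

```python
import numpy as np


def batch_center(pcs, batch):
    """Hypothetical stand-in for Harmony: subtract each batch's mean embedding."""
    corrected = pcs.copy()
    for b in np.unique(batch):
        mask = batch == b
        corrected[mask] -= pcs[mask].mean(axis=0)
    return corrected


rng = np.random.default_rng(0)
n_cells, n_genes = 60, 25
batch = np.repeat([0, 1], n_cells // 2)

# Synthetic data (cells x genes) with an artificial shift in batch 1.
A = rng.normal(size=(n_cells, n_genes))
A[batch == 1] += 2.0

# Center the genes, then take the thin SVD: A - mu = U S V^T.
mu = A.mean(axis=0)
U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)
PC = U * s  # all principal components, shape (n_cells, min(n_cells, n_genes))

# Without any correction, the round trip reproduces A up to floating-point error.
A_roundtrip = PC @ Vt + mu

# With a correction step, (U S)' V^T gives a matrix in the original shape.
PC_corr = batch_center(PC, batch)
A_corr = PC_corr @ Vt + mu
```

The algebra holds for any correction applied to the rows of $PC$, because $V^T$ is left untouched. Whether the reconstructed values are meaningful after a real Harmony run is a separate question that this sketch does not answer.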

hongchengyao commented 3 months ago

Hi @leeleavitt, thanks for using Harmony! Please correct me if my understanding is wrong: I think you want to use Harmony to batch-correct the original data (count matrix or log-normalized data) instead of the PCs, i.e., 1) convert the original data (count matrix or log-normalized data) to PCs, 2) run Harmony on the PCs to get corrected PCs, 3) convert the corrected PCs back to the original data format (count matrix or log-normalized data).

leeleavitt commented 3 months ago

Yes exactly

hongchengyao commented 3 months ago

Hi @leeleavitt, I think what you proposed is doable and valid in terms of the equations, but there are two main issues with this idea.

1) Computational feasibility. I'm not sure about the size of your input matrix (number of genes by number of cells), especially the number of genes. Harmony is optimized for a small number of input PCs (usually just 20), so including all PCs (i.e., as many as there are genes) may make it substantially slower and consume much more memory than it was designed for.

2) Batch correction performance at the level of the original data format. Although it's possible to convert the corrected PCs back to the original data format, Harmony has never been tested in this scenario, and we can't promise anything about its performance, especially for downstream analyses such as DEG analysis.
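As a rough illustration of the first point, the back-of-the-envelope calculation below compares the size of the embedding matrix alone for 20 PCs versus one PC per gene. The cell and gene counts are hypothetical, 8-byte floats are assumed, and Harmony's actual working memory will be larger than this single matrix.

```python
# Hypothetical dataset size; substitute your own numbers.
n_cells = 100_000
n_genes = 20_000
bytes_per_value = 8  # assuming double-precision floats

# Size of the cells x PCs embedding for the usual 20 PCs vs. one PC per gene.
top20_bytes = n_cells * 20 * bytes_per_value
all_pcs_bytes = n_cells * n_genes * bytes_per_value

print(f"20 PCs:  {top20_bytes / 1e6:.0f} MB")
print(f"all PCs: {all_pcs_bytes / 1e9:.0f} GB")
print(f"ratio:   {all_pcs_bytes // top20_bytes}x")
```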

Of course, nothing prevents you from using Harmony this way. My suggestion would be to reduce the number of PCs as much as possible if you run into computational problems. Depending on your purpose, approximating with the top N PCs may not be too bad.
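One way to judge how much a top-N approximation loses, before any batch correction is involved, is to look at the discarded singular values: the Frobenius-norm error of the rank-N reconstruction equals the square root of the sum of their squares. The sketch below checks this on synthetic data; the matrix sizes, the low-rank-plus-noise structure, and the choice of `k = 20` are assumptions for illustration, and no Harmony step is included.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes, k = 200, 50, 20

# Synthetic cells x genes matrix: rank-5 signal plus small noise, then centered.
A = rng.normal(size=(n_cells, 5)) @ rng.normal(size=(5, n_genes))
A += 0.1 * rng.normal(size=(n_cells, n_genes))
A -= A.mean(axis=0)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k principal components and map them back to gene space.
PC_k = U[:, :k] * s[:k]   # shape (n_cells, k): what would be passed on for correction
A_k = PC_k @ Vt[:k]       # shape (n_cells, n_genes): rank-k reconstruction

# Reconstruction error vs. the value predicted from the discarded singular values.
frob_err = np.linalg.norm(A - A_k)
expected_err = np.sqrt(np.sum(s[k:] ** 2))

# Fraction of total variance retained by the top-k components.
explained = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(f"error: {frob_err:.4f}, predicted: {expected_err:.4f}, variance kept: {explained:.4f}")
```

If the retained variance is close to 1 on real data, reconstructing from corrected top-N PCs discards little of the original signal; if it is not, the dropped components would have to be handled separately.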