stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data

https://stuartlab.org/signac/

RunSVD cell embeddings not fully transformed? #1247

Closed danielcgingerich closed 1 year ago

danielcgingerich commented 2 years ago

I see in the source code that the cell embeddings are the left singular vectors of the SVD. Shouldn't these be multiplied by the singular values? A Stack Exchange post relating SVD to PCA says:

The SVD of a matrix M is the factorization M = U S V^T, where the columns of U are the left singular vectors, S is the diagonal matrix of singular values, and the columns of V are the right singular vectors.
The principal components (transformed data) are given by US.

In the source code, U is not multiplied by S:

components <- irlba(A = t(x = object), nv = n, work = irlba.work)
feature.loadings <- components$v
sdev <- components$d / sqrt(x = max(1, nrow(x = object) - 1))
cell.embeddings <- components$u

timoast commented 2 years ago

This is intentional: we are not performing PCA here. The input matrix is also not scaled and centered, which would be required for this to be equivalent to PCA.
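The link between SVD and PCA mentioned here can be checked on a toy matrix. The following is a minimal sketch using base R `svd()` and `prcomp()` rather than `irlba`. The matrix `m` and all variable names are made up for illustration and are not part of Signac. It suggests that US reproduces the PCA scores only when the columns are centered first.

```r
set.seed(1)

# Hypothetical toy data: 6 observations (rows) x 4 variables (columns)
m <- matrix(rnorm(24), nrow = 6, ncol = 4)

# PCA scores from base R; prcomp centers the columns, scaling is switched off
pca <- prcomp(m, center = TRUE, scale. = FALSE)
pca_scores <- unname(pca$x)

# SVD of the column-centered matrix, then U %*% S
m_centered <- scale(m, center = TRUE, scale = FALSE)
s_centered <- svd(m_centered)
us_centered <- s_centered$u %*% diag(s_centered$d)

# SVD of the raw (uncentered) matrix, then U %*% S
s_raw <- svd(m)
us_raw <- s_raw$u %*% diag(s_raw$d)

# Compare absolute values, since singular vectors are only defined up to sign
centered_matches <- isTRUE(all.equal(abs(pca_scores), abs(us_centered)))
raw_matches <- isTRUE(all.equal(abs(pca_scores), abs(us_raw)))

# The PCA standard deviations are the singular values divided by sqrt(n - 1)
sdev_matches <- isTRUE(all.equal(pca$sdev, s_centered$d / sqrt(nrow(m) - 1)))

print(c(centered = centered_matches, raw = raw_matches, sdev = sdev_matches))
```

In this sketch the centered decomposition agrees with `prcomp()` and the uncentered one does not, which is consistent with the point that an SVD of an uncentered matrix is not PCA.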

danielcgingerich commented 2 years ago

Why is this? The Wikipedia article on LSI also mentions transforming the singular vectors by multiplying by Sigma before comparing them. Shouldn't this be done on the embeddings before clustering? "See how related documents j and q are in the low-dimensional space by comparing the vectors Sigma (d_j) and Sigma (d_q) (typically by cosine similarity)" https://en.wikipedia.org/wiki/Latent_semantic_analysis#:~:text=the%20intended%20sense.-,Derivation,-%5Bedit%5D
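To make the question concrete, here is a small sketch of how cosine similarity between two rows changes depending on whether the left singular vectors are scaled by the singular values. It uses base R `svd()` on a made-up count matrix. The names `x`, `k`, and `cosine` are hypothetical and chosen only for this example.

```r
set.seed(7)

# Hypothetical toy matrix: 12 "cells" (rows) x 10 "features" (columns)
x <- matrix(rpois(120, lambda = 2), nrow = 12, ncol = 10)

# Keep the first k components
k <- 3
s <- svd(x, nu = k, nv = k)

u <- s$u                       # unscaled embeddings (U)
us <- u %*% diag(s$d[1:k])     # embeddings scaled by singular values (U S)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity between the first two rows in each embedding
cos_u <- cosine(u[1, ], u[2, ])
cos_us <- cosine(us[1, ], us[2, ])

print(c(unscaled = cos_u, scaled = cos_us))
```

The two values generally differ, because multiplying by the singular values stretches the leading components relative to the later ones.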

timoast commented 2 years ago

There are many "variants" of LSI, and we based the method implemented in Signac on what was described by Cusanovich et al. Using the unscaled left singular vectors gives equal weighting to each component, which may help to resolve smaller cell populations whose variation is captured in later LSI components (those with smaller singular values). Empirically, this seems to perform well on scATAC-seq data.
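The "equal weighting" point can be illustrated with a short sketch. It again uses base R `svd()` on a made-up matrix rather than `irlba` on real scATAC-seq data, and all names are hypothetical. Each column of U has unit length, whereas each column of US has length equal to its singular value, so distances computed on US are dominated by the leading components.

```r
set.seed(42)

# Hypothetical toy matrix: 20 "cells" (rows) x 10 "features" (columns)
x <- matrix(rpois(200, lambda = 3), nrow = 20, ncol = 10)

k <- 5
s <- svd(x, nu = k, nv = k)

u <- s$u                       # unscaled embeddings, as in the snippet quoted above
us <- u %*% diag(s$d[1:k])     # embeddings scaled by singular values

# Euclidean length of each component (column)
u_norms <- sqrt(colSums(u^2))
us_norms <- sqrt(colSums(us^2))

# Share of total squared length carried by each component
u_share <- u_norms^2 / sum(u_norms^2)
us_share <- us_norms^2 / sum(us_norms^2)

print(round(rbind(u_share, us_share), 3))
```

In the unscaled embedding every component contributes the same share (1/k), while in the scaled embedding the shares follow the squared singular values.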