Closed danielcgingerich closed 1 year ago
This is intentional: we are not performing PCA here. The input matrix is also not scaled and centered, which would be required for this to be equivalent to PCA.
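For readers following along, the standard relationship between SVD and PCA that this reply refers to can be written out as below. This is a generic textbook sketch, not a description of the Signac code; the symbols $X$ (an $n \times p$ matrix with column means subtracted), $U$, $\Sigma$, $V$, and $C$ are notation assumed here for illustration.

$$
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
C = \frac{X^{\top} X}{n-1} = V \, \frac{\Sigma^{2}}{n-1} \, V^{\top},
\qquad
X V = U \Sigma
$$

So when $X$ is centered, the columns of $V$ are the principal axes and $U\Sigma$ gives the principal component scores. If $X$ is not centered, $X^{\top}X/(n-1)$ is no longer the covariance matrix, and neither $U$ nor $U\Sigma$ corresponds to PCA scores.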
Why is this? The Wikipedia article on LSI also mentions transforming the singular vectors by multiplying by sigma before comparing them. Shouldn't this be done on the embeddings before clustering? "See how related documents j and q are in the low-dimensional space by comparing the vectors Sigma (d_j) and Sigma (d_q) (typically by cosine similarity)" https://en.wikipedia.org/wiki/Latent_semantic_analysis#:~:text=the%20intended%20sense.-,Derivation,-%5Bedit%5D
There are many "variants" of LSI, and we based the method implemented in Signac on what was described by Cusanovich et al. Leaving the singular vectors unscaled gives equal weighting to each component, and may help to resolve smaller cell populations whose variation is captured in later LSI components (smaller singular values). Empirically, this seems to perform well on scATAC-seq data.
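To make the weighting argument concrete, here is a minimal sketch in plain Python. It is not Signac code. The singular values and the two "cell" rows are made-up numbers, and `cosine` and `scale_by_sigma` are hypothetical helpers written for this example. It compares cosine similarity between two rows of U with and without multiplying by the singular values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def scale_by_sigma(u_row, sigma):
    """Multiply each component of a row of U by its singular value."""
    return [u * s for u, s in zip(u_row, sigma)]

# Hypothetical values: two components, the first with a much larger
# singular value than the second.
sigma = [10.0, 1.0]

# Hypothetical rows of U for two cells that agree on component 1
# and differ only in the sign of component 2.
cell_a = [0.5, 0.5]
cell_b = [0.5, -0.5]

# Unscaled (equal weighting per component): the cells look orthogonal.
unweighted = cosine(cell_a, cell_b)

# Scaled by sigma: component 1 dominates, so the cells look nearly identical.
weighted = cosine(scale_by_sigma(cell_a, sigma), scale_by_sigma(cell_b, sigma))

print(f"cosine on U:       {unweighted:.3f}")
print(f"cosine on U*Sigma: {weighted:.3f}")
```

In this toy case the difference between the two cells lives entirely in the second component. With unscaled U it fully separates them, while after scaling by the singular values it is almost invisible. That is the trade-off described above: equal weighting can preserve structure carried by later components, at the cost of also giving noise in those components more influence.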
I see in the source code that the cell embeddings are the left singular vectors of the SVD. However, what about multiplying these by the singular values? This Stack Exchange post relating SVD to PCA says
In the source code, U is not multiplied by S: