scverse / muon

muon is a multimodal omics Python framework
https://muon.scverse.org/
BSD 3-Clause "New" or "Revised" License
208 stars 30 forks source link

Clearer documentation for `ac.tl.lsi` #87

Open alexlenail opened 1 year ago

alexlenail commented 1 year ago

It looks like the function ac.tl.lsi is mostly just a call to scipy.sparse.linalg.svds.

https://github.com/scverse/muon/blob/d0e2c22896a6ea1c5617ed791ce4bf8b6ae1ab55/muon/_atac/tools.py#L52

I think this function expects that you've run ac.pp.tfidf beforehand, but that isn't mentioned in the docstring of the function / the API docs.

This is especially confusing because Muon's normalization docs read:

One of them is constructing term-document matrix from the original count matrix. This is typically followed by singular value decomposition (SVD) — the same technique that conventional principal component analysis (PCA) uses — to generate latent components, and these two steps together are referred to as latent semantic indexing (LSI).

From that explanation, you would expect a function called lsi() to run tf-idf() and then svd() -- so would take a raw counts matrix as input. But it seems like the function lsi() expects you to have already run tf-idf().

Clarification of the docs would help here.

gtca commented 1 year ago

Thanks for noting that, @alexlenail!

It might be also worthwhile to repurpose ac.tl.lsi() to run both TF-IDF and SVD while adding ac.tl.svd(). This should also make it more intuitive for the users of other commonly used toolkits for scATAC-seq processing.

We're currently refactoring the ATAC module, and that might be a welcome change to make these steps more intuitive. If you have more thoughts on this or other matters, please do share, whether here or e.g. on Zulip (https://scverse.org/join/).