Why is CPM the default normalization?

openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

MIT License


Why is CPM the default normalization? #773

Closed adamgayoso closed 1 year ago

adamgayoso commented 1 year ago

https://github.com/openproblems-bio/openproblems/blob/3d8964a6c02496c0c604f0b1ddadc40589ca43a8/openproblems/tools/normalize.py#L44-L49

It's much more standard to use CP10k or counts per median library size. CPM might heavily distort zeros vs. non-zeros.
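To make the zero vs. non-zero point concrete, here is a minimal sketch in plain NumPy. It is not the code in `normalize.py`: the helper name `scale_counts`, the toy count matrix, and the `log1p` step applied after scaling are all assumptions made for illustration. It compares how far a single observed count ends up from zero under each scaling target.

```python
import numpy as np


def scale_counts(counts, target_sum=None):
    """Hypothetical helper: scale each cell (row) to sum to target_sum.

    If target_sum is None, the median library size across cells is used.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    if target_sum is None:
        target_sum = np.median(lib_size)
    # Avoid division by zero for empty cells (they stay all-zero).
    safe_lib_size = np.where(lib_size == 0, 1.0, lib_size)
    return counts / safe_lib_size * target_sum


# Toy data (assumed): 3 cells x 4 genes, library sizes 2000, 5000, 8000.
counts = np.array(
    [
        [0, 1, 4, 1995],
        [0, 1, 9, 4990],
        [1, 0, 2, 7997],
    ]
)

log_cpm = np.log1p(scale_counts(counts, target_sum=1e6))
log_cp10k = np.log1p(scale_counts(counts, target_sum=1e4))
log_median = np.log1p(scale_counts(counts))  # counts per median library size

# In cell 1 (library size 5000), gene 0 has count 0 and gene 1 has count 1.
# The difference below is the gap between "not detected" and "detected once".
gap_cpm = log_cpm[1, 1] - log_cpm[1, 0]
gap_cp10k = log_cp10k[1, 1] - log_cp10k[1, 0]
gap_median = log_median[1, 1] - log_median[1, 0]

print(f"CPM gap:    {gap_cpm:.3f}")
print(f"CP10k gap:  {gap_cp10k:.3f}")
print(f"median gap: {gap_median:.3f}")
```

With library sizes in the low thousands, one count is scaled to a few hundred under CPM but only to about one or two under CP10k or median scaling, so after `log1p` the jump from zero to one count is several times larger under CPM.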

LuckyMD commented 1 year ago

Agreed that 10k would be better. I recall a recent paper (from Lior?) arguing something like this as well. Either way, I'd prefer scran, but that's slow... Do you want to open a PR?

lazappi commented 1 year ago

I think this happened because, for some reason, CPM was included in the dimensionality reduction task and then carried over to everything when we made these generic functions. It always seemed weird to me as well, so I'm happy to have it replaced with something more standard.

LuckyMD commented 1 year ago

We have log_scran implemented in utils elsewhere as well... but that might make this task a lot slower. 10k is fine.

scottgigante-immunai commented 1 year ago

I'm fine with this change. Worth noting that the PR would have to change many text references to CPM in method names, function names and the like.