maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

Improve binning of expression values for SCTransform #202

Closed mxposed closed 3 years ago

mxposed commented 3 years ago

Fixes #173

The previous algorithm binned unique expression values, implicitly assuming that expression values would differ for most cells. SCTransform log-expression values are often identical for low-expressing cells, so the bins did not end up with approximately equal numbers of cells. Here I bin cells instead of unique expression values; the binning is more even for SCTransform and almost unchanged for default log-normalized values. The speed is approximately the same.
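To illustrate the problem with binning unique values, here is a minimal sketch. It is not the cellBrowser code: the function name `bins_over_unique_values` and the toy data are made up for this example. It splits the sorted unique values into equal-length groups and reports how many cells land in each group.

```python
from collections import Counter

def bins_over_unique_values(values, n_bins):
    """Hypothetical helper: group the sorted *unique* values into n_bins
    chunks of equal length and return (min, max, cell_count) per chunk."""
    uniq = sorted(set(values))
    size = max(1, -(-len(uniq) // n_bins))  # ceiling division
    cells_per_value = Counter(values)
    bins = []
    for i in range(0, len(uniq), size):
        chunk = uniq[i:i + size]
        bins.append((chunk[0], chunk[-1], sum(cells_per_value[v] for v in chunk)))
    return bins

# Toy data: 100 cells, 8 unique values, most cells share the lowest values
# (similar in spirit to tie-heavy SCTransform output).
values = [0.7] * 60 + [1.1] * 25 + [1.4] * 10 + [1.6, 1.8, 2.0, 2.2, 2.4]
cell_counts = [count for _, _, count in bins_over_unique_values(values, 4)]
print(cell_counts)  # [85, 11, 2, 2]
```

Each bin covers two unique values, but the first bin holds 85 of the 100 cells, which is the uneven binning described above.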

maximilianh commented 3 years ago

Oh. Have you tried this with a counts-based matrix? This is the way I used to have it and the binning for many 10X matrices and UMI counts would look like this:

bin0: 0-0 bin1: 0-0 bin2: 0-0 bin3: 0-0 etc.
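A small sketch of how that output can arise, assuming the simplest possible cell-based binning (sort the cells, cut into equal-sized chunks, ignore ties). The function name and the toy counts are hypothetical and only meant to reproduce the symptom, not any actual cbBuild code.

```python
def naive_equal_count_bins(values, n_bins):
    """Hypothetical helper: cut the sorted per-cell values into n_bins
    chunks of equal size and return the (min, max) range of each chunk."""
    vals = sorted(values)
    size = max(1, -(-len(vals) // n_bins))  # ceiling division
    return [(vals[i], vals[min(i + size, len(vals)) - 1])
            for i in range(0, len(vals), size)]

# Toy UMI counts for one gene: 90% of the cells are 0.
counts = [0] * 90 + [1] * 6 + [2] * 3 + [7]
ranges = naive_equal_count_bins(counts, 5)
print(ranges)  # [(0, 0), (0, 0), (0, 0), (0, 0), (0, 7)]
```

Because the cuts fall inside the long run of zeros, four of the five bins cover only the value 0.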

Maybe we could add an option for the binning method in cellbrowser.conf? Or maybe auto-detect the optimal way in cbBuild?

mxposed commented 3 years ago

I'll check with the counts data.

maximilianh commented 3 years ago

I'm pretty sure it won't look good with counts data. Counts data is 99% 0s.

mxposed commented 3 years ago

Please check here: https://mxposed.github.io/cellBrowser/?ds=pbmc3k&gene=MALAT1 This is the 10x PBMC 3k dataset with exported counts.

I'm pretty satisfied with how it turned out.

maximilianh commented 3 years ago

Hey Nikolay, this looks great, so did you make another change now or does this simply rely on the existing special treatment for the value 0?

This counts dataset looks very different from the ones I have looked at, the bigger low depth 10x datasets sometimes have 80% 0s and 5% 1s... did you happen to try with another 10X dataset?

On Tue, Dec 29, 2020 at 6:12 AM Nikolay Markov notifications@github.com wrote:

Please, check here: https://mxposed.github.io/cellBrowser/?ds=pbmc3k&gene=MALAT1 This is 10x PBMC 3k dataset with exported counts.

I'm pretty satisfied with how it turned out


mxposed commented 3 years ago

I changed the base of this PR from master to develop.

Hey Nikolay, this looks great, so did you make another change now or does this simply rely on the existing special treatment for the value 0?

I did not make any additional changes. It treats 0 as a special value, as before. This approach computes the ideal number of cells to place in each bin, then finds the next break between bin values so that the number of cells in the bin is close to that ideal. It then recalculates the ideal number for the remaining bins.
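Roughly, the idea can be sketched like this. This is a simplified illustration rather than the code in the PR: the name `bin_cells` is made up, values are assumed to be non-negative, and the break is only ever moved forward to the next value change (the real code may pick the closer of the two neighbouring breaks).

```python
def bin_cells(values, n_bins):
    """Hypothetical sketch: greedy binning of cells.
    Returns a list of (min, max, cell_count) tuples."""
    bins = []
    # 0 is treated as a special value and gets its own bin.
    n_zero = sum(1 for v in values if v == 0)
    if n_zero:
        bins.append((0, 0, n_zero))
    rest = sorted(v for v in values if v != 0)

    start = 0
    bins_left = max(1, n_bins - len(bins))
    while start < len(rest):
        if bins_left <= 1:
            end = len(rest)  # last bin takes every remaining cell
        else:
            # Ideal bin size, recomputed from the cells and bins still left.
            ideal = (len(rest) - start) / bins_left
            end = min(len(rest), start + max(1, round(ideal)))
            # Only break where the value changes, so ties never straddle bins.
            while end < len(rest) and rest[end] == rest[end - 1]:
                end += 1
        bins.append((rest[start], rest[end - 1], end - start))
        start = end
        bins_left -= 1
    return bins

# Toy UMI counts for one gene: 80% zeros, then a few low values.
counts = [0] * 80 + [1] * 10 + [2] * 5 + [3] * 3 + [4, 5]
result = bin_cells(counts, 4)
print(result)  # [(0, 0, 80), (1, 1, 10), (2, 2, 5), (3, 5, 5)]
```

In this toy case no two bins share a value, so the `0-0, 0-0, ...` pattern cannot appear, and the non-zero cells are spread as evenly as the ties allow.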

This counts dataset looks very different from the ones I have looked at, the bigger low depth 10x datasets sometimes have 80% 0s and 5% 1s... did you happen to try with another 10X dataset?

Have you tried other genes? For instance https://mxposed.github.io/cellBrowser/?ds=pbmc3k&gene=PPBP or https://mxposed.github.io/cellBrowser/?ds=pbmc3k&gene=CCR7

maximilianh commented 3 years ago

Ohhh! Nice, this new commit is interesting... yes, just checking for a value change at the break is a nice idea, I didn't have that idea back then. Thanks!