maximilianh / cellBrowser

main repo: https://github.com/ucscGenomeBrowser/cellBrowser/ - Python pipeline and Javascript scatter plot library for single-cell datasets, http://cellbrowser.rtfd.org
GNU General Public License v3.0

color buckets distribution after sctransform normalization #173

Closed cotedivoir closed 2 years ago

cotedivoir commented 4 years ago

Hi!

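I have a weird issue with how the color gradient buckets are distributed.

This is an example of how a normal distribution should look:

[image: normal] https://user-images.githubusercontent.com/60413394/80890320-bf7d2980-8c73-11ea-8886-4dc3da4ac97c.png

This is what I get after I normalize and scale the same Seurat object using the SCTransform function:

[image: skewed] https://user-images.githubusercontent.com/60413394/80890649-d6bc1700-8c73-11ea-8d58-01101b046b21.png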

Sctransform: https://satijalab.org/seurat/v3.1/sctransform_vignette.html

Any idea why this happens?

Thank you very much in advance! And thanks again for working on cellbrowser!

maximilianh commented 4 years ago

Hi! Thanks!

Is it possible that 2.71 is a value that is very common in your expression vector? Is this a public dataset that I could look at?

cotedivoir commented 4 years ago

Is it possible that 2.71 is a value that is very common in your expression vector?

No, it doesn't look like that.

Is this a public dataset that I could look at?

No, unfortunately I can't publish it right now :(

maximilianh commented 4 years ago

I can pull out the code, but the way I do it is to sort the values and then take the 0th percentile, the 10th percentile, the 20th percentile, etc. as the breaks between the bins. I think I copied the code from R.
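A minimal Python sketch of the percentile approach described above. This is not the actual Cell Browser code: the function name `percentile_breaks` and the nearest-rank rule used to pick each percentile are assumptions made for illustration.

```python
def percentile_breaks(values, n_bins=10):
    """Return n_bins + 1 break points taken at evenly spaced percentiles.

    Hypothetical helper for illustration; not part of cellBrowser.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    last = len(ordered) - 1
    # Break k is the value found k/n_bins of the way through the sorted
    # list: the 0th, 10th, 20th, ... 100th percentile when n_bins is 10.
    return [ordered[(k * last) // n_bins] for k in range(n_bins + 1)]


# An evenly spread vector yields evenly spread breaks.
even_breaks = percentile_breaks(list(range(101)))

# A vector dominated by one value yields many identical breaks.
skewed_breaks = percentile_breaks([0] * 80 + list(range(1, 21)))
```

For the evenly spread vector the breaks land at 0, 10, 20 and so on up to 100. For the vector with 80 zeros followed by the values 1 to 20, the first nine breaks are all 0, so most of the bins collapse onto the same value.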

cotedivoir commented 4 years ago

I don't know what's going on. I took the pbmc3k dataset and ran both the standard Seurat workflow and the SCTransform workflow. With the standard workflow everything is binned normally; with SCT I get the same issue. I can upload both examples somewhere so you can take a look.

maximilianh commented 4 years ago

Yes, can you share the Seurat object? You can either upload it to a web server somewhere or send it to me at max@soe.ucsc.edu. Thanks!

maximilianh commented 4 years ago

Can you also tell me how you imported the dataset to the cell browser? Did you use cbImportSeurat?

cotedivoir commented 4 years ago

Hi! Thank you so much! I sent the Seurat object to you. I use ExportToCellBrowser in R to prepare the files.

maximilianh commented 4 years ago

This may have to do with the way you export. I remember changing something for sctransform objects half a year ago... I'll check whether using cbImportSeurat on your file fixes the problem.

We also started working today on a Seurat pull request for the next Seurat release to update exportToCellbrowser.

maximilianh commented 4 years ago

So I ran cbImportSeurat on your pbmc data file and got this result for ACTB. It looks reasonable to me...

image

maximilianh commented 4 years ago

Here are the most common values of the ACTB expression vector. As you can see, there is no perfect way to bin them, because the distribution is so heavily skewed toward a few values.

$ zgrep ACTB exprMatrix.tsv.gz | tr '\t' '\n' | grep -v ACTB | sort | uniq -c | sort -rn | head

count value
5119 0
287 1
168 8
160 10
146 13
144 9
140 7
128 11
124 5
121 6
119 14
109 12
107 15
105 4
82 18
80 17
79 16
78 3
72 19
68 20
66 2
49 23

cotedivoir commented 4 years ago

Sorry, it took time to get back to it. I think I understand what you mean and can see where the issue is.

Here I attach frequency tables for the data slot for ACTB in the same pbmc3k dataset: one for the RNA assay, the other for the SCT assay. I did not look into how exactly the SCTransform function works, but apparently it collapses the values so that there are fewer distinct values (each with a higher frequency). And you split them into bins based on values, not frequencies.

Is there any way to take the number of cells in each bin into account, so that they are more or less equally distributed?

ACTB_SCT.tsv.txt https://github.com/maximilianh/cellBrowser/files/4698686/ACTB_SCT.tsv.txt ACTB_RNA.tsv.txt https://github.com/maximilianh/cellBrowser/files/4698687/ACTB_RNA.tsv.txt

maximilianh commented 4 years ago

I do consider the number of values. But if there are only very few distinct values, I cannot do a better job. Imagine you only have the values 0, 10 and 50, where 10 appears 30 times and 0 and 50 each appear three times. There is no good way to bin such a list. It's similar here: too many identical values throw off any binning.
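To make the 0, 10, 50 example concrete, here is a small Python sketch. It is an illustration under assumptions, not the Cell Browser implementation: the helper `quantile_bin_counts`, the nearest-rank percentile rule, and the choice to label each bin by its lower edge are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def quantile_bin_counts(values, n_bins):
    """Bin values at evenly spaced percentiles and count cells per bin.

    Hypothetical helper for illustration; not part of cellBrowser.
    Returns (edges, counts), where each bin is labelled by its lower edge.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    breaks = [ordered[(k * last) // n_bins] for k in range(n_bins + 1)]
    # Identical break points describe empty bins, so keep each one once.
    edges = sorted(set(breaks))
    # bisect_right(...) - 1 is the index of the largest edge <= v.
    counts = Counter(edges[bisect_right(edges, v) - 1] for v in ordered)
    return edges, [counts[e] for e in edges]


# 10 appears 30 times, 0 and 50 appear three times each.
values = [0] * 3 + [10] * 30 + [50] * 3
edges_3, counts_3 = quantile_bin_counts(values, n_bins=3)
edges_10, counts_10 = quantile_bin_counts(values, n_bins=10)
```

Whether 3 or 10 bins are requested, the sketch ends up with the same three edges (0, 10, 50) holding 3, 30 and 3 cells. All 30 cells with the value 10 have to share one bin, so no choice of percentile breaks can even out the counts.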

Or do you see a better way to bin here?

cotedivoir commented 4 years ago

I see. Using the scale slot should be a workaround. I will think about binning too. Thank you for helping!

mxposed commented 3 years ago

I ran into this and I think I have a solution (in the linked pull request).

Here's how the ACTB expression looks in my (SCTransformed) dataset: [image: Screen Shot 2020-12-19 at 20 24 18]

SCTransform produces the same expression values for low-expressing cells, but different expression values in the upper range.

matthewspeir commented 2 years ago

I think we can close this based on the fact that @mxposed's change was merged in and has been out there for ~1.5 years.