Question about Reproducing a Plot in the Paper Introducing the SCTransform (Hafemeister, C., & Satija, R. #1999) #136

mcap91 commented 2 years ago

I am copying the issue that sainadfensi submitted to the Seurat repo on Aug 19, 2019, regarding the x and y axes in Figures 1C, D and Fig 3A from Hafemeister, C., & Satija, R.

I cannot find an answer or any information on this topic, and I would also like to reproduce these figures.

Thank you

Dear Team,

I am learning single-cell RNA-seq analysis with the Seurat package, and I appreciate how comprehensive it is, with interpretable visualizations and easy operation. Thank you for building this package with such care and precision. I also value the newly released Seurat v3 and its novel statistical approach to normalization, sctransform.

The paper[1] on sctransform uses several statistical analyses and many plots to demonstrate the method's effectiveness. I am particularly interested in the figures that show the relationship between gene expression and cell sequencing depth (Fig 1C, D and Fig 3A in the paper). However, when I try to reproduce these plots myself, the results are different, so I must be misunderstanding some of the plotting steps. Could you explain in detail how to plot the trends in the data before and after normalization (Methods 4.4 in the paper)?

The method described in the paper is:

> 4.4 Trends in the data before and after normalization [1] We grouped genes into six bins based on log10-transformed mean UMI count, using bins of equal width. To show the overall trends in the data, for every gene we fit the expression (UMI counts, scaled log-normalized expression, scaled Pearson residuals) as a function of log10-transformed mean UMI count using kernel regression (ksmooth function) with normal kernel and large bandwidth (20 times the size suggested by R function bw.SJ). For visualization, we only used the central 90% of cells based on total UMI. For every gene group, we show the expression range after smoothing from the first to third quartile at 200 equidistant cell UMI values

Figure 1. Figures that show the relationship between gene expression and the cell sequencing depth (Fig 1C, D and Fig 3A in the paper[1]).
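The quoted Methods 4.4 procedure can be sketched in base R roughly as follows. This is only one reading of the Methods text, not the authors' code. It assumes that x is the log10 total UMI count per cell computed from the raw counts (suggested by the phrase "200 equidistant cell UMI values"), that the gene bins are computed once from the raw counts and reused for every expression type, and that "central 90%" means trimming 5% of cells from each tail. The counts are simulated, and all object names (`umi`, `cell_depth`, `smooth_gene`, `band_per_bin`) are made up for illustration. Pearson residuals from sctransform are not computed here; a residual matrix with genes in rows could be passed through the same function.

```r
# Sketch of the Methods 4.4 steps on simulated counts (base R, stats only).
# Every name below is illustrative; none comes from sctransform or Seurat.
set.seed(1)
n_genes <- 60
n_cells <- 400

# Simulated UMI matrix (genes x cells): Poisson counts whose mean is the
# product of a per-gene abundance and a per-cell depth factor.
depth_factor <- exp(rnorm(n_cells, sd = 0.4))
gene_abundance <- 10^runif(n_genes, min = -1.5, max = 1)
umi <- matrix(rpois(n_genes * n_cells, lambda = outer(gene_abundance, depth_factor)),
              nrow = n_genes)
umi <- umi[rowSums(umi) > 0, , drop = FALSE]   # drop undetected genes
stopifnot(all(colSums(umi) > 0))               # log10 needs non-empty cells

# Step 1: six equal-width bins of log10 gene mean (ASSUMPTION: computed once,
# on raw counts, and reused for all expression types).
gene_bin <- cut(log10(rowMeans(umi)), breaks = 6)

# Step 2: x = log10 total UMI per cell (ASSUMPTION, see lead-in). Display
# only the central 90% of cells, on a grid of 200 equidistant values.
cell_depth <- log10(colSums(umi))
keep <- cell_depth >= quantile(cell_depth, 0.05) &
  cell_depth <= quantile(cell_depth, 0.95)
x_grid <- seq(min(cell_depth[keep]), max(cell_depth[keep]), length.out = 200)

# Bandwidth: 20 times the Sheather-Jones suggestion, as in the quoted text.
bw <- 20 * bw.SJ(cell_depth)

# Step 3: normal-kernel regression of one gene's expression on cell depth,
# evaluated at the grid points.
smooth_gene <- function(y) {
  ksmooth(cell_depth, y, kernel = "normal", bandwidth = bw, x.points = x_grid)$y
}

# z-score each gene (row) across cells, i.e. "scaled" expression.
scale_rows <- function(m) t(scale(t(m)))

# Simple log-normalization: counts per 10,000, then log1p.
lognorm <- log1p(sweep(umi, 2, colSums(umi), "/") * 1e4)

# Step 4: within each gene bin, take the first and third quartile of the
# smoothed curves at every grid point; these bound the shaded region.
band_per_bin <- function(expr) {
  smoothed <- t(apply(expr, 1, smooth_gene))   # genes x 200
  lapply(split(seq_len(nrow(expr)), gene_bin, drop = TRUE), function(idx) {
    q <- apply(smoothed[idx, , drop = FALSE], 2, quantile, probs = c(0.25, 0.75))
    data.frame(x = x_grid, lower = q[1, ], upper = q[2, ])
  })
}

bands_raw <- band_per_bin(umi)
bands_lognorm <- band_per_bin(scale_rows(lognorm))
```

Each element of `bands_raw` is a data frame with columns `x`, `lower`, and `upper` for one gene bin, so the shaded region could be drawn directly from the quartiles, for example with `ggplot2::geom_ribbon`.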

Questions:

If I understand correctly, genes are grouped into six bins based on log10-transformed mean UMI count. Do we need to regroup the genes after log-normalization or sctransform?

Are the x values for the kernel regression the log10-transformed total UMI count of each cell, in other words, x = log10(PBMC$nCount_RNA)?

When doing kernel regression on different types of data, such as log-normalized data and Pearson residuals, is the same x used throughout, namely the original total UMI count of each cell? Or is x the new total count based on the newly calculated data, for example PBMC$nCount_SCT for Pearson residuals?

For plotting, should I use geom_smooth from ggplot2 to draw the colored region, or simply use the quartile values to mark the boundaries of the region and fill it with a chosen color?

Will the shape (trend) of the curve change after z-scoring?

In addition, I noticed that a small number of features are removed after sctransform, and I cannot find an explanation online. Does sctransform apply any automatic filtering?

ChristophH commented 2 years ago

Hi Michael,

Re: Fig1, please have a look at my comment at https://github.com/satijalab/sctransform/issues/34#issuecomment-525300128 - even if you are not familiar with R code, the comments might already help. If there are still specific questions that you have, I'll be happy to answer them.

mcap91 commented 1 year ago

Thank you
