Seurat SCTransform corrected log-normalized counts are not scaled appropriately

hms-dbmi / dseqr

single-cell and bulk RNA-seq analyses from counts → pathways → drug candidates.

https://docs.dseqr.com

Seurat SCTransform corrected log-normalized counts are not scaled appropriately #76

Closed alexvpickering closed 5 years ago

alexvpickering commented 5 years ago

This prevents appropriate cross-sample differential expression analyses. For example, in the healthy SJIA lung sample, background HBB expression is very high:

In contrast, the diseased sample has much lower background HBB expression:

As a result, comparing any cell populations between healthy and diseased samples results in HBB being significant.

I am going to see if using the @scale.data slot (which is scaled) for differential expression fixes this - described here (requires devel release of Seurat). This is also the optimal approach as described by the authors of Seurat:

You can use the corrected log-normalized counts for differential expression and integration. However, in principle, it would be most optimal to perform these calculations directly on the residuals (stored in the scale.data slot) themselves. This is not currently supported in Seurat v3, but will be soon.

alexvpickering commented 5 years ago

I'm not sure what to do about this. This is definitely an example of a batch effect that is perfectly confounded with the experimental condition.

I played around a bit with centering each sample separately (for each gene, subtracting the median count for the sample), which worked quite well. Instead of getting nonsense results (e.g. several haemoglobin genes being the best at distinguishing diseased from healthy macrophages in SJIA), I got reasonable results (S100A8, S100A9, and S100A12 ranking at 3, 6, and 9 respectively).

I'm a bit uncomfortable doing the above just because it doesn't seem to be standard to do before differential gene expression analysis across samples.
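For concreteness, the per-sample centering described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `center_within_samples`, the genes x cells layout, and the toy values are hypothetical and are not taken from dseqr or Seurat.

```python
import numpy as np

def center_within_samples(expr, sample_ids):
    """Subtract each gene's per-sample median from a genes x cells matrix."""
    expr = np.asarray(expr, dtype=float)
    sample_ids = np.asarray(sample_ids)
    centered = np.empty_like(expr)
    for sample in np.unique(sample_ids):
        cols = sample_ids == sample
        # Median of each gene (row) across the cells of this sample only.
        gene_medians = np.median(expr[:, cols], axis=1, keepdims=True)
        centered[:, cols] = expr[:, cols] - gene_medians
    return centered

# Toy data (hypothetical): 2 genes x 6 cells, three cells per sample.
expr = np.array([
    [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],  # gene with a high background in "healthy"
    [2.0, 2.0, 4.0, 2.0, 3.0, 4.0],
])
sample_ids = np.array(["healthy"] * 3 + ["diseased"] * 3)
centered = center_within_samples(expr, sample_ids)
print(centered)
```

In the toy data the first gene has the same centered values in both samples, so the sample-level offset no longer drives a cross-sample comparison. Note that for genes detected in fewer than half of a sample's cells the median is zero, so this step leaves them unchanged.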

alexvpickering commented 5 years ago

I have a hypothesis as to the cause of the above:

RBCs are the dominant cell type in the healthy sample:

They are probably thus also the most commonly lysed cell in the healthy sample - spewing their RNAs into the media and leading to high background haemoglobin gene expression in the healthy sample.

For the diseased sample, there is very high immunoglobulin expression that dominates all samples. This doesn't quite match, as B-cells are not the most common cell type in the diseased sample (but possibly they are the most commonly lysed, or immunoglobulins are just very highly expressed).

ikohane commented 5 years ago

Makes sense to me. Have seen this a lot in bulk tissue RNA samples.

In reply to Alex Pickering's comment of Jun 14, 2019, 2:02 PM.

alexvpickering commented 5 years ago

Makes sense to me. Have seen this a lot in bulk tissue RNA samples.

Good to hear that this may not just be a silly mistake on my part. Do you have any experience with what is generally done to mitigate this source of bias? That might help me find the equivalent literature for scRNA-seq analyses.

Zero centering each gene within each sample seems reasonable but feels too simple ... it would at least be nice to see that other people are doing it as well.

alexvpickering commented 5 years ago

Found something! SoupX: https://www.biorxiv.org/content/biorxiv/early/2018/04/20/303727.full.pdf

ikohane commented 5 years ago

I was looking but had found nothing. This looks great. Your search skills (among others) are impressive. -Zak

In reply to Alex Pickering's comment of Jun 14, 2019, 4:54 PM.

alexvpickering commented 5 years ago

Here is another approach that has been used:

identify ambient genes from droplets with ~10 UMIs
exclude ambient genes from differential expression tests unless they are marker genes for that cluster
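
The two steps above could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the function names, the `max_umis` and `min_fraction` cutoffs, and the toy count matrix are hypothetical, and the rule for calling a gene "ambient" (its share of all counts in near-empty droplets) is just one plausible choice, not necessarily what the referenced approach does.

```python
import numpy as np

def find_ambient_genes(counts, gene_names, max_umis=10, min_fraction=0.2):
    """Flag genes that dominate near-empty droplets (genes x droplets counts)."""
    counts = np.asarray(counts)
    umis_per_droplet = counts.sum(axis=0)
    # Assumption: droplets with at most `max_umis` UMIs contain only ambient RNA.
    near_empty = (umis_per_droplet > 0) & (umis_per_droplet <= max_umis)
    ambient_counts = counts[:, near_empty].sum(axis=1)
    total = ambient_counts.sum()
    if total == 0:
        return set()
    fraction = ambient_counts / total
    # Assumption: a gene is "ambient" if it holds at least `min_fraction`
    # of all counts seen in the near-empty droplets.
    return {g for g, f in zip(gene_names, fraction) if f >= min_fraction}

def genes_to_test(gene_names, ambient_genes, cluster_markers):
    """Drop ambient genes unless they are markers of the cluster being tested."""
    return [g for g in gene_names
            if g not in ambient_genes or g in cluster_markers]

# Toy data (hypothetical): 4 genes x 6 droplets; the last three are near-empty.
gene_names = ["HBB", "IGKC", "S100A8", "CD3E"]
counts = np.array([
    [50,  2,  1, 6, 5, 7],   # HBB
    [ 1, 40,  2, 2, 3, 1],   # IGKC
    [30,  5, 60, 1, 0, 1],   # S100A8
    [ 0,  3, 20, 0, 1, 0],   # CD3E
])
ambient = find_ambient_genes(counts, gene_names)
tested = genes_to_test(gene_names, ambient, cluster_markers={"IGKC"})
print(sorted(ambient), tested)
```

In this toy example HBB and IGKC are flagged as ambient, but IGKC stays in the test set because it is listed as a marker for the (hypothetical) cluster under comparison.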