Closed (alexvpickering closed this issue 5 years ago)
I'm not sure what to do about this. This is definitely an example of a batch effect that is perfectly confounded with the experimental condition.
I played around a bit with centering each sample separately (for each gene, subtracting that gene's median count within the sample), which worked quite well. Instead of getting nonsense results (e.g. several haemoglobin genes best distinguishing diseased and healthy macrophages in SJIA), I got reasonable results (S100A8, S100A9, and S100A12 ranking 3rd, 6th, and 9th respectively).
I'm a bit uncomfortable doing the above because it doesn't seem to be a standard step before differential gene expression analysis across samples.
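For concreteness, here is a minimal sketch of the centering described above, assuming "centering" means subtracting each gene's within-sample median across cells. The function name, the data layout, and the toy values are made up for illustration; this is not the code used in the analysis.

```python
from statistics import median

def center_within_sample(expr, sample_of_cell):
    """Subtract, for each gene, its median within each sample.

    expr: dict mapping gene -> list of expression values, one per cell.
    sample_of_cell: list of sample labels, one per cell (same order).
    """
    centered = {}
    for gene, values in expr.items():
        # Group this gene's values by the sample each cell came from.
        by_sample = {}
        for value, sample in zip(values, sample_of_cell):
            by_sample.setdefault(sample, []).append(value)
        medians = {s: median(v) for s, v in by_sample.items()}
        # Shift every cell by the median of its own sample.
        centered[gene] = [v - medians[s] for v, s in zip(values, sample_of_cell)]
    return centered

# Hypothetical toy values, not the SJIA data.
samples = ["healthy", "healthy", "healthy", "diseased", "diseased", "diseased"]
expr = {
    "HBB": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],
    "S100A8": [0.5, 0.4, 0.6, 2.0, 3.0, 2.5],
}
centered = center_within_sample(expr, samples)
```

Note that this removes any shift shared by most cells in a sample, whether it is background or real biology, which may be part of why it feels non-standard.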
I have a hypothesis as to the cause of the above:
RBCs are the dominant cell type in the healthy sample:
They are thus probably also the most commonly lysed cell type in the healthy sample, spewing their RNA into the media and leading to high background haemoglobin gene expression in the healthy sample.
For the diseased sample, there is very high immunoglobulin expression that dominates all samples. This doesn't quite match, as B-cells are not the most common cell type in the diseased sample (but possibly they are the most commonly lysed, or immunoglobulins are just very highly expressed).
Makes sense to me. I have seen this a lot in bulk tissue RNA samples.
Good to hear that this may not just be a silly mistake on my part. Do you have any experience with what is generally done to mitigate this source of bias? It might help me find the equivalent literature for scRNA-seq analyses.
Zero-centering each gene within each sample seems reasonable but feels too simple ... it would at least be nice to see that other people are doing it as well.
Found something! SoupX: https://www.biorxiv.org/content/biorxiv/early/2018/04/20/303727.full.pdf
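As a rough sketch of the general idea behind this kind of background correction (not SoupX's actual API or exact algorithm): estimate a background expression profile from empty droplets, then subtract the counts expected from that background in each cell. The function names, the fixed `contamination` fraction, and the toy counts below are all assumptions for illustration.

```python
def ambient_profile(empty_droplets):
    """Fraction of background counts attributable to each gene,
    pooled over droplets assumed to contain no cell."""
    totals = {}
    for droplet in empty_droplets:
        for gene, count in droplet.items():
            totals[gene] = totals.get(gene, 0) + count
    grand_total = sum(totals.values())
    return {gene: count / grand_total for gene, count in totals.items()}

def remove_ambient(cell_counts, profile, contamination):
    """Subtract the counts expected from background, floored at zero.

    contamination: assumed fraction of the cell's counts that come
    from background RNA (here supplied by hand, not estimated).
    """
    n_counts = sum(cell_counts.values())
    corrected = {}
    for gene, observed in cell_counts.items():
        expected_background = contamination * n_counts * profile.get(gene, 0.0)
        corrected[gene] = max(observed - expected_background, 0.0)
    return corrected

# Hypothetical toy counts: the background is dominated by HBB.
empty = [{"HBB": 8, "CD68": 1, "ACTB": 1}, {"HBB": 8, "CD68": 0, "ACTB": 2}]
profile = ambient_profile(empty)
macrophage = {"HBB": 10, "CD68": 40, "ACTB": 50}
corrected = remove_ambient(macrophage, profile, contamination=0.1)
```

In this toy case most of the macrophage's HBB counts are attributed to background, while CD68 and ACTB barely change. In practice the contamination fraction would have to be estimated rather than set by hand.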
I was looking but had found nothing. This looks great. Your search skills (among others) are impressive. -Zak
Here is another approach that has been used:
This prevents appropriate cross-sample differential expression analyses. For example, in the healthy SJIA lung sample, background HBB expression is very high:
In contrast, the diseased sample has much lower background HBB expression:
As a result, comparing any cell population between the healthy and diseased samples results in HBB coming out as significantly differentially expressed.
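A toy illustration of this effect, with made-up numbers rather than the SJIA data: if a cell population does not express HBB at all but each sample adds its own background level, the apparent difference between samples is just the difference in background.

```python
from statistics import mean

def mean_difference(a, b):
    """Difference in mean expression between two groups of cells."""
    return mean(a) - mean(b)

# Hypothetical: the population has no true HBB expression in either sample.
true_hbb = [0.0, 0.0, 0.0, 0.0]

# Assumed sample-level background levels (illustrative values only).
healthy_background = 4.0
diseased_background = 0.5

healthy_observed = [x + healthy_background for x in true_hbb]
diseased_observed = [x + diseased_background for x in true_hbb]

apparent_difference = mean_difference(healthy_observed, diseased_observed)
real_difference = mean_difference(true_hbb, true_hbb)
```

Here `apparent_difference` equals the gap between the two background levels even though `real_difference` is zero, and because the background is tied to the sample, it cannot be separated from the condition by the comparison itself.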
I am going to see if using the @scale.data slot (which is scaled) for differential expression fixes this - described here (requires the devel release of Seurat). This is also the optimal approach as described by the authors of Seurat: