lhe17 / nebula

GNU General Public License v2.0

Filtering lowly expressed genes #42

Open ayyildizd opened 7 months ago

ayyildizd commented 7 months ago

First of all, thank you for this nice tool. I ran nebula for differential expression analysis between two groups and realised that my top significant genes (ranked by logFC) are mostly driven by a few outlier cells (see the violin plots below). I think lowly expressed genes should be filtered out as in bulk RNA-seq methods, for example with edgeR's filterByExpr function, which filters genes based on a minimum count required in at least some samples and a minimum total count. Similarly, it would be useful to filter out genes that do not reach certain thresholds per sample, and maybe per group, in order to control false-positive DEGs. What would you suggest?

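[image: violin plots, https://github.com/lhe17/nebula/assets/120032067/8de6a5df-9d95-49b9-95c8-bbef53b532b5]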

In addition, a paper that uses nebula filters those genes out of the differential expression results afterwards (i.e., only genes expressed in at least 5% of cells of the compared groups were used for downstream analyses). I am not sure whether keeping those lowly expressed genes in during the analysis would negatively affect the statistical calculations made within nebula. Do you suggest filtering genes before the analysis (as in bulk RNA-seq methods), or is it fine to filter them after running the DE analysis with nebula?
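To make the post-hoc criterion concrete, here is a minimal sketch of a "expressed in at least 5% of cells" filter. It is not taken from nebula or from the paper: the function names, the dict-of-lists input, and the choice to require the threshold in every group (rather than in any group) are all assumptions for illustration.

```python
def fraction_expressing(counts, cell_indices):
    """Fraction of the given cells with a positive count for one gene."""
    return sum(1 for i in cell_indices if counts[i] > 0) / len(cell_indices)


def filter_by_group_fraction(count_matrix, groups, min_frac=0.05):
    """Keep genes expressed in at least `min_frac` of cells in EVERY group.

    count_matrix: dict mapping gene name -> list of per-cell counts (hypothetical layout)
    groups:       list of group labels, one per cell, same order as the counts
    """
    cells_by_group = {}
    for idx, label in enumerate(groups):
        cells_by_group.setdefault(label, []).append(idx)

    kept = []
    for gene, counts in count_matrix.items():
        if all(fraction_expressing(counts, cells) >= min_frac
               for cells in cells_by_group.values()):
            kept.append(gene)
    return kept


# Toy data: 20 cells per group.
groups = ["A"] * 20 + ["B"] * 20
toy_counts = {
    # One outlier cell in group A, nothing in group B.
    "geneX": [40] + [0] * 19 + [0] * 20,
    # Modest expression in both groups.
    "geneY": [1, 2] + [0] * 18 + [1, 1, 3] + [0] * 17,
}
kept_genes = filter_by_group_fraction(toy_counts, groups)
```

In this toy example geneX is dropped because no cell in group B expresses it, while geneY passes in both groups. The same function could be applied either to the count matrix before model fitting or to the list of tested genes afterwards.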

lhe17 commented 7 months ago

Hi ayyildizd,

Thank you for your question.

In the current version of nebula, two filtering criteria are used to remove lowly expressed genes. One is set through the argument "cpc" (counts per cell, defined as the total number of counts divided by the total number of cells); its default value is 0.5% (i.e., 0.005 counts per cell). The other is set through the argument "mincp" (the number of cells with a positive count); its default value is 5. These defaults are the minimum values I suggest. The optimal values depend on the data set.
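As a rough illustration of what these two criteria measure, the sketch below recomputes them for a single gene in plain Python. It is not nebula's code: the function name is hypothetical, and whether the thresholds are applied as strict or non-strict inequalities is an assumption.

```python
def passes_expression_filter(gene_counts, cpc=0.005, mincp=5):
    """Check one gene against a counts-per-cell and a positive-cell threshold.

    gene_counts: list of raw counts for this gene, one entry per cell
    cpc:         minimum (total counts / total cells)
    mincp:       minimum number of cells with a positive count
    Boundary handling (>=) is an assumption, not taken from nebula.
    """
    n_cells = len(gene_counts)
    counts_per_cell = sum(gene_counts) / n_cells
    n_positive = sum(1 for c in gene_counts if c > 0)
    return counts_per_cell >= cpc and n_positive >= mincp


n_cells = 1000
# High counts, but in only 3 cells: passes cpc, fails mincp.
outlier_gene = [50, 50, 50] + [0] * (n_cells - 3)
# 4 single counts in 1000 cells: fails both criteria.
sparse_gene = [1] * 4 + [0] * (n_cells - 4)
# Low but widespread expression: passes both criteria.
broad_gene = [1] * 100 + [0] * (n_cells - 100)

results = {
    "outlier_gene": passes_expression_filter(outlier_gene),
    "sparse_gene": passes_expression_filter(sparse_gene),
    "broad_gene": passes_expression_filter(broad_gene),
}
```

The outlier-driven gene has a high counts-per-cell value (0.15) yet is expressed in only three cells, so it is the positive-cell threshold that removes it. Raising that threshold above the default is one way to target genes driven by a handful of cells.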

Best regards, Liang
