kdkorthauer / scDD

R package to identify genes with differential distributions in single-cell RNA-seq
32 stars 15 forks source link

classify DD genes #12

Closed ahy1221 closed 5 years ago

ahy1221 commented 6 years ago

Hi, I am very impressed by the idea that to categorized traditional DD genes for single cell analysis.

Since scDD is quite computational intensively, I am wondering that is that possible to perform classification for DD genes from the output of other DE tools (limma, Mast etc.) using some scDD functions?

I noticed that the paper Table 2 having different DE category from MAST and SCDE. It looks like there should be some function do such classification just based on output of traditional DE tools.

I tried my best to understand the method to perform classification from the method section of your paper and the source code. I find the function which might be associated with what I want to do , partition estimation mclustRestriced() and classification classifyDD(). However, I still do not know how to use them.

I would be appreciated if you can give me any advice. Thank you very much.

kdkorthauer commented 6 years ago

Hi @ahy1221,

Thanks for your interest in scDD. That's a great question. It should be possible to add an exported function to classify the genes from output of other tools into patterns. I will look into doing so if you would find that useful. However, I want to point out that if you set permutations = 0 in the scDD function, the method will use a fast approximation to the full permutation testing framework. This should be much faster (although not as fast as tools that don't explicitly consider multi-modal expression distributions).

Best, Keegan

ChelseaCHENX commented 5 years ago

Hi! I have been looking long and hard how to model gene expression heterogeneity within cell groups - until I found your publications, which is great.

But concerning my practical needs, I actually have the same question with ahy1221- even more general:

Given only one condition, I want to know the modality (number of clusters/components, k) of a certain gene, the mean & variances of each components and the corresponding p value which tells me how reliable the expression of this gene could be modeled with k components mixture.

Could any functions in the package be modified to do this task?

kdkorthauer commented 5 years ago

Hi @ChelseaCHENX,

Given only one condition, if you'd like to cluster the expressed cells (with nonzero expression for that gene), you can use the scDD:::mclustRestricted function. Here is an example to cluster a single gene (gene number 400) in one condition (condition 1) of the scDatEx example data in the scDD package:

suppressMessages(library(scDD))
suppressMessages(library(SingleCellExperiment))

# load example data
data(scDatEx)

# get gene 400 for condition 1
y <- normcounts(scDatEx)[400, colData(scDatEx)$condition == 1]

# cluster nonzero cells
out <- scDD:::mclustRestricted(log(y[y>0]), min.size = 4)
out
#> $class
#> C1.073 C1.074 C1.076 C1.077 C1.078 C1.079 C1.080 C1.082 C1.083 C1.084 
#>      2      1      1      1      2      2      1      1      1      2 
#> C1.086 C1.087 C1.088 C1.089 C1.090 C1.091 C1.092 C1.093 C1.094 C1.096 
#>      1      2      1      2      2      2      1      2      1      2 
#> C1.097 C1.098 C1.099 C1.100 C1.101 C1.102 C1.103 C1.104 C1.105 C1.106 
#>      1      2      1      1      1      2      1      2      2      1 
#> C1.107 C1.108 C1.109 C1.110 C1.111 C1.112 C1.113 C1.114 C1.115 C1.116 
#>      2      2      1      2      1      2      2      1      2      1 
#> C1.117 C1.118 C1.120 C1.121 C1.122 C1.123 C1.124 C1.125 C1.126 C1.127 
#>      2      2      2      2      1      1      2      1      1      2 
#> C1.128 C1.129 C1.130 C1.132 C1.133 C1.134 C1.135 C1.136 C1.137 C1.138 
#>      2      2      2      2      1      1      2      2      2      2 
#> C1.139 C1.140 C1.141 C1.142 C1.144 C1.145 C1.146 C1.147 C1.148 C1.149 
#>      1      1      2      1      1      2      1      1      1      1 
#> C1.150 C1.151 C1.153 C1.154 
#>      1      2      1      2 
#> 
#> $mean
#>        1        2 
#> 2.264872 4.958313 
#> 
#> $var
#> [1] 1.5567037 0.3483346
#> 
#> $df
#> [1] 5
#> 
#> $n
#> [1] 74
#> 
#> $bic
#> [1] -287.2596
#> 
#> $loglik
#> [1] -132.8696
#> 
#> $model
#> [1] "V"

Created on 2019-06-12 by the reprex package (v0.3.0)

This particular gene has two components. The results object lists each cell's cluster membership, cluster means, cluster variances, the sample size, and the BIC/log likelihood of the normal mixture model. For more details, refer to the man page ?scDD:::mclustRestricted.

Please let me know if you have any questions.

Best, Keegan

ChelseaCHENX commented 5 years ago

Thank you very much Keegan! That is very helpful!

I have two small questions here:

Really appreciate your help! Chelsea

kdkorthauer commented 5 years ago

Hi Chelsea,

Yes, you are correct that mclust selects the number of components with the lowest BIC. However, the number of components selected by this function may be different, as I also require additional evidence of bimodality (since the number of components is not exactly the number of modes and in this application I am focused on the number of modes - e.g. if the means are close enough together a 2-component model will not be bimodal).

This leads into your second question, and the answer is that unfortunately there is no definitive answer. This function implements an ad-hoc 'restricted' clustering that looks at the bimodal index in addition to BIC - you can read more details in the scDD paper and supplement (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1077-y).

Best, Keegan

erikadudki commented 4 years ago

Hi, thanks for these posts, this helped a lot as I had the same question. As far as I understood, mclustRestricted gives results of the best optimized BIC-value, would it also be possible to get results of the best 5 or 10 BIC-results? What if there exist multiple nearly similar BIC results, but with different number of modes? Thanks and Best, Erika

kdkorthauer commented 4 years ago

You are correct that the best optimized BIC-value is used, allowing up to 5 components, but with the added penalty that checks that additional components are well-separated.

We haven't found much utility looking beyond 5 mixture components - this would mean more than 5 distinct modes of gene expression in a given gene.

Best, Keegan

erikadudki commented 4 years ago

Thanks for the very fast reply! That makes sense with the 5 mixture components, I also agree that is enough. What I meant was, is it possible to view the results of the 2nd best BIC value? And the 3rd best BIC value? For example, for one gene, the best BIC result gives me a classification of two modes, maybe there exist a 2nd best BIC value that is very similar but sligthly worse to the first BIC value but predicts only one mode. Is something like this possible? In that way, looking at the different results of the BIC values, I could infer how certain the first BIC classification is. Does that make sense?

Best, Erika

kdkorthauer commented 4 years ago

Got it -- thanks for the clarification. This is something I gave a lot of thought to, and is the reason for the bimodality index penalty. The goal is that when moving from K to K+1 modes, even if the BIC at K+1 modes is slightly better, we will only settle on K+1 if the bimodality index is also high enough (i.e. if the component means are well-separated enough). Otherwise, stay with the smaller model and second best BIC choice (K modes). However, the way this is done is admittedly ad-hoc and there's certainly room for improvement. What I have implemented was aimed at being (1) a bit on the conservative side for picking higher number of modes for only small improvements in BIC and (2) automatic, to avoid the necessity of examining BIC values for each gene.

As currently written, the alternate BIC values are not saved with the results.