overdispersion - Githubissues

koenvandenberge commented 6 years ago

Hi,

PROSSTT looks like a great tool. I am currently using it to evaluate trajectory-based analysis tools. One of the things I am looking into is on the estimation of the variability across the lineage. In the PROSSTT paper, you mention that

$Var(Y_g) = \alpha_g (s_n \mu_g)^2 + \beta_g (s_n \mu_g) $

where $Y_g$ are counts for gene $g$.

In my simulation with PROSSTT in Python, I set alpha4 = np.exp(random.normal(loc=np.log(0.2), scale=np.log(1.5), size=t.G)) beta4 = np.repeat(1,400)

to conform to the variance of a negative binomial : mu + phi * mu^2, where phi is the overdispersion parameter. I hence simulate using X4, labs4, brns4, scalings4 = sim.sample_whole_tree(t, 1, alpha=alpha4, beta=beta4)

and fit trajectories based on X4. Am I correct to assume that, in this setting, alpha4 can be considered to be the true dispersion parameter? Or am I missing something? I am asking this because the true overdispersion as defined above does not seem to be very correlated with the estimates from other tools.

galicae commented 6 years ago

Hey Koen,

thanks for giving PROSSTT a shot! People using it to look at trajectory-based tools is one of the intended use cases, so that makes me really happy :grinning:

In the formulation that you are using here alpha would be the overdispersion parameter, equivalent to your phi. I am not sure what tools you are comparing with, and what your analysis is like. You are treating the Poisson (beta = 1) noise as "technical" and so whatever comes from the alpha is the "true" dispersion?

I will take a look at this nonetheless, would be great if you can give me more details. Thanks again for your interest!

koenvandenberge commented 6 years ago

Many thanks for the swift response.

As an example, I was doing some quality control of the simulated data based on the dispersion using edgeR R package. If you are not familiar, edgeR is one of the most popular and state-of-the-art tools for differential expression analysis for bulk RNA-seq data.

I simulate a simple bifurcating trajectory, and you can find a plot of the estimated trajectories (with slingshot) here: https://www.dropbox.com/sh/0mehat1kxdk4ous/AAAGDoe8lDEc9Dpvjt11MqBIa?dl=0 Next, I fit an edgeR model where the design is ~ cluster , i.e. my design matrix represents the three different clusters found in the reduced PC space. Given the simpler trajectory, I am assuming that this should be sufficient to more or less capture the majority of systematic variability. I then compare the edgeR estimated dispersions with the ones I used as input to PROSSTT, which is the right hand plot in the Dropbox file.

galicae commented 6 years ago

hmmm this is interesting. There are a couple of things that I would try here:

you can run the simulation without scalings (so with equisized cells) to see if the effect comes from there. I assume you normalized before the analysis, but it might be worth a shot.
more importantly, I would run the simulation with more cells, to see if the effect persists.

and even more importantly: as far as I can tell, edgeR is meant to analyze static populations. The examples in their user guide (chapter 4, case studies), are all for biological systems where there is no differentiation. In these simulations you cannot assume that the average gene expression is the same within each cluster, and I think this confuses the variance analysis. It would be interesting to see what happens if you bin cells according to average expression and then perform the same analysis, and I suspect it would look much better.

cheers, Niko

edit: also, while I am not intimately familiar with edgeR, it seems to me like a tool intended for bulk would not be particularly suited to single-cell data. I guess you could calculate averages from the cells of each branch and given enough cells even define replicates, but it seems to me like a bulk workflow is not particularly helpful in this context.

koenvandenberge commented 6 years ago

Dear Niko, Thanks for the suggestions. Running the simulation without scaling indeed improves the estimation considerably; I have added an extra plot in the Dropbox.

Considering your comment on edgeR; I think that it is reasonable to use it for this evaluation. I agree that the optimal model would allow for continuous transitions in gene expression using e.g. smooth functions, but given that the trajectory under investigation is pretty simple, the systematic variability may be reasonably explained with three groups. I would not use this as some sort of final analysis for e.g. differential expression analysis, but I think that it does the job when I just want to do some quick checks. Also, I have specifically set beta=1 in PROSSTT to conform to a negative binomial distribution, since edgeR assumes a negative binomial model, so statistically there should be not too big of a mismatch.

Thanks again, Koen

galicae commented 6 years ago

Hey Koen, glad it helped :) Good point about edgeR - a lot of the functionality might even be generic enough to be reusable. I hope PROSSTT can be useful for you!

Something that is bothering me though: what do you mean "conform to a negative binomial distribution"? Negative binomial distributions are usually described as a series of Bernoulli experiments, and as such you have a parameter p (success/failure prob. of the Bernoulli experiment) and a parameter r (number of successes/failures until we stop Bernoulli'ing). The parametrization you describe here (variance = mu + phi * mu^2) is just imposing certain restrictions on the values of p, r; it certainly is not the only available option for negative binomial distributions. PROSSTT does use negative binomials for count distributions, just some that are more generic (i.e. allowing a coefficient for the linear effect as well).

koenvandenberge commented 6 years ago

I think it is just a way of perspective. If you would allow beta to be different from one, I would call the distribution a "quasi negative binomial". I wanted to avoid this, since this is not what is assumed by the edgeR model that I have fitted (although note that they have a quasilikelihood method [1] that does this), and I only wanted to check the estimation of alpha.

[1] https://www.degruyter.com/view/j/sagmb.2012.11.issue-5/1544-6115.1826/1544-6115.1826.xml

galicae commented 6 years ago

at the end of the day you wanted variance that matched the assumptions of your analysis - I am happy that PROSSTT could provide that for you. We are bound to meet at one conference or the other, let's resume the terminology discussion there :)

soedinglab / prosstt

overdispersion #9