stephens999 / ashr

An R package for adaptive shrinkage
GNU General Public License v3.0
79 stars 35 forks source link

Real data departs from Poisson ash fitted distribution #116

Closed aksarkar closed 4 years ago

aksarkar commented 4 years ago

See the data plotted, and some preliminary work on investigating this problem, here:

https://users.rcc.uchicago.edu/~aksarkar/singlecell-modes/gof.html#examples

The data are available at:

https://users.rcc.uchicago.edu/~aksarkar/b-cell-data.Rds

dat <- readRDS("b-cell-data.Rds")
query <- dat[dat$gene == "ENSG00000177954",]
fit <- with(query, ash_pois(count, size, mixcompdist = "halfuniform"))

@pcarbo found Mosek sometimes fails (2/14 times) on this data set

Error in REBayes::KWDual(L, d, w, ...) :
  Mosek error:  MSK_RES_TRM_STALL: The optimizer is terminated due to slow progress.

When Mosek failed, mixsqp was able to find a solution. When Mosek did not fail, its solution was nearly the same as the mixsqp solution

aksarkar commented 4 years ago

@stephens999 I ported my implementation of the GOF test to R for your convenience

# Compute marginal CDF of data under Poisson-unimodal model, where the unimodal
# distribution is represented as a collection of uniform segments
ash_cdf <- function(x, s, fit, thresh=1e-8) {
  a <- fit$fitted_g$a
  b <- fit$fitted_g$b
  pi <- fit$fitted_g$pi
  N <- length(x)
  K <- length(a)
  F <- matrix(rep(0, N * K), N, K)
  for (i in seq(N)) {
    for (k in seq(K)) {
      if (pi[k] < thresh) {
        next
      }
      else if (x[i] < 0) {
        F[i,k] = 0
      }
      else if (a[k] == b[k]) {
        F[i,k] = ppois(x[i], s[i] * a[k])
      }
      else {
        ak = min(a[k], b[k])
        bk = max(a[k], b[k])
        F[i,k] = sum(pgamma(bk, shape=seq(x[i] + 1), rate=s[i]) - pgamma(ak, shape=seq(x[i] + 1), rate=s[i])) / (s[i] * (bk - ak))
      }
    }
  }
  F %*% pi
}

# Compute marginal PMF of data under Poisson-unimodal model, where the unimodal
# distribution is represented as a collection of uniform segments
ash_pmf <- function(x, fit) {
  t(fit$fitted_g$pi %*% ashr:::comp_dens_conv(fit$fitted_g, fit$data))
}

# Test fitted unimodal expression model for goodness of fit to observed data
gof <- function(x, s, fit) {
  Fx_1 <- ash_cdf(x - 1, s, fit)
  fx <- ash_pmf(x, fit)
  u <- runif(n=length(x))
  rpp <- Fx_1 + u * fx
  ks.test(rpp, punif)
}

# Test case for implementation
set.seed(1)
n <- 100
s <- rep(1e5, n)
log_mean <- -6
log_inv_disp <- 4
lam <- rgamma(n=n, shape=exp(log_inv_disp), scale=exp(log_mean - log_inv_disp))
x <- rpois(n=n, lambda=s * lam)
fit <- ashr::ash_pois(x, scale=s, link='identity', mixcompdist='halfuniform')
Fx_1 <- ash_cdf(x - 1, s, fit)
Fx <- ash_cdf(x, s, fit)
fx <- ash_pmf(x, fit)
stopifnot(all.equal(Fx_1 + fx, Fx))
gof(x, s, fit)

# Real data example
dat <- readRDS('/home/aksarkar/public_html/b-cell-data.Rds')
query <- dat[dat$gene == 'ENSG00000177954',]
fit <- with(query, ashr::ash_pois(count, size, mixcompdist='halfuniform'))
with(query, gof(count, size, fit))
aksarkar commented 4 years ago

This example appears to violate the independence assumption on the prior made in Poisson ash.