stephens999 / ashr

An R package for adaptive shrinkage
GNU General Public License v3.0
79 stars 35 forks source link

ash_pois log link problems #114

Open aksarkar opened 4 years ago

aksarkar commented 4 years ago

On simulated NB data (https://users.rcc.uchicago.edu/~aksarkar/pois-mode-est.Rds), fitting with log link gets a much worse log likelihood than with identity link.

> dat <- readRDS("/scratch/midway2/aksarkar/modes/test-data/pois-mode-est.Rds")
> with(dat, summary(MASS::glm.nb(x ~ offset(log(s))))$twologlik / 2)
[1] -2008.504
> fit0 <- ashr::ash_pois(dat$x, dat$s, link="identity", mixcompdist="halfuniform")
> fit0$loglik
[1] -2008.271
> lam <- dat$x / dat$s
> eps = 1 / mean(dat$s)
> log_lam <- log(lam + eps)
> se_log_lam <- sqrt(var(lam) / (lam + eps)^2)
> mixsd <- seq(.1 * min(se_log_lam), max(2 * sqrt(log_lam^2 - se_log_lam^2)), by=.5 * log(2))
> fit1 <- ashr::ash_pois(dat$x, dat$s, link="log", mixcompdist="halfuniform", mixsd=mixsd)
> fit1$loglik
[1] -3199.636

autoselect.mixsd needs to be modified to handle this case. For se_log_lam, I am using Taylor expansion of V[ln(\lambda + \epsilon)]. This calculation also needs to be verified.

On real data (https://users.rcc.uchicago.edu/~aksarkar/b-cell-data.Rds), the marginal PMF calculation assuming log link returns probabilities > 1, which should not happen.

> dat <- readRDS("/scratch/midway2/aksarkar/modes/test-data/b-cell-data.Rds")
> query <- dat[dat["gene"] == "ENSG00000019582",]
> with(query, summary(MASS::glm.nb(count ~ offset(log(size))))$twologlik / 2)
[1] -31414.6
> fit0 <- ashr::ash_pois(query$count, query$size, link="identity", mixcompdist="halfuniform")
> fit0$loglik
[1] -31395.05
> lam <- query$count / query$size
> eps = 1 / mean(query$size)
> log_lam <- log(lam + eps)
> se_log_lam <- sqrt(var(lam) / (lam + eps)^2)
> mixsd <- seq(.1 * min(se_log_lam), max(2 * sqrt(log_lam^2 - se_log_lam^2)), by=.5 * log(2))
> fit1 <- ashr::ash_pois(query$count, query$size, link="log", mixcompdist="halfuniform", mixsd=mixsd)
> fit1$loglik
[1] 10408.96
> any(fit1$fitted_g$pi %*% ashr:::comp_dens_conv(fit1$fitted_g, fit1$data) > 1)
[1] TRUE
any(ashr:::comp_dens_conv(fit1$fitted_g, fit1$data) > 1)
[1] TRUE
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /scratch/midway2/aksarkar/miniconda3/envs/scmodes/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] MASS_7.3-50       compiler_3.5.1    Matrix_1.2-14     parallel_3.5.1   
 [5] mixsqp_0.2-2      Rcpp_0.12.18      codetools_0.2-15  SQUAREM_2017.10-1
 [9] pscl_1.5.2        doParallel_1.0.11 grid_3.5.1        iterators_1.0.10 
[13] foreach_1.4.4     truncnorm_1.0-8   ashr_2.2-39       lattice_0.20-35