kstreet13 / scry

Collection of analysis methods for small count data generated by rafalab members (the methods, not the data).
19 stars 2 forks source link

Including covariates with nullResiduals() #13

Open johnchamberlin opened 4 years ago

johnchamberlin commented 4 years ago

Is it possible to include gene or cell covariates when using the nullResiduals approximation of glc-pca (i.e. X and Z)? I am working with a large dataset and am trying to speed up glm-pca by using nullResiduals instead. By cell covariate I mean a computed value, not a batch identity.

Thanks, John

willtownes commented 4 years ago

Hi thanks for your interest. So far this is not supported in scry; we only allow adjusting for cell covariates that are single categorical variables. The reason is because that is the scenario where the residuals can be computed in closed-form. In the future we might provide this feature but it will be much slower because it will require an iterative optimization procedure (basically equivalent to a Poisson regression) for each gene. Here are some possible alternative approaches in the meantime (in decreasing order of favorability):

  1. Try using the sctransform package, which uses negative binomial residuals instead of Poisson or binomial. I believe it supports arbitrary cell covariates. There is a slight difference in that it uses log-total count as a covariate instead of an offset by default and doesn't support deviance residuals (only Pearson), but neither of those should matter much. The main advantage scry has over sctransform is speed and simplicity, but once we start having non-categorical covariates that would go away.
  2. If your goal is to eventually apply PCA to the residuals, you could ignore the covariates and compute the ordinary scry residuals, apply PCA to them, then regress off your covariates from the resulting principal component values. It may be of interest to compare this to the harmony method.
  3. You could implement the above Poisson regression procedure yourself using the glm function from regular R (or some faster equivalent). Compute the log of total counts for each cell as an offset. Then for each gene (independently) regress the vector of gene counts y_j against your covariates matrix X with the offsets and the Poisson glm family with log link. Then take the deviance residuals from the fitted model. That should be equivalent to what scry provides in the case that X encodes a single categorical covariate and/or just an intercept term.
willtownes commented 4 years ago

oh one more thing, if you have tried glmpca in the past and been disappointed with speed, please make sure you have tried the newer v0.2 release (currently on CRAN), where we have increased the speed with a new optimization algorithm. Of course, it is still slower than computing residuals with scry.

johnchamberlin commented 4 years ago

I just started using glmpca yesterday. It runs in [edit: 23 minutes] with 40K cells, 2K features, L=10, which I think is perfectly adequate. But I wanted to produce the deviance matrix as well, which is when I switched to scry.

Can I use deviances to perform differential expression, similar to sctransform, or is the idea simply to use glm-pca (or PCA on deviances) ahead of clustering and then revert to raw counts?

Thanks

willtownes commented 4 years ago

I suppose you could use the deviance residuals for differential expression, but we would recommend using the raw counts instead along with a tool like DEseq2 or edgeR. Soneson & Robinson 2018 may be relevant.

By the way, I'm not sure of your use case for deviance residuals other than differential expression, but if you have already run GLM-PCA, you can use the predict.glmpca method to get a matrix of the imputed mean values for original data matrix. This is sort of like a denoised version of the data. If you were to normalize and log-transform it (there won't be any exact zeros), it should have similar properties as the deviance residuals matrix you would get from scry.