Input to sctansform, scaled UMI counts or direct counts

satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression

GNU General Public License v3.0

203 stars 33 forks source link

Input to sctansform, scaled UMI counts or direct counts #101

Closed Bonder-MJ closed 3 years ago

Bonder-MJ commented 3 years ago

Hello,

I have been trying to use sctransform on our 10X data. I would like to clarify what the exact input should be. From the example vignette it seems that these sound be direct UMI counts. However when I try to apply the method on a much deeper sequenced sample (more UMIs) I find that sctransform returns strange relations between the geometric mean and residual variance as plotted in the example code.

Given the large difference in sequencing depth, and the mean variance relationship as well as the log mean and detection fraction plots, I decided to scale my data to UMI per 10,000. After this transformation the method seems to work as expected. However, given that this is not documented at all in the example, the input seems to be direct UMI counts, I wanted to double check that this is the right path forward.

Thanks!

ChristophH commented 3 years ago

What exactly do you mean by 'strange relations'? Would you mind posting the estimated and fitted model parameters? plot_model_pars(vst_out, show_theta = TRUE, show_var = TRUE) And just as a sanity check, could you use method = 'glmGamPoi' and method = 'qpoisson' when you call vst? The model parameters should look very similar regardless of the method used.

We have used sctransform on a wide range of 10X data including some deeply sequenced one and did not notice major problems. How many cells are in your data set? How many genes detected, and how many UMI per cell?

Scaling the counts as a pre-processing step is NOT recommended. As we argue in the paper (and others as well), per-cell scaling factors are not a good idea, since the effect on gene expression is different depending on gene mean.

Bonder-MJ commented 3 years ago

Dear Christoph,

Thanks for the fast reply. From your paper and vignettes we also expected that we should work with direct UMI counts but we were surprised by the results.

We are using 13K single cells from one of the cell types of this paper (https://www.nature.com/articles/s41588-021-00801-6). I don't have the estimated parameters at hand but I will recalculate them as soon as I can.

Please find attached the strange relation between residual variance and geometric mean I was talking about. We see a strange non-linear relation between the two. After scaling to 1/10000 UMIs or scaling by total UMIs per sample we see that this relation is no longer there and the residual variance is much much lower (max 150).

sctransform_residual_var_hist_FPP_D11

sctransform_gmean_vs_residual_var_FPP_D11

ChristophH commented 3 years ago

I have a feeling that these data have already been pre-processed in some way. If you are able to share the matrix that you use as input to vst I'd be happy to take a look.

Note that the data on Zenodo apparently contains "scRNA-seq normalised counts".

Bonder-MJ commented 3 years ago

We found the issue. We thought we had properly transformed back the data as was explained in the zenodo package. But the transformation back to the original count space didn't work as we expected. Thanks for your help and sorry for bothering!

ChristophH commented 3 years ago

The data on Zenodo is indeed normalized. I looked at D11.h5 and all cells have been scaled to a total of 13,003. For sample_id == 0 I reversed the normalization assuming that the lowest non-zero value was a one. When I use the resulting UMI count matrix (all integer counts) in vst output looks as expected.

E.g.

vst_out <- vst(umi = counts, return_cell_attr = TRUE, method = 'qpoisson', bw_adjust = 1.5)

plot_model_pars(vst_out, show_theta = TRUE, show_var = TRUE)

ga <- tibble::rownames_to_column(vst_out$gene_attr, var = 'gene') %>% 
  mutate(highlight = rank(-residual_variance) <= 25)

ggplot(ga, aes(log10(gmean), sqrt(residual_variance), label = gene)) + 
  geom_point() + geom_smooth() +
  geom_text_repel(data = filter(ga, highlight), size = 3, color = 'red') +
  geom_point(data = filter(ga, highlight), size = 0.66, color = 'red') +
  ggtitle('Residual variance as function of gene mean')

Bonder-MJ commented 3 years ago

Thanks again!