Bonder-MJ closed this issue 3 years ago
What exactly do you mean by 'strange relations'? Would you mind posting the estimated and fitted model parameters?
plot_model_pars(vst_out, show_theta = TRUE, show_var = TRUE)
And just as a sanity check, could you use method = 'glmGamPoi' and method = 'qpoisson' when you call vst? The model parameters should look very similar regardless of the method used.
We have used sctransform on a wide range of 10X data, including some deeply sequenced data sets, and did not notice major problems. How many cells are in your data set? How many genes are detected, and how many UMIs per cell?
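Those three numbers can be read directly off the count matrix. A minimal base-R sketch, assuming a genes-by-cells matrix; the toy matrix and all variable names below are illustrative and not taken from this thread:

```r
# Toy genes x cells UMI matrix; stands in for the real input to vst.
set.seed(1)
counts <- matrix(rpois(50 * 20, lambda = 2), nrow = 50, ncol = 20,
                 dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20)))

n_cells <- ncol(counts)                          # number of cells
genes_detected <- sum(rowSums(counts > 0) > 0)   # genes seen in at least one cell
umi_per_cell <- colSums(counts)                  # total UMIs per cell
genes_per_cell <- colSums(counts > 0)            # genes detected per cell

# Distribution of sequencing depth and complexity across cells
summary(umi_per_cell)
summary(genes_per_cell)
```

For a sparse dgCMatrix, the Matrix package provides colSums and rowSums methods with the same semantics, so the same lines should carry over.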
Scaling the counts as a pre-processing step is NOT recommended. As we argue in the paper (and others as well), per-cell scaling factors are not a good idea, since the effect on gene expression is different depending on gene mean.
Dear Christoph,
Thanks for the fast reply. From your paper and vignettes we also expected that we should work with direct UMI counts, but we were surprised by the results.
We are using 13K single cells from one of the cell types of this paper (https://www.nature.com/articles/s41588-021-00801-6). I don't have the estimated parameters at hand but I will recalculate them as soon as I can.
Please find attached the relation between residual variance and geometric mean that I was referring to: the two show a strange non-linear relationship. After scaling to UMIs per 10,000, or scaling by total UMIs per sample, this relationship disappears and the residual variance is much lower (max 150).
I have a feeling that these data have already been pre-processed in some way. If you are able to share the matrix that you use as input to vst, I'd be happy to take a look.
Note that the data on Zenodo apparently contains "scRNA-seq normalised counts".
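A quick way to test that suspicion before calling vst is to check whether the matrix still looks like raw counts. The sketch below is only an illustration: looks_like_raw_counts is a hypothetical helper, not part of sctransform, and it assumes a dense genes-by-cells matrix. It flags non-integer values and identical per-cell totals, both of which point to per-cell scaling.

```r
# Hypothetical helper: does a genes x cells matrix look like raw UMI counts?
looks_like_raw_counts <- function(m, tol = 1e-8) {
  vals <- m[m != 0]
  # Raw UMI counts are integers.
  all_integer <- all(abs(vals - round(vals)) < tol)
  totals <- colSums(m)
  # Identical per-cell totals are a strong hint of per-cell scaling.
  equal_totals <- ncol(m) > 1 && (max(totals) - min(totals)) < tol * max(totals)
  list(all_integer = all_integer, equal_totals = equal_totals,
       raw = all_integer && !equal_totals)
}

raw <- matrix(c(1, 0, 3, 2, 5, 0), nrow = 3)          # per-cell totals 4 and 7
scaled <- sweep(raw, 2, colSums(raw), "/") * 10000    # every cell scaled to 10,000
looks_like_raw_counts(raw)$raw      # TRUE
looks_like_raw_counts(scaled)$raw   # FALSE
```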
We found the issue. We thought we had properly transformed the data back as explained in the Zenodo package, but the transformation back to the original count space didn't work as we expected. Thanks for your help and sorry for bothering you!
The data on Zenodo is indeed normalized. I looked at D11.h5 and all cells have been scaled to a total of 13,003. For sample_id == 0 I reversed the normalization assuming that the lowest non-zero value was a one. When I use the resulting UMI count matrix (all integer counts) in vst, the output looks as expected.
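The reversal described above can be sketched in base R as follows. This is not the code that was actually run for this thread; unscale_counts is a hypothetical helper that assumes a dense genes-by-cells matrix and that the smallest non-zero value in each cell corresponds to exactly one UMI.

```r
# Hypothetical helper: undo per-cell scaling, assuming the smallest
# non-zero value in each cell corresponds to a single UMI.
unscale_counts <- function(m, tol = 1e-6) {
  out <- apply(m, 2, function(x) {
    nz <- x[x > 0]
    if (length(nz) == 0) return(x)   # empty cell: nothing to recover
    x / min(nz)
  })
  rounded <- round(out)
  # If the assumption is wrong, values will not land on integers.
  if (any(abs(out - rounded) > tol)) {
    warning("some values are not close to integers; the assumption may not hold")
  }
  storage.mode(rounded) <- "integer"
  dimnames(rounded) <- dimnames(m)
  rounded
}

raw <- matrix(c(1, 0, 3, 2, 1, 0), nrow = 3)          # each cell contains a count of 1
scaled <- sweep(raw, 2, colSums(raw), "/") * 13003    # mimic scaling to a total of 13,003
recovered <- unscale_counts(scaled)
all(recovered == raw)   # TRUE
```

The warning matters in practice: a cell whose smallest true count is 2 or more would be recovered incorrectly, and non-integer results are the only visible sign of that.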
E.g.
library(sctransform)
library(dplyr)
library(ggplot2)
library(ggrepel)
# Fit the model on the recovered integer UMI counts
vst_out <- vst(umi = counts, return_cell_attr = TRUE, method = 'qpoisson', bw_adjust = 1.5)
plot_model_pars(vst_out, show_theta = TRUE, show_var = TRUE)
# Flag the 25 genes with the highest residual variance
ga <- tibble::rownames_to_column(vst_out$gene_attr, var = 'gene') %>%
  mutate(highlight = rank(-residual_variance) <= 25)
# Plot residual variance against gene mean and label the flagged genes
ggplot(ga, aes(log10(gmean), sqrt(residual_variance), label = gene)) +
  geom_point() + geom_smooth() +
  geom_text_repel(data = filter(ga, highlight), size = 3, color = 'red') +
  geom_point(data = filter(ga, highlight), size = 0.66, color = 'red') +
  ggtitle('Residual variance as function of gene mean')
Thanks again!
Hello,
I have been trying to use sctransform on our 10X data. I would like to clarify what the exact input should be. From the example vignette it seems that this should be direct UMI counts. However, when I try to apply the method to a much more deeply sequenced sample (more UMIs), I find that sctransform returns strange relations between the geometric mean and residual variance as plotted in the example code.
Given the large difference in sequencing depth, and the mean-variance relationship as well as the log mean and detection fraction plots, I decided to scale my data to UMI per 10,000. After this transformation the method seems to work as expected. However, this step is not documented anywhere in the example, where the input seems to be direct UMI counts, so I wanted to double check that this is the right path forward.
Thanks!