Type of count data

cutleraging commented 11 months ago

Hello,

I saw your paper and have been trying out this package. Well done!

I am wondering if you have found any differences when testing with droplet (10X) vs. full-length (Smart-seq) data. I also noticed that the input to scSHC() is raw counts that have not been log-transformed, and that I get different results if I log-transform them first. What is the correct input here?

I also wonder what your advice is on choosing optimal settings for num_features and num_PCs. I see that when you select the highly variable genes, you do not account for the log-mean trend, which might bias which genes are used.

Finally, I am wondering if you ran any tests on the minimum number of cells/genes that can be used here.

Thanks, Ronnie

igrabski commented 11 months ago

Hi Ronnie,

Thanks so much for trying out our package!

In terms of input, our model is based on what is appropriate for UMI data, so I would not expect good performance with non-UMI data like Smart-seq2. Since Smart-seq3 produces UMIs, in principle that should be okay for our approach, although we tested primarily on 10x data. And yes, the input should only be raw counts.
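A quick sanity check along these lines can catch a matrix that was log-transformed or normalized by accident. This is only a rough sketch in Python, not part of the sc-SHC package, and the helper name `looks_like_raw_counts` is hypothetical. It also cannot tell UMI counts from read counts, since both are non-negative integers.

```python
import numpy as np

def looks_like_raw_counts(matrix):
    """Return True if every entry is a finite, non-negative whole number.

    Hypothetical helper for illustration only. Passing this check does NOT
    prove the data are UMI counts: Smart-seq2 read counts are integers too.
    """
    m = np.asarray(matrix, dtype=float)
    if m.size == 0:
        return False
    if not np.all(np.isfinite(m)):
        return False
    if np.any(m < 0):
        return False
    # Log-transformed or scaled values are almost never all whole numbers.
    return bool(np.all(m == np.round(m)))

raw = np.array([[0, 3], [12, 1]])
print(looks_like_raw_counts(raw))            # raw integer counts
print(looks_like_raw_counts(np.log1p(raw)))  # log-transformed version
```

If the check fails, the matrix has most likely been transformed somewhere upstream and the original counts should be recovered before clustering.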

For num_features and num_PCs, we typically have not found the results to be highly affected by the choice of these parameters, so in most cases I think the default values (2500 features and 30 PCs) should work well. If you expect more heterogeneity in the data, increasing the number of PCs could be a good idea. In terms of feature selection, we choose genes using the devianceFeatureSelection method from the scry package, which was originally introduced by this paper (see the section "Feature selection using deviance"). This method works by identifying genes that are not well described by a multinomial model of constant expression, and it has been shown to preferentially select genes that are both highly expressed and highly variable (as opposed to highly, but constantly, expressed).
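To make the idea of deviance-based selection concrete, here is a minimal Python sketch of a per-gene binomial deviance against a constant-expression null, where each gene is assumed to take the same fraction of every cell's total counts. It is a simplified illustration, not scry's actual implementation, and the function names `binomial_deviance` and `select_top_features` are hypothetical.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance under a constant-expression null.

    counts: genes x cells array of raw UMI counts.
    Null model (assumption of this sketch): gene j makes up the same
    proportion pi_j of the total counts n_i in every cell i.
    """
    y = np.asarray(counts, dtype=float)
    n = y.sum(axis=0)                  # total counts per cell
    pi = y.sum(axis=1) / n.sum()       # pooled proportion per gene
    expected = np.outer(pi, n)         # expected counts under the null
    rest = n[None, :] - y              # counts in the cell from other genes
    rest_expected = n[None, :] - expected
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with a zero count contribute 0 (the 0 * log 0 convention).
        t1 = np.where(y > 0, y * np.log(y / expected), 0.0)
        t2 = np.where(rest > 0, rest * np.log(rest / rest_expected), 0.0)
    return 2.0 * (t1 + t2).sum(axis=1)

def select_top_features(counts, num_features):
    """Indices of the num_features genes with the largest deviance."""
    dev = binomial_deviance(counts)
    return np.argsort(dev)[::-1][:num_features]

# Gene 0 is exactly half of every cell's counts, so it fits the null;
# genes 1 and 2 vary strongly between cells.
counts = np.array([
    [10, 20, 10, 20],
    [10,  0,  0, 20],
    [ 0, 20, 10,  0],
])
print(binomial_deviance(counts))
print(select_top_features(counts, 2))
```

A gene whose share of each cell's counts never changes gets a deviance of zero however highly it is expressed, which is why this kind of ranking favors genes that are both highly expressed and variable.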

In terms of cells/genes, we ran some experiments downsampling the number of cells and found that the effectiveness of our method depends on both the sample size and the strength of the clustering signal. If clusters are very strongly separated and have many differences, we can detect them with very high certainty even with 50 cells (the smallest cluster size we tested). However, if clusters are very similar, then more cells are needed to find them; e.g., in an experiment where two clusters differed by only 10 genes, we could not detect the difference with only 50 cells. We did not experiment with greatly reducing the number of genes used. Finally, I will also note that those experiments were run with alpha = 0.05, which is a relatively conservative setting; increasing alpha will make it more likely to find clusters, at the risk of possibly introducing false positives.

Hope that helps! Please let me know if you have any other questions!

cutleraging commented 11 months ago

Thanks for the reply!

Do you know what would need to be modified for it to be compatible with smart-seq data? Are there certain modeling considerations that are specific to the distributions of UMI data?

Ronnie

igrabski commented 11 months ago

The same paper also discusses some differences between UMI and read count distributions, and shows that read count distributions are both zero-inflated and multi-modal. The multi-modality would be particularly challenging -- if the distribution of counts for a gene in a homogeneous setting is already multi-modal, then it would be very difficult to distinguish inherent multi-modality from multi-modality due to heterogeneous populations (in our approach, we are essentially testing whether clusters are best described by one distribution or by more than one).

One thought I did have is that you could apply quminorm to read count data, which produces approximate UMI counts. Those could, in principle, then be fed directly into our approach. That said, I would definitely proceed with caution if you do try this -- I haven't tested this idea at all, and it's certainly possible that these approximate UMI counts don't behave in quite the same way as true UMI counts.

igrabski / sc-SHC

Type of count data #12