Closed EmeTexe closed 1 year ago
Hi Emeric,
There are of course differences between the pipelines, especially in the strategy for assigning reads to features (genes). zUMIs has a lot of options for tuning these parameters to your particular application. I do not know how any of this is done in STARsolo.
Unfortunately, I won't be able to help you characterize any such pipeline-specific differences.
Best, Christoph
Hello,
Describe the bug
I have a demultiplexed (bcl2fastq) single-cell dataset, which I want to map and count. The design is as follows:
I have 3 organisms in this dataset, and I know which cell belongs to which organism. I mapped and counted all cells (I only have 384 cells) with zUMIs against each organism independently. yaml file for mouse:
I then assign each cell to its organism using the information I already have. I also tried mapping and counting with STARsolo. The command I launched for STARsolo is:
When comparing the results, zUMIs finds about 3 times more counts for each organism and almost 2 times more genes. Only the intron+exon matrix is shown here: since about 25% of our reads map to exons and 25% to introns, we use the intron+exon matrix for analysis, but the result is the same for exon only.
![sbri1_compare_zumis_star](https://user-images.githubusercontent.com/63365770/168251100-e87e46b2-848a-4e99-a10a-963bccfe398f.png)
Intrigued, I tried to pursue the analysis, comparing the genes specific to zUMIs or specific to STARsolo (I will only show for the mouse, but it is almost the same for Chimpanzee). There are a lot of genes specific to zUMIs, and a few specific to STARsolo, with most of the genes being found in both.
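For reference, the comparison above boils down to set operations on the gene names detected by each pipeline. This is only a minimal sketch: the function name `compare_gene_sets` and the toy gene lists are made up for illustration and are not the real data.

```python
def compare_gene_sets(zumis_genes, starsolo_genes):
    """Split detected genes into zUMIs-only, STARsolo-only and shared."""
    zumis = set(zumis_genes)
    star = set(starsolo_genes)
    return {
        "zumis_only": sorted(zumis - star),     # detected only by zUMIs
        "starsolo_only": sorted(star - zumis),  # detected only by STARsolo
        "shared": sorted(zumis & star),         # detected by both
    }


# Toy gene lists (hypothetical, for illustration only)
zumis_detected = ["Actb", "Gapdh", "Gm12345", "Rpl13", "Sox2"]
starsolo_detected = ["Actb", "Gapdh", "Rps6", "Sox2"]

result = compare_gene_sets(zumis_detected, starsolo_detected)
for group, genes in result.items():
    print(group, len(genes), genes)
```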
Interestingly, when removing pseudogenes ("^Gm", "Rik$") and ribosomal genes ("^Rp[sl]"), we get 3 times fewer pipeline-specific genes.
![no_pseudogenes](https://user-images.githubusercontent.com/63365770/168257365-d17c86af-0399-4f1f-88b3-b4908565694f.png)
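The name-based filter can be sketched as below, using the three patterns quoted above. The gene list is a toy example and `drop_excluded` is a hypothetical helper name. One thing worth noting: "^Gm" as written also matches annotated genes whose symbol merely starts with "Gm" (for example Gmnn), so the sketch also shows a stricter variant, "^Gm\d+$", which is my own assumption and not what was used for the plot.

```python
import re

# Patterns as quoted in the post: "^Gm", "Rik$", "^Rp[sl]"
LOOSE = re.compile(r"^Gm|Rik$|^Rp[sl]")
# Stricter variant (an assumption): only "Gm" followed by digits
STRICT = re.compile(r"^Gm\d+$|Rik$|^Rp[sl]")


def drop_excluded(genes, pattern):
    """Keep only the gene names that do not match the exclusion pattern."""
    return [g for g in genes if not pattern.search(g)]


# Toy gene names (illustrative only)
genes = ["Gm12345", "1700001A01Rik", "Rpl13", "Rps6", "Actb", "Gmnn", "Rpp30"]

kept_loose = drop_excluded(genes, LOOSE)    # Gmnn is removed as well
kept_strict = drop_excluded(genes, STRICT)  # Gmnn is kept
print(kept_loose)
print(kept_strict)
```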
Most of the genes specific to one pipeline or the other have very low counts (these are sums of counts across all mouse cells, not counts per cell).
zUMIs
STARsolo
When running clusterProfiler on the genes with at least 5 total counts (because genes expressed in fewer than 5 cells obviously don't go far in any analysis), no GO terms were found, either for zUMIs or for STARsolo.
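A small caveat on this threshold: a cutoff on total counts is not quite the same as a cutoff on the number of cells expressing the gene, because 5 counts can all come from a single cell. The sketch below shows the difference on a made-up matrix; the gene names and values are hypothetical.

```python
def total_counts(per_cell):
    """Sum of counts for one gene across all cells."""
    return sum(per_cell)


def n_cells_expressing(per_cell):
    """Number of cells with at least one count for this gene."""
    return sum(1 for c in per_cell if c > 0)


# Toy counts: gene -> counts in each of 6 cells (illustrative only)
counts = {
    "GeneA": [5, 0, 0, 0, 0, 0],  # 5 counts, all in one cell
    "GeneB": [1, 1, 1, 1, 1, 0],  # 5 counts spread over 5 cells
    "GeneC": [1, 0, 1, 0, 0, 0],  # 2 counts in 2 cells
}

by_total = [g for g, v in counts.items() if total_counts(v) >= 5]
by_cells = [g for g, v in counts.items() if n_cells_expressing(v) >= 5]
print(by_total)
print(by_cells)
```

GeneA passes the total-count filter but is expressed in only one cell, so the two filters select different gene lists.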
I then built a Seurat object combining the zUMIs and STARsolo data, keeping track of which method each cell comes from. I normalized the data, ran FindMarkers between the zUMIs and STARsolo conditions, and kept markers with an average log2FC > 0.5 (genes differentially expressed in zUMIs) or an average log2FC < -0.5 (genes differentially expressed in STARsolo). I then removed pseudogenes and ribosomal genes (we found an overrepresentation of ribosomal genes in the STARsolo DEGs compared to the zUMIs DEGs). This leaves 953 DEGs for zUMIs and 1016 for STARsolo. Running clusterProfiler on these 2 gene lists, no GO terms were found for the zUMIs DEGs, but some were found for the STARsolo DEGs. They are mainly GO terms linked to ribosomal processes, but interestingly there is also response to leukemia, which is a process involving genes that we are interested in.
We also found some markers of our cells in the STARsolo DEGs, so it seems that STARsolo does a better job of getting counts for our markers. Do you have any idea why there is so much difference (especially in the number of counts and genes per cell), and why STARsolo seems to assign more reads to ribosomal genes? Given this, I don't know whether I should use STARsolo (which has some of my markers in its DEGs, and some GO terms) or zUMIs (which has far more counts and genes).
Best, Emeric