sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
269 stars 67 forks source link

Low exon or intron+exon counts and genes in STARsolo compared to counts and genes found in zUMIs #317

Closed EmeTexe closed 1 year ago

EmeTexe commented 2 years ago

Hello,

Describe the bug I have a demultiplexed (bcl2fastq) single cell dataset, which i want to map and count. The design is as follow :

R1: Read 1 (26 nt) including the cell specific barcode in position 1-10, as well as the UMI in position 11-18. Bases after position 18 are the Poly A tail and can be ignored for analysis.
R2: Read 2 (90 nt) comprises the cDNA in all positions. Use this read for mapping against the genome.

I have 3 organisms in this dataset, and I know which cell is which organsim. What I did is mapping and counting with zUMIs, all cells (I only have 384 cells) on each organism independently. yaml file for mouse :

project: SBRI1
sequence_files:
  file1:
    name: /mnt/Data2/raw_data/SBRI/new/SBRI1_S1_R1_001.fastq.gz
    base_definition:
    - BC(1-10)
    - UMI(11-18)
  file2:
    name: /mnt/Data2/raw_data/SBRI/new/SBRI1_S1_R2_001.fastq.gz
    base_definition: cDNA(1-90)
reference:
  STAR_index: /mnt/Data1/genomes/m_musclus_GRCm38/zUMIs_genome_GFP
  GTF_file: /mnt/Data1/genomes/m_musclus_GRCm38/Mus_musculus.GRCm38.102_GFP.gtf
  additional_files: ~
out_dir: /mnt/Data1/SBRI/zUMIs_alignment/SBRI1_mouse
num_threads: 32
mem_limit: 56
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: /mnt/Data1/SBRI/whitelist.txt
  automatic: no
  BarcodeBinning: 0
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools

I then retrieve which cell is which organsim using the information I already have. I also tried mapping and counting using STARsolo . The command i launched for STARsolo is :

STAR --genomeDir /mnt/Data1/genomes/m_musclus_GRCm38/STAR_genome_GFP --readFilesIn /mnt/Data2/raw_data/SBRI/new/SBRI1_S1_R2_001.fastq.gz /mnt/Data2/raw_data/SBRI/new/SBRI1_S1_R1_001.fastq.gz --readFilesCommand zcat --outSAMunmapped Within KeepPairs --outSAMorder PairedKeepInputOrder --soloType CB_UMI_Simple --soloCBstart 1 --soloCBlen 10 --soloUMIstart 11 --soloUMIlen 8 --soloCBwhitelist /mnt/Data1/SBRI/whitelist.txt --soloBarcodeReadLength 0 --soloCellReadStats Standard --soloCellFilter None --soloStrand Forward --soloFeatures GeneFull_Ex50pAS --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR --runThreadN 32

When comparing the results, zUMIs find 3 times more counts for each organism, and almost 2 times more genes. (Here only shown for intron+exon, as we have a mapping of 25% exon and 25% intron, we take the intron+exon matrix for analysis, but it's the same for exon only) sbri1_compare_zumis_star

Intrigued, I tried to pursue the analysis, comparing the genes specific to zUMIs or specific to STARsolo (I will only show for the mouse, but it is almost the same for Chimpanzee). There are a lot of genes specific to zUMIs, and a few specific to STARsolo, with most of the genes being found in both. genes_starsolo_zumis Interestingly, when removing pseudgenes ("^Gm", "Rik$") and Ribosomal genes ("^Rp[sl]") we have 3 times less specific genes. no_pseudogenes

Most of the genes specific for one or the other have very low counts (they are sum of counts across all the cells for the mouse, not counts per cell). zUMIs counts_group_zUMIs STARsolo counts_group_starsolo When running clusterprofiler on the genes having at least 5 total counts (because obviously genes express in less than 5 cells don't go far in any analysis), no GO term were found, neither for zUMIs nor STARsolo.

I then tried to make a Seurat object combining zUMIs and STARsolo data, keeping track of which cell comes from which method. Normalization, FindMarkers between zUMIs and STARsolo condition, and keep markers having an average log2FC > 0.5 (for genes differentially expressed in zUMIs) and average log2FC < -0.5 (for genes differentially expressed in STARsolo). Then remove pseudogenes and ribosomal genes (we found an overrepresentation of the ribosomal genes in STARsolo DEG compared to zUMIs DEG). 953 DEG for zUMIs and 1016 for STARsolo. Then running clusterprofiler on those 2 list of genes. No GO found for zUMIs DEG, but there are GO found for STARsolo DEG. They are mainly GO linked to ribosomal process, but interestingly there is also response to leukemia, which is a process involving genes that we are interested in.

We also found some markers of our cells in STARsolo DEG, so it seems that STARsolo does a better job to get counts for our markers. Do you have any idea how there can be so much difference (especially on the number of counts and genes per cell), and why STARsolo seems to map better on ribosomal genes? Seeing this I don't know if I should use STARsolo (which have some of my markers in its DEG, and some GO) or zUMIs which have way more counts and genes.

zUMIs version 2.9.7
STAR version 2.7.10a
R version 4.1.0
Seurat version 4.1.0
clusterProfiler version 4.0.5 

Best, Emeric

cziegenhain commented 1 year ago

Hi Emeric,

There is of course differences in the pipelines and especially when it comes to the strategy for assigning reads to features (genes) there are a lot of options in zUMIs to optimize the parameters to your particular application. I do not know how any of this is done in STARsolo.

Unfortunately I won't be able to help you characterize any of such pipeline-specific differences.

Best, Christoph