sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Question about multimapping reads #353

Closed by MartaBenegas 1 year ago

MartaBenegas commented 1 year ago

Hi zUMIs team!

May I ask how zUMIs handles multi-mapping reads? Does it keep only unique reads for counting or distribute them somehow among the genes?

You can have a look at the section "Multi-Gene reads" to know what I'm referring to: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md

Thanks in advance!

cziegenhain commented 1 year ago

Hi,

Within zUMIs, you can choose to either count only uniquely aligned reads or also include the primary hits of multimappers if they fall within gene boundaries (set primaryHit: yes under counting_opts in the YAML).

Best, Christoph

royfrancis commented 1 year ago

Is it possible to get the STAR log file? How many reads are mapping/unmapped? Why were reads unmapped? How many multi-mapped?

                             Started job on |   Apr 23 23:17:02
                         Started mapping on |   Apr 23 23:17:04
                                Finished on |   Apr 23 23:26:52
   Mapping speed, Million of reads per hour |   115.68

                      Number of input reads |   18894432
                  Average input read length |   298
                                UNIQUE READS:
               Uniquely mapped reads number |   17704240
                    Uniquely mapped reads % |   93.70%
                      Average mapped length |   297.39
                   Number of splices: Total |   3119841
        Number of splices: Annotated (sjdb) |   2663436
                   Number of splices: GT/AG |   3080422
                   Number of splices: GC/AG |   14219
                   Number of splices: AT/AC |   248
           Number of splices: Non-canonical |   24952
                  Mismatch rate per base, % |   0.49%
                     Deletion rate per base |   0.02%
                    Deletion average length |   2.70
                    Insertion rate per base |   0.02%
                   Insertion average length |   2.30
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   806405
         % of reads mapped to multiple loci |   4.27%
    Number of reads mapped to too many loci |   1146
         % of reads mapped to too many loci |   0.01%

cziegenhain commented 1 year ago

Hi,

The STAR final log will always be part of your zUMIs output, in the same folder as the BAM files. In the zUMIs_output/stats folder, zUMIs also produces more detailed mapping statistics.
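
For readers who want the mapped and multi-mapped numbers without reading the log by eye, here is a minimal Python sketch that parses the "key | value" rows of a STAR Log.final.out. The function name parse_star_final_log is hypothetical and not part of zUMIs or STAR; the sample rows are copied from the log posted above.

```python
def parse_star_final_log(text):
    """Map each 'key | value' row of a STAR Log.final.out to its raw string value.

    Hypothetical helper for illustration; not part of zUMIs or STAR.
    Section headers such as 'UNIQUE READS:' contain no '|' and are skipped.
    """
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        fields[key.strip()] = value.strip()
    return fields


# Rows copied from the log posted earlier in this thread.
sample = """\
                      Number of input reads |   18894432
                                UNIQUE READS:
               Uniquely mapped reads number |   17704240
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   806405
    Number of reads mapped to too many loci |   1146
"""

fields = parse_star_final_log(sample)
total = int(fields["Number of input reads"])
unique = int(fields["Uniquely mapped reads number"])
multi = int(fields["Number of reads mapped to multiple loci"])
too_many = int(fields["Number of reads mapped to too many loci"])

# Reads not covered by the three categories above (for example the
# unmapped sections of the log, which were not included in the post).
remaining = total - unique - multi - too_many

print(f"uniquely mapped: {unique / total:.2%}")
print(f"multi-mapped:    {multi / total:.2%}")
print(f"remaining reads: {remaining}")
```

The percentages computed this way should agree with the ones STAR prints itself, which is a quick sanity check that the right rows were picked up.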

MartaBenegas commented 1 year ago

On that matter, I've run zUMIs with one pair of fastq files:

project: dog01
sequence_files:
  file1:
    name: /data/input/G556-LBA-Dog01_S1_R1_001.fastq.gz
    base_definition:
    - BC(1-16)
    - UMI(17-26)
  file2:
    name: /data/input/G556-LBA-Dog01_S1_R2_001.fastq.gz
    base_definition: cDNA(1-59)
reference:
  STAR_index: /data/input/starIdx
  GTF_file: /data/input/Canis_lupus_familiaris.ROS_Cfam_1.0.109.chr.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /data/output
num_threads: 32
mem_limit: 0
filter_cutoffs:
  BC_filter:
    num_bases: 2
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: /data/input/737K-august-2016.txt
  automatic: no
  BarcodeBinning: 1
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 1
  Ham_Dist: 0
  velocyto: no
  primaryHit: no
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: /usr/bin/STAR-2.7.10b/source/STAR
pigz_exec: pigz
samtools_exec: samtools

But there are 4 STAR reports in the dog01.filtered.tagged.Log.final.out file. Why is that? Logs attached: dog01.filtered.tagged.Log.final.out.txt, dog01_unique.log.txt

cziegenhain commented 1 year ago

Because you have not specified a memory limit, zUMIs was able to run STAR in 4 parallel instances to speed up the processing.
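
If you want whole-sample numbers from such a combined file, one option is to split it into its individual reports and add up the read counts. The sketch below assumes the reports are simply concatenated and that each one begins with a "Started job on" row; that layout is an assumption based on the log format shown earlier, not something confirmed in this thread. The function names and the toy numbers are made up for illustration.

```python
def split_star_reports(text):
    """Split a concatenated Log.final.out into one list of lines per report.

    Assumes every report begins with a 'Started job on' row (an assumption).
    """
    reports = []
    for line in text.splitlines():
        if line.strip().startswith("Started job on"):
            reports.append([])
        if reports:
            reports[-1].append(line)
    return reports


def sum_counts(reports, keys):
    """Add up the integer values of the given row names across all reports."""
    totals = dict.fromkeys(keys, 0)
    for report in reports:
        for line in report:
            key, sep, value = line.partition("|")
            if sep and key.strip() in totals:
                totals[key.strip()] += int(value.strip())
    return totals


# Toy input with made-up numbers, standing in for two parallel STAR instances.
combined = """\
             Started job on |   Jan 01 00:00:00
      Number of input reads |   100
Uniquely mapped reads number |   90
             Started job on |   Jan 01 00:00:00
      Number of input reads |   200
Uniquely mapped reads number |   150
"""

reports = split_star_reports(combined)
totals = sum_counts(
    reports, ["Number of input reads", "Uniquely mapped reads number"]
)
print(len(reports), totals)
```

Only count rows are summed here. Percentage and average rows should be recomputed from the summed counts, since averaging per-instance percentages would weight unequal chunks incorrectly.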