sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 68 forks source link

zUMIs clarification for variable position cDNA sequence #267

Closed jelber2 closed 3 years ago

jelber2 commented 3 years ago

Hi,

I have the following situation where I have bases 1-12 of Read1 as a UMI sequence followed by a constant sequence (AGGACGAAACACCG shown in red below ) that might or might not be exactly after the UMI, and the cDNA sequence is always exactly after the constant sequence.

Screenshot from 2021-06-23 14-35-32

I was wondering if you thought the following YAML might be appropriate, especially the settings for file1 given the cDNA sequence is not always in a fixed position

project: test
sequence_files:
  file1:
    name: R1.fastq.gz
    base_definition:
    - cDNA(27-52)
    - UMI(1-12)
    find_pattern: AGGACGAAACACCG
  file2:
    name: I1.fastq.gz
    base_definition: BC(1-10)
  file3:
    name: I2.fastq.gz
    base_definition: BC(1-10)
reference:
  STAR_index: /nfs/scistore16/itgrp/jelbers/GRCh38.primary_assembly_gencode.v38.annotation_STAR-2.7.3a_sjdbOverhang25/
  GTF_file: /nfs/scistore16/itgrp/jelbers/gencode.v38.annotation.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /nfs/scistore16/itgrp/jelbers/test
num_threads: 48
mem_limit: 0
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: ~
  automatic: yes
  BarcodeBinning: 0
  nReadsperCell: 1
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: no
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools

Any help would be greatly appreciated.

cziegenhain commented 3 years ago

Hi,

Is this maybe similar to the InDrops or ddSeq setup https://github.com/sdparekh/zUMIs/wiki/Protocol-specific-setup#ddseq--surecell-3?

I'm thinking you should be using the correct_frameshift: AGGACGAAACACCG option instead of find_pattern. Maybe @sdparekh can comment because I haven't used this option in a good while, but if you give the base ranges as in the example above, eg. UMI(1-12), it will always be the 12 bases before the found pattern used as UMI and your cDNA(27-52) would then always start directly after the pattern.

Let me know how it goes,

Christoph

jelber2 commented 3 years ago

Thank you! I will give correct_frameshit a try instead of find_pattern. It is actually more of a custom setup, but I guess it is in principle similar to the InDrops setup.

jelber2 commented 3 years ago

So, there is an empty BCstats.txt file if I use correct_framshift but not find_pattern

Error in eval(bysub, parent.frame(), parent.frame()) : 
  object 'XC' not found
Calls: cellBC -> [ -> [.data.table -> eval -> eval
In addition: Warning message:
In data.table::fread(bccount_file, header = FALSE, col.names = c("XC",  :
  File '/nfs/scistore16/itgrp/jelbers/test/test.BCstats.txt' has size 0. Returning a NULL data.table.

I will report back if the results seem particularly strange

cziegenhain commented 3 years ago

Not too hopeful if the BCstats.txt is empty - that sounds like no barcodes were extracted. Feel free to reopen the issue if you need additional help.