Closed carmensandoval closed 2 years ago
YAML:
project: smartseq_tiny
sequence_files:
  file1:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/R1.fastq.gz
    base_definition:
      - cDNA(23-52)
      - UMI(12-19)
    find_pattern: NTTGCGCAATG
  file2:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/R2.fastq.gz
    base_definition:
      - cDNA(1-70)
  file3:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/I1.fastq.gz
    base_definition:
      - BC(1-8)
  file4:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/I2.fastq.gz
    base_definition:
      - BC(1-8)
reference:
  STAR_index: /gstore/data/ctgbioinfo/sandovc9/genomes/star_2.7.1b_nogtf # Built with 2.7.1a
  GTF_file: /gne/data/dnaseq/analysis/aplle/genomes/GRCh38_smartseq3/Homo_sapiens.GRCh38.84.hgnc.gtf
  additional_STAR_params: '--clip3pAdapterSeq CTGTCTCTTATACACATCT'
out_dir: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny
num_threads: 8
mem_limit: null
filter_cutoffs:
  BC_filter:
    num_bases: 3
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20
barcodes:
  barcode_num: null
  barcode_file: null # /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/expected_barcodes_i5.txt
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: null
  demultiplex: yes
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  write_ham: no
  velocyto: no
  primaryHit: yes
  twoPass: no
make_stats: yes
which_Stage: Filtering
zUMIs_directory: /gstore/data/ctgbioinfo/sandovc9/bin/zUMIs/zuMIs.sh
Hi,
This error means that there is no diversity in the barcodes. Have you checked that the barcode read files are intact (e.g. not all NNNNNNNN)?
You could also upload smartseq_tiny.BCstats.txt and I can take a look at it.
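For anyone wanting to run this check by hand, here is a minimal sketch (not part of zUMIs) that tallies the barcodes in the first reads of a gzipped index FASTQ. The function name `barcode_diversity` and its parameters are made up for illustration; the default positions 1-8 follow the BC(1-8) definition in the YAML above.

```python
import gzip
from collections import Counter
from itertools import islice


def barcode_diversity(fastq_gz, start=0, end=8, max_reads=100_000):
    """Tally barcode sequences in the first `max_reads` reads of a gzipped FASTQ.

    `start`/`end` are 0-based slice bounds; the defaults cover bases 1-8.
    """
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        # A FASTQ record has 4 lines; the sequence is the 2nd line of each record.
        for seq in islice(handle, 1, 4 * max_reads, 4):
            counts[seq.strip()[start:end]] += 1
    total = sum(counts.values())
    all_n = counts.get("N" * (end - start), 0)
    return {
        "reads": total,
        "distinct_barcodes": len(counts),
        "fraction_all_N": all_n / total if total else 0.0,
        "top": counts.most_common(5),
    }
```

Calling `barcode_diversity(...)` on the I1 and I2 files and looking at `distinct_barcodes` and `fraction_all_N` should show quickly whether the index reads carry any usable barcodes.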
Some further comments on your yaml:
find_pattern: NTTGCGCAATG
-> Is there a reason why you have N as the first base here? I remember you said you use Takara's new LP UMI kit. Now, I do not know if Takara changed anything relative to the TSO in Smart-seq3 for this kit, but for zUMIs to perform correctly in "Smart-seq3" mode, the pattern to find must be ATTGCGCAATG
file4
nReadsperCell: keep this at its default of 100 when performing automatic barcode detection. I am unsure about using null; it is safer to have an integer there.
strand: For Smart-seq3-style data, UMI reads are positively stranded; to use this strand information, set strand: 1
zUMIs_directory should point to the folder, not the .sh file, but there should be no need to set this, as it will be determined automatically from your zUMIs.sh call.
Best, Christoph
Thanks for your elaborate response, @cziegenhain
I fixed this issue by changing nReadsperCell from null to 100. As you noted in your comments on my YAML, not setting this to an integer during automatic BC detection is problematic.
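In case it helps others hitting the same thing, below is a rough sketch of a pre-flight check for the two settings from the comments above (nReadsperCell and zUMIs_directory). It is not part of zUMIs; `check_zumis_config` is a hypothetical helper, and it assumes the YAML has already been loaded into a plain dict by whatever YAML loader you use.

```python
def check_zumis_config(cfg):
    """Return a list of warnings for a zUMIs config already loaded as a dict."""
    warnings = []

    barcodes = cfg.get("barcodes") or {}
    n_reads = barcodes.get("nReadsperCell")
    # YAML 1.1 loaders typically read `yes` as a boolean; accept the string too.
    automatic = barcodes.get("automatic") in (True, "yes")
    # bool is a subclass of int in Python, so rule it out explicitly.
    is_integer = isinstance(n_reads, int) and not isinstance(n_reads, bool)
    if automatic and not is_integer:
        warnings.append(
            "nReadsperCell should be an integer (default 100) "
            "when automatic barcode detection is on"
        )

    zumis_dir = str(cfg.get("zUMIs_directory") or "")
    if zumis_dir.endswith(".sh"):
        warnings.append("zUMIs_directory should be the folder, not the .sh script")

    return warnings
```

With the values from my original YAML (nReadsperCell: null and a zUMIs_directory ending in .sh), this returns both warnings; with nReadsperCell set to 100 and the folder path it returns an empty list.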
find_pattern: NTTGCGCAATG -> Is there a reason why you have N as the first base here?
I had set the first base to N because I took a quick look through the fastq and saw that all of the reads started with an N instead of an A, followed by the rest of the pattern. I later became aware that it's expected for the first hundreds/thousands of reads to have N/lower-quality bases, as those come from the edges of the tile (?) where base detection is less accurate. Will change back to A.
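A quick way to confirm this beyond eyeballing the top of the file is to count read starts over a larger sample. This is only a sketch: `count_read_starts` and its parameters are hypothetical names, and it assumes the tag sits at the very start of read 1 (as the UMI(12-19) definition suggests). Patterns are matched literally, so N is not treated as a wildcard here.

```python
import gzip
from itertools import islice


def count_read_starts(fastq_gz, patterns=("ATTGCGCAATG", "NTTGCGCAATG"),
                      skip_reads=0, max_reads=100_000):
    """Count reads starting with each pattern, after skipping `skip_reads` reads."""
    counts = {p: 0 for p in patterns}
    total = 0
    with gzip.open(fastq_gz, "rt") as handle:
        # Sequence lines are at 0-based line indices 1, 5, 9, ...
        first = 4 * skip_reads + 1
        for seq in islice(handle, first, first + 4 * max_reads, 4):
            total += 1
            for p in patterns:
                if seq.startswith(p):
                    counts[p] += 1
    return total, counts
```

Setting `skip_reads` to a large value skips the first reads in the file, which makes it easy to compare the start of the file with reads further in.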
For Smart-seq3-style data, UMI reads are positively stranded, to use this strand information, set strand: 1
Thanks for the tip!
zUMIs_directory should point to the folder not the .sh file
noted - thanks!
I am able to run the whole thing with the test data and my full STAR index, but when I try to run on a subset of my sequencing run (downsampled to 10M reads), I get the following errors during mapping:
It all starts with the error here, but it continues to attempt mapping, counting etc. What is this error due to?
Full output: