sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 68 forks source link

Error in uik(bccount$cellindex, bccount$cs/1000) : Method is not applicable for such a small vector. Please give at least a 5 numbers vector #333

Closed carmensandoval closed 2 years ago

carmensandoval commented 2 years ago

I am able to run the whole thing with the test data and my full STAR index, but when I try to run on a subset of my sequencign run (downsampled to 10M reads), I get the following errors during mapping:

It all starts with the error here, but it continues to attempt mapping, counting etc. What is this error due to?

Error in uik(bccount$cellindex, bccount$cs/1000) : 
  Method is not applicable for such a small vector. Please give at least a 5 numbers vector
Calls: cellBC -> .cellBarcode_unknown -> .FindBCcut -> uik
Execution halted

Full output:

> zUMIs.sh -c -y zUMIs_tiny.yaml

Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file:     zUMIs_tiny.yaml
 zUMIs directory:               /gstore/data/ctgbioinfo/sandovc9/bin/zUMIs
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 RAM limit:   null
 zUMIs version 2.9.7c 

Sat Sep 24 15:23:27 PDT 2022
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.

Filtering...

Error in uik(bccount$cellindex, bccount$cs/1000) : 
  Method is not applicable for such a small vector. Please give at least a 5 numbers vector
Calls: cellBC -> .cellBarcode_unknown -> .FindBCcut -> uik
Execution halted

Mapping...
[1] "2022-09-24 15:26:09 PDT"
Sep 24 15:26:10 ..... started STAR run
Sep 24 15:26:10 ..... loading genome
Sep 24 15:26:10 ..... started STAR run
Sep 24 15:26:10 ..... loading genome
Sep 24 15:26:10 ..... started STAR run
Sep 24 15:26:10 ..... loading genome
Sep 24 15:26:18 ..... processing annotations GTF
Sep 24 15:26:19 ..... processing annotations GTF

Sep 24 15:26:36 ..... inserting junctions into the genome indices
Sep 24 15:26:36 ..... inserting junctions into the genome indices

sh: line 1: 31789 Bus error               STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /gstore/data/ctgbioinfo/sandovc9/genomes/star_2.7.1b_nogtf --sjdbGTFfile /gne/data/dnaseq/analysis/aplle/genomes/GRCh38_smartseq3/Homo_sapiens.GRCh38.84.hgnc.gtf --runThreadN 2 --sjdbOverhang 29 --readFilesType SAM PE --clip3pAdapterSeq CTGTCTCTTATACACATCT --readFilesIn /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyaa.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyab.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyac.filtered.tagged.bam --outFileNamePrefix /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMap//tmp.smartseq_tiny.1.
sh: line 1: 31790 Bus error               STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /gstore/data/ctgbioinfo/sandovc9/genomes/star_2.7.1b_nogtf --sjdbGTFfile /gne/data/dnaseq/analysis/aplle/genomes/GRCh38_smartseq3/Homo_sapiens.GRCh38.84.hgnc.gtf --runThreadN 2 --sjdbOverhang 29 --readFilesType SAM PE --clip3pAdapterSeq CTGTCTCTTATACACATCT --readFilesIn /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyad.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyae.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyaf.filtered.tagged.bam --outFileNamePrefix /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMap//tmp.smartseq_tiny.2.
sh: line 1: 31791 Bus error               STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /gstore/data/ctgbioinfo/sandovc9/genomes/star_2.7.1b_nogtf --sjdbGTFfile /gne/data/dnaseq/analysis/aplle/genomes/GRCh38_smartseq3/Homo_sapiens.GRCh38.84.hgnc.gtf --runThreadN 2 --sjdbOverhang 29 --readFilesType SAM PE --clip3pAdapterSeq CTGTCTCTTATACACATCT --readFilesIn /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyag.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyah.filtered.tagged.bam,/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMerge//smartseq_tiny.smartseq_tinyai.filtered.tagged.bam --outFileNamePrefix /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/.tmpMap//tmp.smartseq_tiny.3.
[main_cat] ERROR: input is not BAM or CRAM
[main_cat] ERROR: input is not BAM or CRAM

Sat Sep 24 15:28:03 PDT 2022
Counting...
[1] "2022-09-24 15:28:14 PDT"
Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes_binned.txt")) : 
  File '/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/smartseq_tinykept_barcodes_binned.txt' does not exist or is non-readable. getwd()=='/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny'
Execution halted
Sat Sep 24 15:28:14 PDT 2022
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/expression/smartseq_tiny.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Sat Sep 24 15:28:18 PDT 2022
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2022-09-24 15:28:18 PDT"
Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project,  : 
  File '/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/zUMIs_output/smartseq_tinykept_barcodes.txt' does not exist or is non-readable. getwd()=='/gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny'
Execution halted
carmensandoval commented 2 years ago

YAML:

project: smartseq_tiny
sequence_files:
  file1:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/R1.fastq.gz
    base_definition:
      - cDNA(23-52)
      - UMI(12-19)
    find_pattern: NTTGCGCAATG
  file2:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/R2.fastq.gz
    base_definition:
      - cDNA(1-70)
  file3:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/I1.fastq.gz
    base_definition:
      - BC(1-8)
file4:
    name: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/tiny_fastq/10M_reads/I2.fastq.gz
    base_definition:
      - BC(1-8)
reference:
  STAR_index: /gstore/data/ctgbioinfo/sandovc9/genomes/star_2.7.1b_nogtf # Built with 2.7.1a
  GTF_file: /gne/data/dnaseq/analysis/aplle/genomes/GRCh38_smartseq3/Homo_sapiens.GRCh38.84.hgnc.gtf
  additional_STAR_params: '--clip3pAdapterSeq CTGTCTCTTATACACATCT'
out_dir: /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny
num_threads: 8
mem_limit: null
filter_cutoffs:
  BC_filter:
    num_bases: 3
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20
barcodes:
  barcode_num: null
  barcode_file: null # /gstore/data/ctgbioinfo/sandovc9/soma-seq/bulk_RNA/zUMIs_tiny/expected_barcodes_i5.txt
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: null
  demultiplex: yes
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  write_ham: no
  velocyto: no
  primaryHit: yes
  twoPass: no
make_stats: yes
which_Stage: Filtering
zUMIs_directory: /gstore/data/ctgbioinfo/sandovc9/bin/zUMIs/zuMIs.sh
cziegenhain commented 2 years ago

Hi,

This error means that there is no diversity in barcodes. Have you checked that the barcode read files are intact (eg not all NNNNNNNN). You could also upload smartseq_tiny.BCstats.txt and I take a look at that.

Some further comments on your yaml:

Best, Christoph

carmensandoval commented 2 years ago

Thanks for your elaborate response, @cziegenhain

I fixed this issue by changing nReadsperCell from null to 100, so as you stated in one of your points about my YAML, not setting this to an integer during auto BC detection is problematic.

find_pattern: NTTGCGCAATG -> Is there a reason why you have N as the first base here?

I had set the first base to N as I took a quick look through the fastq and saw that all of the reads started with an N intead of an A, followed by the rest of the pattern. I later became aware that it's expected for the first hundred/thousands reads to have N/lower quality bases as those come from the edges of the tile (?) where base detection is less accurate. Will change back to A.

For Smart-seq3-style data, UMI reads are positively stranded, to use this strand information, set strand: 1

Thanks for the tip!

zUMIs_directory should point to the folder not the .sh file

noted - thanks!