sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Mapping error at '.tmpMap//tmp.test2.*.SJ.out.tab' #378

Open aliibarry opened 1 year ago

aliibarry commented 1 year ago

I'm trying to analyse some Smart-seq3 data and can't get past the mapping step. Any suggestions would be much appreciated. I've rebuilt my index multiple times (STAR --version reports 2.7.3a, even though the index is flagged below as built with 2.7.1a?). I've also tried my own dependencies with STAR 2.7.11a, as well as a fresh zUMIs pull (currently on 2.9.7e).

bash zUMIs/zUMIs.sh -c -y patch-seq/patchseq.yaml

I'm currently using the YAML provided with the Smart-seq3 example (https://github.com/sandberg-lab/Smart-seq3/blob/master/allele_level_expression/mouse_cross.yaml), with num_threads: and mem_limit: adjusted and no barcode_file: set.

Output is as follows

Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file:     patch-seq/patchseq.yaml
 zUMIs directory:               /home/amb/zUMIs
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 RAM limit:   100
 zUMIs version 2.9.7e

Thu Oct 26 18:22:15 CEST 2023
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
Thu Oct 26 19:26:56 CEST 2023
[1] "84 barcodes detected."
[1] "1705037 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 1739 daughter barcodes that can be binned into 84 parent barcodes."
[1] "Binned barcodes correspond to 1290360 reads."
Mapping...
[1] "2023-10-26 19:36:46 CEST"
Oct 26 19:36:50 ..... started STAR run
Oct 26 19:36:52 ..... loading genome
Oct 26 19:36:50 ..... started STAR run
Oct 26 19:36:52 ..... loading genome
Oct 26 19:36:50 ..... started STAR run
Oct 26 19:36:52 ..... loading genome
cp: cannot stat '/home/amb/patch-seq/zumis_out/zUMIs_output/.tmpMap//tmp.test2.*.SJ.out.tab': No such file or directory
[main_cat] ERROR: input is not BAM or CRAM
[main_cat] ERROR: input is not BAM or CRAM
Thu Oct 26 19:41:39 CEST 2023
Counting...
[1] "2023-10-26 19:41:49 CEST"
[1] "1.5e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/amb/patch-seq/zumis_out/test2.final_annot.gtf"
[E::hts_open_format] Failed to open file /home/amb/patch-seq/zumis_out/test2.filtered.tagged.Aligned.out.bam
samtools view: failed to open "/home/amb/patch-seq/zumis_out/test2.filtered.tagged.Aligned.out.bam" for reading: No such file or directory
[E::hts_open_format] Failed to open file /home/amb/patch-seq/zumis_out/test2.filtered.tagged.Aligned.out.bam
samtools view: failed to open "/home/amb/patch-seq/zumis_out/test2.filtered.tagged.Aligned.out.bam" for reading: No such file or directory
Error in gsub("SN:", "", chr) : object 'chr' not found
Calls: .makeSAF ... .chromLengthFilter -> [ -> [.data.table -> eval -> eval -> gsub
In addition: Warning message:
In data.table::fread(bread, col.names = c("chr", "len"), header = F) :
  File '/tmp/RtmpKL80mU/file2bd85af4ee1a' has size 0. Returning a NULL data.table.
Execution halted
Thu Oct 26 19:42:03 CEST 2023
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/amb/patch-seq/zumis_out/zUMIs_output/expression/test2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Thu Oct 26 19:42:06 CEST 2023
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2023-10-26 19:42:06 CEST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/amb/patch-seq/zumis_out/zUMIs_output/expression/test2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

I've also tried re-running from the mapping step by setting which_Stage: Mapping in the YAML. That gives a slightly different error, again ending in Execution halted.

Thu Oct 26 20:08:07 CEST 2023
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Mapping...
[1] "2023-10-26 20:08:07 CEST"

EXITING because of FATAL INPUT ERROR: --readFilesType SAM requires specifying SE or PE reads
SOLUTION: specify --readFilesType SAM SE for single-end reads or --readFilesType SAM PE for paired-end reads

Oct 26 20:08:10 ...... FATAL ERROR, exiting
Thu Oct 26 20:08:10 CEST 2023
Counting...

As an aside: I'm also trying to get this running in parallel on an HPC, but I'm still working through permission issues with the support team. Any tips there would also be appreciated; the error is below.

starting zumi
Warning: YAML file doesn't include 'Rscript_exec' option; setting to 'Rscript'
Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!
mkdir: cannot create directory ‘/var/spool/slurmd/job6639867/zUMIs-env’: Permission denied
/data/userXXX/zUMIs/zUMIs.sh: line 155: /var/spool/slurmd/job6639867/zUMIs-miniconda.tar.bz2: Permission denied

cziegenhain commented 1 year ago

Hi,

That is indeed odd. Can you share the exact YAML file you use? Do you get an unmapped.bam file in your outputs? If so, what does it look like (e.g. the first few lines of samtools view)?

The warning about the STAR version should be OK: STAR doesn't always write the precise version number into its index files.
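For anyone who wants to check what version an index reports: as far as I know, STAR stores a `versionGenome` entry in `genomeParameters.txt` inside the index directory, and that entry does not have to match the binary that built the index. A minimal sketch under that assumption (the function name `star_index_version` is just illustrative, not part of STAR or zUMIs):

```shell
# Print the version string recorded in a STAR index directory.
# Assumption: the index contains genomeParameters.txt with a
# whitespace-separated "versionGenome <version>" line.
star_index_version() {
    awk '$1 == "versionGenome" { print $2 }' "$1/genomeParameters.txt"
}

# Example call (index path taken from the YAML posted in this thread):
# star_index_version /home/amb/hg_genome_STAR2.7.3a
```

Comparing that value with the output of `STAR --version` shows the same pair of versions that the zUMIs warning prints.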

Best, Christoph

aliibarry commented 1 year ago

Hiya,

YAML is:

project: trial
sequence_files:
  file1:
    name: /home/amb/patchseq/Undetermined_S0_R1_001.fastq.gz
    base_definition:
      - cDNA(23-50)
      - UMI(12-19)
    find_pattern: ATTGCGCAATG
  file2:
    name: /home/amb/patchseq/Undetermined_S0_R2_001.fastq.gz
    base_definition:
      - cDNA(1-50)
  file3:
    name: /home/amb/patchseq/Undetermined_S0_I1_001.fastq.gz
    base_definition:
      - BC(1-8)
  file4:
    name: /home/amb/patchseq/Undetermined_S0_I2_001.fastq.gz
    base_definition:
      - BC(1-8)
reference:
  STAR_index: /home/amb/hg_genome_STAR2.7.3a #made without overhang info
    #pigz_exec: /home/amb/miniconda3/bin/pigz
    #STAR_exec: /home/amb/STAR-2.7.11a/source/STAR
    #samtools_exec: /home/amb/samtools-1.18/samtools
  Rscript_exec: /usr/bin/R
  GTF_file: /home/amb/gencode.v44.primary_assembly.annotation.gtf
  additional_STAR_params: '--limitSjdbInsertNsj 2000000 --clip3pAdapterSeq CTGTCTCTTATACACATCT'
  additional_files:
out_dir: /home/amb/patchseq/out
num_threads: 1
mem_limit: 31
filter_cutoffs:
  BC_filter:
    num_bases: 3
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: 
  automatic: no
  BarcodeBinning: 1
  nReadsperCell: 100
  demultiplex: yes
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  write_ham: yes
  velocyto: no
  primaryHit: yes
  twoPass: no
make_stats: yes
which_Stage: Filtering

There is an unmapped.bam, but it seems incomplete? For out_dir/trial.filtered.tagged.unmapped.bam, this is the head:

VH01324:51:AAF5FKVM5:1:1101:18231:1000  77  *   0   0   *   *   0   0   GCTTTGTATAAACCAGTGATTTTACTACAAAAAACACTGTCCTTGAAAGA  CCCCCCCCCCC;;CCC;C;CCCCCCCCCCCCCCCCCCCCCCCC;CCCCCC  BX:Z:ATCTCAGGTACTCCTT   BC:Z:ATCTCAGGTACTCCTT   UB:Z:   QB:Z:CC;CC;CCCCCCCCCC   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18231:1000  141 *   0   0   *   *   0   0   CTTCTTAAGTGGAATATTCTAATAAGCTACCTTTTGTAAGTGCCATGTTT  CCCCCCCCCCCC-CC-CCCCCCC;CCCCCCCCCC-CCCCCCC-CCCCCCC  BX:Z:ATCTCAGGTACTCCTT   BC:Z:ATCTCAGGTACTCCTT   UB:Z:   QB:Z:CC;CC;CCCCCCCCCC   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18307:1000  77  *   0   0   *   *   0   0   CCCAGAGAGTGGGTCAGCTGGAAGCCCTGGAGACAGTCACAGCTCTCTGA  CCC-C;C-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;  BX:Z:CGAGGCTGCGGAGAGA   BC:Z:CGAGGCTGCGGAGAGA   UB:Z:   QB:Z:CC-CCCCCCCC-CC;C   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18307:1000  141 *   0   0   *   *   0   0   GCCTGGCACCATGGACTCTGTCAGGTCTGGACCCTTCGGCCAGATCTTCA  ;CCCCCC;CCCCCCCC;CCCCC;CCCCCCCCCC;CCCCCCCCC;-C;;CC  BX:Z:CGAGGCTGCGGAGAGA   BC:Z:CGAGGCTGCGGAGAGA   UB:Z:   QB:Z:CC-CCCCCCCC-CC;C   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18345:1000  77  *   0   0   *   *   0   0   TCCCTGGAGCGGCAGCTCAGCGACATCGAGGAGCGCCACAACCACGACCT  CCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC  BX:Z:CGTCCTAGCTCCTTAC   BC:Z:CGTACTAGCTCCTTAC   UB:Z:   QB:Z:CCC-CC;CCCCCCCCC   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18345:1000  141 *   0   0   *   *   0   0   GTATACAGTGGCCCAGTGATGCTTCCTGCAAATGTGCTAAATCTAGTCTC  ;CCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC  BX:Z:CGTCCTAGCTCCTTAC   BC:Z:CGTACTAGCTCCTTAC   UB:Z:   QB:Z:CCC-CC;CCCCCCCCC   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18383:1000  77  *   0   0   *   *   0   0   AAAGAAGATATTGCAATGTGGGAAGTAAATGAAGCCTTTAGTCTGGTTGT  CC;CCCC-CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC  BX:Z:CTCTCTACAGGCTTAG   BC:Z:CTCTCTACAGGCTTAG   UB:Z:   QB:Z:CCCCCCCCCCCCC-;C   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18383:1000  141 *   0   0   *   *   0   0   GCATGAGTCAAATGACCAACAATCCTGGCTCCAGACATCCCAATTGGATG  C-CCC-CCCC;CCCCCCC-CCCCCCCCCCCCCCC;C;CCCCCCCCCCCCC  BX:Z:CTCTCTACAGGCTTAG   BC:Z:CTCTCTACAGGCTTAG   UB:Z:   QB:Z:CCCCCCCCCCCCC-;C   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18459:1000  77  *   0   0   *   *   0   0   GATATAGTTTGAGTATTTGTCCTCTTCAAATCTCATGTTGAAATGTTATC  CCC;CCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC  BX:Z:GCTCATGAATTAGACG   BC:Z:GCTCATGAATTAGACG   UB:Z:   QB:Z:CCCCCCCCCCCCCCCC   QU:Z:
VH01324:51:AAF5FKVM5:1:1101:18459:1000  141 *   0   0   *   *   0   0   TTTTAAAACCAGCTCTCACATGAGCTAATGGAATAAGAACTCACTCATTA  CCCCCC;CCCCCCCCCCCCCCC;CCCCCCCCC-C;CC-CCCCCCCCCCCC  BX:Z:GCTCATGAATTAGACG   BC:Z:GCTCATGAATTAGACG   UB:Z:   QB:Z:CCCCCCCCCCCCCCCC   QU:Z:

cziegenhain commented 1 year ago

Hey,

OK, the unmapped BAM actually looks quite good, and the PE flags are clearly set correctly on the reads, which is what STAR complained about.
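For reference, the FLAG values 77 and 141 in the `samtools view` output can be decoded by hand from the bit definitions in the SAM specification. A small sketch (the helper name `decode_flag` is made up for illustration, and it only covers the bits relevant here):

```shell
# Decode the pairing-related bits of a SAM FLAG value.
# Bit meanings follow the SAM specification:
#   1 = read is paired, 4 = read unmapped, 8 = mate unmapped,
#   64 = first in pair, 128 = second in pair.
decode_flag() {
    flag=$1
    out=""
    if [ $((flag & 1)) -ne 0 ]; then out="$out paired"; fi
    if [ $((flag & 4)) -ne 0 ]; then out="$out unmapped"; fi
    if [ $((flag & 8)) -ne 0 ]; then out="$out mate_unmapped"; fi
    if [ $((flag & 64)) -ne 0 ]; then out="$out read1"; fi
    if [ $((flag & 128)) -ne 0 ]; then out="$out read2"; fi
    echo "${out# }"
}

decode_flag 77    # paired unmapped mate_unmapped read1
decode_flag 141   # paired unmapped mate_unmapped read2
```

So each read name appears as an unmapped read 1 / read 2 pair, which is what a paired-end unmapped BAM should look like.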

Anyway, my gut feeling is that the commented-out lines in the "reference" section may be disturbing things in the YAML! Please remove them completely and check again:

    #pigz_exec: /home/amb/miniconda3/bin/pigz
    #STAR_exec: /home/amb/STAR-2.7.11a/source/STAR
    #samtools_exec: /home/amb/samtools-1.18/samtools

aliibarry commented 1 year ago

Thanks for the quick reply. I removed all comments from the YAML but am getting the same issues: the unmapped BAM is still generated, and the run then fails during the mapping stage.

I did a completely fresh run as well, but this is the error when starting from Mapping with bash zUMIs/zUMIs.sh -c -y patchseq/patchseq.yaml:

Warning: YAML file doesn't include 'pigz_exec' option; setting to 'pigz'
Warning: YAML file doesn't include 'STAR_exec' option; setting to 'STAR'
Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file: patchseq/patchseq.yaml
 zUMIs directory:       /home/amb/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   31
 zUMIs version 2.9.7e 

Tue Oct 31 02:20:58 PM CET 2023
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Mapping...
[1] "2023-10-31 14:20:58 CET"

EXITING because of FATAL INPUT ERROR: --readFilesType SAM requires specifying SE or PE reads
SOLUTION: specify --readFilesType SAM SE for single-end reads or --readFilesType SAM PE for paired-end reads

Oct 31 14:20:59 ...... FATAL ERROR, exiting
Tue Oct 31 02:20:59 PM CET 2023
Counting...
[1] "2023-10-31 14:21:02 CET"
[1] "46500000 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/amb/patchseq/out2/trial.final_annot.gtf"
Error in gsub("SN:", "", chr) : object 'chr' not found
Calls: .makeSAF ... .chromLengthFilter -> [ -> [.data.table -> eval -> eval -> gsub
In addition: Warning message:
In data.table::fread(bread, col.names = c("chr", "len"), header = F) :
  File '/tmp/RtmpdYJWcf/file69191ccf16f2' has size 0. Returning a NULL data.table.
Execution halted

Possibly relevant: at one point during one trial I saw a "Fastq files are not in the same order" error, but I haven't managed to reproduce it. I think it was because I was overwriting the output directory?
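
In case it helps anyone rule out the input files themselves, here is a rough way to check that two gzipped FASTQ files list the same read names in the same order. This is a sketch, not what zUMIs does internally: it assumes plain four-line FASTQ records, and the function names are made up.

```shell
# Print read names from a gzipped FASTQ file: the first field of each
# header line, without the leading "@" or a trailing /1 or /2.
# Assumption: strict four-line records (no wrapped sequences).
fastq_names() {
    gzip -cd "$1" | awk 'NR % 4 == 1 {
        n = $1; sub(/^@/, "", n); sub(/\/[12]$/, "", n); print n
    }'
}

# Return 0 if both files contain the same read names in the same order.
fastq_same_order() {
    a=$(mktemp)
    b=$(mktemp)
    fastq_names "$1" > "$a"
    fastq_names "$2" > "$b"
    if cmp -s "$a" "$b"; then status=0; else status=1; fi
    rm -f "$a" "$b"
    return $status
}

# Example (file names taken from the YAML in this thread):
# fastq_same_order Undetermined_S0_R1_001.fastq.gz Undetermined_S0_R2_001.fastq.gz && echo "same order"
```

The same call can be repeated for the I1 and I2 files against R1. On large files this reads everything once per file, so it is slow but simple.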

aliibarry commented 1 year ago

Just updating for anyone else seeing the same issue: I never resolved this and instead switched to a kallisto-bustools pipeline, which now has a Smart-seq3 option. See biostars post.

Another option that worked for me was umi_tools > samtools > umi_tools dedup > featureCounts.