sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

suspect error with samtools from miniconda #283

Closed roxyisat-rex closed 2 years ago

roxyisat-rex commented 3 years ago

Hello

I want to report an issue with zUMIs. I am running it on our college's HPC, which uses the PBS scheduler; the output and error logs are written to separate files, so I will post both. The error log says:

pigz: abort: write error on <stdout> (Broken pipe)
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Vim: Warning: Output is not to a terminal
Vim: Warning: Input is not from a terminal
Warning message:
In data.table::fread(cmd = paste(samtools, "view", filtered_bams[1],  :
  Stopped early on line 3. Expected 4 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<_\â^@.<8f>^AÐxG<99><84>­Õ^GS(Ûâ+u±;àÛ<80>9N8öÖ^^@ñ_.^S·?Þ|Ë<88>G^_¢^^Ä(½jâ^P<9a>a>>
cat: invalid option -- 'o'
Try 'cat --help' for more information.
VIM - Vi IMproved 8.0 (2016 Sep 12, compiled Nov 11 2019 19:07:48)
Unknown option argument: "-@"
Unknown option argumentUnknown option argument: ": "-@-@""
E26: Hebrew cannot be used: Not enabled at compile time
Error in data.table::fread(bread, col.names = c("chr", "len"), header = F) : 
  input= contains no \n or \r, but starts with a space. Please remove the leading space, or use text=, file= or cmd=
Calls: .makeSAF -> .chromLengthFilter -> <Anonymous>
Execution halted
Loading required package: yaml
Loading required package: Matrix
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/rds/general/user/xyz16/home/bob/zUMIs_output/zUMIs_output/expression/SS3_human_2samples_test.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/rds/general/user/xyz16/home/bob/zUMIs_output/zUMIs_output/expression/SS3_human_2samples_test.dgecounts.rds', probable reason 'No such file or directory'
Execution halted

The output log says:

Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file: non_demulti_test.yaml
 zUMIs directory:       /rds/general/user/xyz16/home/bob/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   100
 zUMIs version 2.9.7 

Thu 26 Aug 12:04:54 BST 2021
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
Thu 26 Aug 12:07:15 BST 2021
[1] "184 barcodes detected."
[1] "1024654 reads were assigned to barcodes that do not correspond to intact cells."
Mapping...
[1] "2021-08-26 12:07:22 BST"
Aug 26 12:07:26 ..... started STAR run
Aug 26 12:07:27 ..... loading genome
Aug 26 12:07:26 ..... started STAR run
Aug 26 12:07:27 ..... loading genome
Aug 26 12:07:26 ..... started STAR run
Aug 26 12:07:27 ..... loading genome
Aug 26 12:07:40 ..... processing annotations GTF
Aug 26 12:07:40 ..... processing annotations GTF
Aug 26 12:07:40 ..... processing annotations GTF
Aug 26 12:08:03 ..... inserting junctions into the genome indices
Aug 26 12:08:04 ..... inserting junctions into the genome indices
Aug 26 12:08:09 ..... inserting junctions into the genome indices
Aug 26 12:13:12 ..... started 1st pass mapping
Aug 26 12:13:13 ..... finished 1st pass mapping
Aug 26 12:13:13 ..... inserting junctions into the genome indices
Aug 26 12:13:21 ..... started 1st pass mapping
Aug 26 12:13:22 ..... finished 1st pass mapping
Aug 26 12:13:22 ..... inserting junctions into the genome indices
Aug 26 12:13:45 ..... started 1st pass mapping
Aug 26 12:13:45 ..... finished 1st pass mapping
Aug 26 12:13:46 ..... inserting junctions into the genome indices
Aug 26 12:15:01 ..... started mapping
Aug 26 12:15:02 ..... finished mapping
Aug 26 12:15:02 ..... finished successfully
Aug 26 12:15:09 ..... started mapping
Aug 26 12:15:09 ..... finished mapping
Aug 26 12:15:09 ..... finished successfully
Aug 26 12:15:55 ..... started mapping
Aug 26 12:15:56 ..... finished mapping
Aug 26 12:15:56 ..... finished successfully
Thu 26 Aug 12:15:56 BST 2021
Counting...
[1] "2021-08-26 12:16:11 BST"
[1] "4.5e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/rds/general/user/xyz16/home/bob/zUMIs_output/SS3_human_2samples_test.final_annot.gtf"
Thu 26 Aug 12:16:40 BST 2021
[1] "loomR found"
Thu 26 Aug 12:16:44 BST 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-08-26 12:16:44 BST"
Thu 26 Aug 12:16:53 BST 2021

This is my YAML:

project: SS3_human_2samples_test
sequence_files:
  file1:
    name: /rds/general/user/xyz16/home/bob/non_demultiplexed/Undetermined_S0_L001_R1_001.fastq.gz
    base_definition:
    - cDNA(23-150)
    - UMI(12-19)
    find_pattern: ATTGCGCAATG
  file2:
    name: /rds/general/user/xyz16/home/bob/non_demultiplexed/Undetermined_S0_L001_R2_001.fastq.gz
    base_definition: 
    - cDNA(1-150)
  file3:
    name: /rds/general/user/xyz16/home/bob/non_demultiplexed/Undetermined_S0_L001_I1_001.fastq.gz
    base_definition: 
    - BC(1-8)
  file4:
    name: /rds/general/user/xyz16/home/bob/non_demultiplexed/Undetermined_S0_L001_I2_001.fastq.gz
    base_definition: 
    - BC(1-8)
reference:
  STAR_index: /rds/general/user/xyz16/home/bob/STAR_INDEX
  GTF_file: /rds/general/user/xyz16/home/bob/STAR_ref_genome_files/Homo_sapiens.GRCh38.104.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /rds/general/user/xyz16/home/bob/zUMIs_output
num_threads: 32
mem_limit: 100
filter_cutoffs:
  BC_filter:
    num_bases: 3
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: ~
  automatic: yes
  BarcodeBinning: 0
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering 
zUMIs_directory: /rds/general/user/xyz16/home/bob/zUMIs
samtools_exec: samtools
pigz_exec: pigz
STAR_exec: STAR
Rscript_exec: Rscript

I think this is potentially a problem with samtools, but I am using the miniconda env that came with the package. Or maybe it is something else? I saw a similar issue, #78, among the closed issues, but that was due to a cDNA problem in the YAML and then a STAR problem, and I don't think that's what's happening here. If you could advise, it would be great! Thank you very much!
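One quick way to test the samtools suspicion is to see what each configured executable name actually resolves to inside the job environment. The following is a minimal Python sketch, not part of zUMIs: `resolve_executables` is a hypothetical helper, and the tool names are simply the `*_exec` values from the YAML above. It is worth running from within the PBS job itself, since the PATH there may differ from the login shell.

```python
import shutil


def resolve_executables(names):
    """Map each executable name to the full path found on PATH, or None."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Names taken from the samtools_exec, pigz_exec, STAR_exec and
    # Rscript_exec entries in the YAML above.
    tools = ["samtools", "pigz", "STAR", "Rscript"]
    for name, path in resolve_executables(tools).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If any entry prints `NOT FOUND on PATH`, or points outside the zUMIs conda environment, that would be a reasonable place to start looking.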

cziegenhain commented 3 years ago

Hi,

I'm not familiar with the PBS job scheduler. Since the std out and std err are different files, it's a little bit hard to grasp at which step the first issues start happening.
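If it helps, the two streams can be captured in a single file so that errors appear next to the step that produced them. PBS can usually do this itself with the `-j oe` job option; the sketch below is a scheduler-independent alternative in Python. `run_with_merged_output` is a hypothetical helper, not part of zUMIs, and programs that buffer their stdout may still appear slightly out of order.

```python
import subprocess
import sys


def run_with_merged_output(cmd, log_path):
    """Run cmd with stdout and stderr written to the same log file.

    Returns the exit code of the command.
    """
    with open(log_path, "w") as log:
        # stderr=subprocess.STDOUT sends stderr to the same file as stdout,
        # so messages land in the order they were written.
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


if __name__ == "__main__":
    # Usage: python merged_run.py combined.log <command> [args...]
    sys.exit(run_with_merged_output(sys.argv[2:], sys.argv[1]))
```

For example, `run_with_merged_output(["bash", "zUMIs.sh", "-c", "-y", "non_demulti_test.yaml"], "zUMIs_combined.log")`, where the exact zUMIs command line is an assumption and should be replaced with whatever the job script already runs.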

To help diagnose this problem:

Best, Christoph

cziegenhain commented 2 years ago

Feel free to reopen the issue if you still need assistance.

roxyisat-rex commented 2 years ago

Hi Christoph

I am writing in hopes of reopening this issue. (Sorry about the delay in getting back, I was working on another project which was quite urgent). I re-ran with the test/ sample data provided in the usage link. Below is the error from terminal:

Loading anaconda3/personal
  Loading requirement: fix_unwritable_tmp fix_setxattr
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Error in uik(bccount$cellindex, bccount$cs/1000) : 
  Method is not applicable for such a small vector. Please give at least a 5 numbers vector
Calls: cellBC -> .cellBarcode_unknown -> .FindBCcut -> uik
Execution halted
Warning message:
NAs introduced by coercion 
Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes.txt")) : 
  File '/rds/general/user/xyz16/home/bob/zUMIs_output/SS3_human_2samples_testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/rds/general/user/xyz16/home/bob'
Execution halted
Loading required package: yaml
Loading required package: Matrix
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/rds/general/user/xyz16/home/bob/zUMIs_output/expression/SS3_human_2samples_test.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project,  : 
  File '/rds/general/user/xyz16/home/bob/zUMIs_output/SS3_human_2samples_testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/rds/general/user/xyz16/home/bob'
Execution halted

This is quite similar to the issue above (when I was using my own Smart-seq3 data), including:

  1. the absence of the project.dgecounts.rds files.
  2. samtools view: writing to standard output failed: Broken pipe
  3. error in fread

I think this has something to do with the samtools in the miniconda env. Please advise. Thank you very much!

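The two files named in the errors above can be checked for directly after a run. This is a minimal Python sketch under the assumption that the paths follow the pattern visible in the error messages (`<out_dir>/zUMIs_output/<project>kept_barcodes.txt` and `<out_dir>/zUMIs_output/expression/<project>.dgecounts.rds`); `missing_outputs` is a hypothetical helper, not part of zUMIs.

```python
from pathlib import Path


def missing_outputs(out_dir, project):
    """Return the expected output paths that do not exist as files.

    The path layout is copied from the error messages in this thread
    and is an assumption, not a documented zUMIs contract.
    """
    base = Path(out_dir) / "zUMIs_output"
    expected = [
        base / f"{project}kept_barcodes.txt",
        base / "expression" / f"{project}.dgecounts.rds",
    ]
    return [path for path in expected if not path.is_file()]
```

Calling `missing_outputs("/rds/general/user/xyz16/home/bob", "SS3_human_2samples_test")` returns the subset of those two paths that is still absent, which makes it easy to confirm whether a rerun got past the barcode-detection and counting steps.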
roxyisat-rex commented 2 years ago

The other part of the terminal output is as such:

 You provided these parameters:
 YAML file: runExample_fix.yaml
 zUMIs directory:       /rds/general/user/xyz16/home/bob/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   20
 zUMIs version 2.9.7 

Thu 28 Oct 07:51:18 BST 2021
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
Thu 28 Oct 07:51:30 BST 2021
Mapping...
[1] "2021-10-28 07:51:33 BST"
Oct 28 07:51:33 ..... started STAR run
Oct 28 07:51:34 ..... loading genome
Oct 28 07:51:34 ..... processing annotations GTF
Oct 28 07:51:35 ..... inserting junctions into the genome indices
Oct 28 07:51:38 ..... started 1st pass mapping
Oct 28 07:53:39 ..... finished 1st pass mapping
Oct 28 07:53:39 ..... inserting junctions into the genome indices
Oct 28 07:53:43 ..... started mapping
Oct 28 07:55:48 ..... finished mapping
Oct 28 07:55:48 ..... finished successfully
Thu 28 Oct 07:55:48 BST 2021
Counting...
[1] "2021-10-28 07:56:16 BST"
Thu 28 Oct 07:56:16 BST 2021
[1] "loomR found"
Thu 28 Oct 07:56:19 BST 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-10-28 07:56:19 BST"
Thu 28 Oct 07:56:30 BST 2021

This is also similar to when I was using my own data: it pauses at descriptive statistics. To answer your previous question: when I was using my own SS3 data, I did not get the *.filtered.tagged.unmapped.bam file in the output directory. I got the following output files:

  1. project.BCstats.txt (1.38 MB)
  2. project.filtered.tagged.Log.final.out (0KB)
  3. project.final_annot.gtf (1.2GB)
  4. project.zUMIs_runlog.txt (1KB)
  5. project.barcodes.txt (5KB)
  6. the output directories expression and stats, both of which are empty.
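A listing like the one above, including zero-byte files such as the empty Log.final.out, can be generated rather than typed by hand. This is a minimal Python sketch; `list_output_sizes` and `empty_files` are hypothetical helpers, and nothing here is specific to zUMIs.

```python
from pathlib import Path


def list_output_sizes(out_dir):
    """Return (relative path, size in bytes) for every file under out_dir."""
    root = Path(out_dir)
    return sorted(
        (str(path.relative_to(root)), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


def empty_files(out_dir):
    """Return the relative paths of all zero-byte files under out_dir."""
    return [name for name, size in list_output_sizes(out_dir) if size == 0]
```

Zero-byte files reported by `empty_files` are often a sign that the step writing them started but did not finish.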

When using the test data described above, I got:

  1. filtered.tagged.Aligned.out.bam (63.9MB)
  2. filtered.Aligned.toTranscriptome.out.bam (47MB)
  3. filtered.tagged.of.final.out (2kb)
  4. filtered.tagged.log.out (22kb)
  5. filtered.tagged.log.progress.out (1kb)
  6. filtered.tagged.SJ.out.tab (263kb)
  7. t.filtered.tagged.unmapped.bam (50mb)
  8. .final_annot.gtf(24mb)
  9. .BC.stats.txt (1kb)

The cluster is Linux. Thank you!

roxyisat-rex commented 2 years ago

Feel free to reopen the issue if you still need assistance.

This is my YAML file for the test data:

project: SS3_human_2samples_test
sequence_files:
  file1:
    name: /rds/general/user/xyz16/home/bob/barcoderead_HEK.1mio.fq.gz
    base_definition:
    - BC(1-6)
    - UMI(7-16)
  file2:
    name: /rds/general/user/xyz16/home/bob/cDNAread_HEK.1mio.fq.gz
    base_definition: 
    - cDNA(1-50)
reference:
  STAR_index: /rds/general/user/xyz16/home/bob/hg38_chr22_STAR7
  GTF_file: /rds/general/user/xyz16/home/bob/GRCh38.95.chr22.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /rds/general/user/xyz16/home/bob
num_threads: 8
mem_limit: 20
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: ~
  automatic: yes
  BarcodeBinning: 0
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
samtools_exec: samtools 
zUMIs_directory: /rds/general/user/xyz16/home/bob/zUMIs
pigz_exec: pigz
STAR_exec: STAR
Rscript_exec: Rscript
read_layout: PE
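The `base_definition` entries in the YAML files in this thread all follow a `TYPE(start-end)` pattern, so they can be sanity-checked before submitting a job. This is a minimal Python sketch that only handles the three types seen here (BC, UMI, cDNA); `parse_base_definition` is a hypothetical helper and is not how zUMIs itself parses the file.

```python
import re

# Matches entries such as "BC(1-6)", "UMI(7-16)" or "cDNA(1-50)".
_BASE_DEF = re.compile(r"^(BC|UMI|cDNA)\((\d+)-(\d+)\)$")


def parse_base_definition(entry):
    """Parse 'TYPE(start-end)' into (type, start, end) with 1-based positions."""
    match = _BASE_DEF.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised base_definition entry: {entry!r}")
    kind = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start < 1 or start > end:
        raise ValueError(f"invalid range in base_definition entry: {entry!r}")
    return kind, start, end
```

For instance, `parse_base_definition("UMI(7-16)")` yields `("UMI", 7, 16)`, a 10-base UMI, which can then be compared against the actual read length of the corresponding FASTQ file.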