sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

filtering error: broken pipe #204

Closed bettycatherine closed 4 years ago

bettycatherine commented 4 years ago

Hi,

Thanks for creating zUMIs, especially for supporting split-seq protocols, since that is what I am working with.

However, I encountered some problems in the first step of using zUMIs, and I got the following error messages:

    pigz: abort: write error on (Broken pipe)
    Error in eval(bysub, parent.frame(), parent.frame()) : object 'XC' not found
    Calls: cellBC -> [ -> [.data.table -> eval -> eval
    In addition: Warning message:
    In data.table::fread(bccount_file, header = FALSE, col.names = c("XC", :
      File '/gpfs/home/lvxue/softwares/zUMIs/test/test.BCstats.txt' has size 0. Returning a NULL data.table.
    Execution halted
    Warning message:
    In data.table::fread(paste(samtools, "view", filtered_bams[1], "| cut -f10 | head -n 1000"), :
      File '/tmp/Rtmp5GkGHP/file2e67b3b591ac' has size 0. Returning a NULL data.table.

The output on the screen was as follows:


    Good news! A newer version of zUMIs is available at https://github.com/sdparekh/zUMIs

    You provided these parameters:
    YAML file: test.yaml
    zUMIs directory: /gpfs/home/lvxue/softwares/zUMIs
    STAR executable /gpfs/home/lvxue/softwares/STAR-master/bin/Linux_x86_64/STAR
    samtools executable /gpfs/home/lvxue/softwares/samtools-1.10/samtools
    pigz executable /gpfs/home/lvxue/softwares/pigz-2.4/pigz
    Rscript executable /gpfs/software/R-3.6.0/bin/Rscript
    RAM limit: 0
    zUMIs version 2.9.3d

    Thu Jul 30 16:48:23 CST 2020
    WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a.
    This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
    Filtering...
    Thu Jul 30 16:51:12 CST 2020
    Mapping...
    [1] "2020-07-30 16:51:14 CST"
    NULL
    Thu Jul 30 16:51:15 CST 2020
    Counting...
    [1] "2020-07-30 16:51:26 CST"
    Thu Jul 30 16:51:26 CST 2020
    [1] "loomR found"
    Thu Jul 30 16:51:28 CST 2020
    Descriptive statistics...
    [1] "I am loading useful packages for plotting..."
    [1] "2020-07-30 16:51:28 CST"
    Thu Jul 30 16:51:39 CST 2020

I searched the existing issues and could not find anything similar to mine, so I am posting a new issue. My YAML file looked like this:

    project: test
    sequence_files:
      file1:
        name: /gpfs/home/lvxue/softwares/zUMIs/test/1_L4_A001.R1.fastq.gz
        base_definition: cDNA(1-150)
      file2:
        name: /gpfs/home/lvxue/softwares/zUMIs/test/1_L4_A001.R2.fastq.gz
        base_definition:

This file was generated from the web tool, and I checked everything but could not work out what the problem is. I thought there might be something wrong with pigz, so I set the thread count to 1; the program ran longer, but I still got error messages:

    pigz: abort: write error on (Broken pipe)
    Error in eval(bysub, parent.frame(), parent.frame()) : object 'XC' not found
    Calls: cellBC -> [ -> [.data.table -> eval -> eval
    In addition: Warning message:
    In data.table::fread(bccount_file, header = FALSE, col.names = c("XC", :
      File '/gpfs/home/lvxue/softwares/zUMIs/test/test.BCstats.txt' has size 0. Returning a NULL data.table.
    Execution halted
    Warning message:
    In data.table::fread(paste(samtools, "view", filtered_bams[1], "| cut -f10 | head -n 1000"), :
      File '/tmp/RtmpjAiOZh/filef8bb7e947972' has size 0. Returning a NULL data.table.

    EXITING because of fatal PARAMETERS error: pGe.sjdbOverhang <=0 while junctions are inserted on the fly with --sjdbFileChrStartEnd or/and --sjdbGTFfile
    SOLUTION: specify pGe.sjdbOverhang>0, ideally readmateLength-1
    Jul 31 11:12:47 ...... FATAL ERROR, exiting
    Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes.txt")) :
      File '/gpfs/home/lvxue/softwares/zUMIs/test/zUMIs_output/testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/gpfs/home/lvxue/softwares/zUMIs/test'
    Execution halted
    Loading required package: yaml
    Loading required package: Matrix
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
      there is no package called ‘loomR’
    Error in gzfile(file, "rb") : cannot open the connection
    Calls: rds_to_loom -> readRDS -> gzfile
    In addition: Warning message:
    In gzfile(file, "rb") :
      cannot open compressed file '/gpfs/home/lvxue/softwares/zUMIs/test/zUMIs_output/expression/test.dgecounts.rds', probable reason 'No such file or directory'
    Execution halted
    Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, :
      File '/gpfs/home/lvxue/softwares/zUMIs/test/zUMIs_output/testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/gpfs/home/lvxue/softwares/zUMIs/test'
    Execution halted

Could someone please give me some advice on what I should do next? Thank you very much!

Xue

cziegenhain commented 4 years ago

Hi Xue,

Sorry to hear that you are running into issues. It sounds like the pipeline may not be keeping any input reads. How many reads do your input fastq files have? For Split-seq, I would recommend relaxing the barcode and UMI filtering cutoffs significantly!

Please also double check that you have write permissions in the output folder, just in case that might be an issue.

Best, Christoph

bettycatherine commented 4 years ago

Hi Christoph, Thank you for your quick reply! I have 20,825,846 reads in the fastq files, and I used the default filtering cutoffs for barcodes and UMIs (BC_filter: num_bases: 1, phred: 20; UMI_filter: num_bases: 1, phred: 20). I do have write permission in the output folder. So what do you think the problem could be? This really is a small fastq file; I used the smallest one for testing, and I have much bigger ones to process afterwards.

Best, Xue

cziegenhain commented 4 years ago

Hi Xue,

For split-seq, maybe start by trying BC_filter: num_bases: 6 and UMI_filter: num_bases: 2! Also, how does e.g. FastQC look for your barcode/UMI read? I'm a bit unsure what would cause the "broken pipe" error with pigz, but I want to make sure there are simply no reads left over after the stringent default BC & UMI filtering.
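In the YAML, that would look roughly like this; a minimal sketch assuming the filter_cutoffs block layout from the standard zUMIs YAML template (with num_bases meaning the number of bases allowed below the phred cutoff), so please double-check the key names against your own file:

```yaml
filter_cutoffs:
  BC_filter:
    num_bases: 6   # allow up to 6 barcode bases below the phred cutoff
    phred: 20
  UMI_filter:
    num_bases: 2   # allow up to 2 UMI bases below the phred cutoff
    phred: 20
```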

Best, Christoph

bettycatherine commented 4 years ago

Hi Chris, I thought there might be something wrong with my data, so I tried to test on the example files, and got another "broken pipe":

    samtools view: writing to standard output failed: Broken pipe
    samtools view: error closing standard output: -1
    Warning message:
    NAs introduced by coercion
    sh: samtools: command not found
    sh: samtools: command not found
    Error in gsub("SN:", "", chr) : object 'chr' not found
    Calls: .makeSAF ... .chromLengthFilter -> [ -> [.data.table -> eval -> eval -> gsub
    In addition: Warning message:
    In data.table::fread(bread, col.names = c("chr", "len"), header = F) :
      File '/tmp/Rtmptj1zUp/file1d7c75b4367c2' has size 0. Returning a NULL data.table.
    Execution halted
    Loading required package: yaml
    Loading required package: Matrix
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
      there is no package called ‘loomR’
    Error in gzfile(file, "rb") : cannot open the connection
    Calls: rds_to_loom -> readRDS -> gzfile
    In addition: Warning message:
    In gzfile(file, "rb") :
      cannot open compressed file '/gpfs/home/lvxue/softwares/zUMIs/test//zUMIs_output/expression/Example.dgecounts.rds', probable reason 'No such file or directory'
    Execution halted
    Error in gzfile(file, "rb") : cannot open the connection
    Calls: readRDS -> gzfile
    In addition: Warning message:
    In gzfile(file, "rb") :
      cannot open compressed file '/gpfs/home/lvxue/softwares/zUMIs/test//zUMIs_output/expression/Example.dgecounts.rds', probable reason 'No such file or directory'
    Execution halted

And I do not know why it could not find samtools, since I gave the absolute path to samtools. There was output this time, though:

    -rw-r--r-- 1 69153696 Aug 5 11:05 Example.filtered.tagged.Aligned.out.bam
    -rw-r--r-- 1 51242985 Aug 5 11:05 Example.filtered.tagged.Aligned.toTranscriptome.out.bam
    -rw-r--r-- 1     1998 Aug 5 11:05 Example.filtered.tagged.Log.final.out
    -rw-r--r-- 1    14900 Aug 5 11:05 Example.filtered.tagged.Log.out
    -rw-r--r-- 1      364 Aug 5 11:05 Example.filtered.tagged.Log.progress.out
    -rw-r--r-- 1   250645 Aug 5 11:05 Example.filtered.tagged.SJ.out.tab
    drwx------ 2     4096 Aug 5 11:04 Example.filtered.tagged._STARgenome
    -rw-r--r-- 1 53303520 Aug 5 11:04 Example.filtered.tagged.unmapped.bam
    -rw-r--r-- 1 25392573 Aug 5 11:04 Example.final_annot.gtf

I am sure samtools works fine and this directory is writable, so what is the problem?

Best,

Xue

cziegenhain commented 4 years ago

Hi Xue,

Yes, it looks like samtools is not being found under the name samtools. Consider using the conda environment for the dependencies, or set the path to your samtools binary in the YAML file!
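As a sketch, the executable section of the YAML could then point at the absolute paths from your earlier run instead of relying on $PATH; the *_exec key names are assumed from the zUMIs YAML template, so verify them against your copy:

```yaml
STAR_exec: /gpfs/home/lvxue/softwares/STAR-master/bin/Linux_x86_64/STAR
samtools_exec: /gpfs/home/lvxue/softwares/samtools-1.10/samtools
pigz_exec: /gpfs/home/lvxue/softwares/pigz-2.4/pigz
Rscript_exec: /gpfs/software/R-3.6.0/bin/Rscript
```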

Best, C

bettycatherine commented 4 years ago

Hi C, I am confused: since I set "Filtering" as the "which_Stage", shouldn't the output just be something like SplitSeq.filtered.tagged.unmapped.bam according to your pipeline? Is it then expected that no mapping output appeared, regardless of the error messages? Will the mapping only be done when I run the next stage? I do not know what the standard output of each stage is; or should I just use "Summarizing" to run through the whole pipeline?

Best, Xue

cziegenhain commented 4 years ago

Hi Xue,

which_stage just specifies with which stage of the pipeline you start, and zUMIs will always try to run completely through to the end. So starting from Filtering means you do all steps from A to Z.
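As a rough sketch of what that looks like in the YAML (the key name which_Stage is taken from the thread above; check the exact spelling of the stage names in your YAML template):

```yaml
which_Stage: Filtering  # entry point only; zUMIs still runs the later stages (mapping, counting, summary) afterwards
```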

Best Christoph

bettycatherine commented 4 years ago

Hi C, Thank you for your reply; now I understand. I have a related question about split-seq: zUMIs takes the combined priming in the first step (random hexamer and polyT (dT)), i.e. the shared barcodes, into account, but I am wondering how it processes these shared barcodes. Does it simply merge them? Does it use some algorithm to combine the ones belonging to the same cells? Could you please give me some more information, so that I can decide whether to use this method or handle these primers later? Thank you!

Best, Xue

cziegenhain commented 4 years ago

No problem, happy to help! For the barcode sharing, all reads of the barcodes belonging together will be combined. In the case of Split-seq, UMIs are derived independently from the N6 or dT reverse transcription, so essentially you could also sum up UMI counts after a "normal" zUMIs run without the bc sharing option.
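A hedged sketch of what the barcode-sharing route might look like in the YAML; the barcode_sharing key under the barcodes section and the file layout are assumptions based on the zUMIs documentation, so confirm the key name and expected file format for your version:

```yaml
barcodes:
  barcode_sharing: /path/to/barcode_sharing.txt  # assumed key; the file groups the dT and N6 barcodes that belong to the same cell
```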

bettycatherine commented 4 years ago

Hi C, I understand now. Thank you so much for helping me along the way!

Best, Xue

EmmanuelCastro-3312 commented 4 years ago

Hi @bettycatherine,
I got the same pigz and samtools errors (Broken pipe). How did you solve it?

Thanks. Emmanuel.

jon-xu commented 3 years ago

same pigz broken pipe error...

cziegenhain commented 3 years ago

Broken pipe can have many different causes, but to be able to help we need a proper description of what you are doing.

jon-xu commented 3 years ago

Hi Christoph,

Thanks for your reply! I was trying to run the standard pipeline and it stopped at the filtering step. I am working with Quantz-seq2 data.

Here below are the error messages:

    ===========================================
    (base) [uqjxu8@delta067 lsc]$ zUMIs/zUMIs.sh -y zUMIs/zUMIs.yaml&
    [1] 23741
    (base) [uqjxu8@delta067 lsc]$

    You provided these parameters:
    YAML file: zUMIs/zUMIs.yaml
    zUMIs directory: /scratch/90days/uqjxu8/lsc/zUMIs
    STAR executable STAR
    samtools executable samtools
    pigz executable pigz
    Rscript executable Rscript
    RAM limit: null
    zUMIs version 2.9.5

    Tue Mar 2 20:34:45 AEST 2021
    WARNING: The STAR version used for mapping is 2.7.8a and the STAR index was created using the version .
    This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.8a.
    Filtering...
    pigz: abort: write error on (Broken pipe)

And the main part of the yaml file:

    # you can specify between 1 and 4 input files
    sequence_files:
      file1:
        name: /home/uqjxu8/lsc/data/reads_for_zUMIs.R1.fastq.gz
        base_definition:

    # reference genome setup
    reference:
      STAR_index: /home/uqjxu8/lsc/ref/
      GTF_file: /home/uqjxu8/lsc/ref/Homo_sapiens.GRCh38.99.gtf
      exon_extension: no #extend exons by a certain width?
      extension_length: 0 #number of bp to extend exons by
      scaffold_length_min: 0 #minimal scaffold/chromosome length to consider (0 = all)
      additional_files: /home/uqjxu8/lsc/ref/RNAsequins.v2.2.txt #Optional parameter. It is possible to give additional reference sequences here, eg ERCC.fa
      additional_STAR_params: #Optional parameter. you may add custom mapping parameters to STAR here

    # output directory
    out_dir: /home/uqjxu8/lsc/results #specify the full path to the output directory

Thanks! Jon

sdparekh commented 3 years ago

Hi Jon,

Did you make sure that all the dependencies are installed? https://github.com/sdparekh/zUMIs/wiki/Installation#dependencies

It looks like pigz is either not installed or has a different executable path/name. To avoid this, I would suggest using the miniconda option of zUMIs: you just need to add "-c" to your zUMIs run command.

Let us know if you have further questions.

Best, Swati

jon-xu commented 3 years ago

Thanks Swati!

Yes, pigz is properly installed in the same conda environment, and all dependencies are installed. I also tried the -c option and it gave the same error. I was trying to use the docker option, but our HPC administrator is hesitant about that...

jon-xu commented 3 years ago

My package versions, for reference:

    samtools: v1.10
    STAR: 2.7.8a
    R: 3.6.1
    pigz: 2.4

    sessionInfo()
    R version 3.6.1 (2019-07-05)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: CentOS Linux 7 (Core)

    Matrix products: default
    BLAS/LAPACK: /opt/ohpc/pub/libs/gnu8/openblas/0.3.7/lib/libopenblasp-r0.3.7.so

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] parallel  stats4  stats  graphics  grDevices  utils  datasets
    [8] methods   base

    other attached packages:
     [1] ggrastr_0.2.3               plyranges_1.6.10
     [3] GenomicAlignments_1.22.1    Rsamtools_2.2.3
     [5] Biostrings_2.54.0           XVector_0.26.0
     [7] SummarizedExperiment_1.16.1 DelayedArray_0.12.3
     [9] BiocParallel_1.20.1         matrixStats_0.57.0
    [11] GenomicFeatures_1.38.2      AnnotationDbi_1.48.0
    [13] Biobase_2.46.0              GenomicRanges_1.38.0
    [15] GenomeInfoDb_1.22.1         IRanges_2.20.2
    [17] S4Vectors_0.24.4            BiocGenerics_0.32.0
    [19] extraDistr_1.9.1            stringr_1.4.0
    [21] data.table_1.14.0           stringdist_0.9.6.3
    [23] devtools_2.3.2              usethis_2.0.1
    [25] BiocManager_1.30.10         Matrix_1.3-2
    [27] cowplot_1.1.1               dplyr_1.0.4
    [29] mclust_5.4.7                ggplot2_3.3.3
    [31] shinyBS_0.61                shinythemes_1.2.0
    [33] shiny_1.6.0                 yaml_2.2.1
    [35] inflection_1.3.5

    loaded via a namespace (and not attached):
     [1] bitops_1.0-6           fs_1.5.0               bit64_4.0.5
     [4] progress_1.2.2         httr_1.4.2             rprojroot_2.0.2
     [7] tools_3.6.1            utf8_1.1.4             R6_2.5.0
    [10] vipor_0.4.5            DBI_1.1.1              colorspace_2.0-0
    [13] withr_2.4.1            tidyselect_1.1.0       prettyunits_1.1.1
    [16] processx_3.4.5         curl_4.3               bit_4.0.4
    [19] compiler_3.6.1         cli_2.3.1              desc_1.2.0
    [22] rtracklayer_1.46.0     scales_1.1.1           callr_3.5.1
    [25] askpass_1.1            rappdirs_0.3.1         digest_0.6.27
    [28] pkgconfig_2.0.3        htmltools_0.5.1.1      sessioninfo_1.1.1
    [31] dbplyr_2.1.0           fastmap_1.0.1          rlang_0.4.10
    [34] RSQLite_2.2.3          generics_0.1.0         RCurl_1.98-1.2
    [37] magrittr_2.0.1         GenomeInfoDbData_1.2.2 ggbeeswarm_0.6.0
    [40] Rcpp_1.0.6             munsell_0.5.0          fansi_0.4.2
    [43] lifecycle_1.0.0        stringi_1.5.3          zlibbioc_1.32.0
    [46] pkgbuild_1.2.0         BiocFileCache_1.10.2   grid_3.6.1
    [49] blob_1.2.1             promises_1.1.1         crayon_1.4.1
    [52] lattice_0.20-38        hms_1.0.0              ps_1.6.0
    [55] pillar_1.5.0           biomaRt_2.42.1         pkgload_1.2.0
    [58] XML_3.99-0.3           glue_1.4.2             remotes_2.2.0
    [61] vctrs_0.3.6            httpuv_1.5.4           testthat_3.0.2
    [64] gtable_0.3.0           openssl_1.4.3          purrr_0.3.4
    [67] assertthat_0.2.1       cachem_1.0.3           mime_0.9
    [70] xtable_1.8-4           later_1.1.0.1          tibble_3.1.0
    [73] beeswarm_0.2.3         memoise_2.0.0          ellipsis_0.3.1

cziegenhain commented 3 years ago

Could you upload the full YAML as a text file so we can have a look at any potential indentation or formatting errors? It's always hard to see here in the GitHub comments. It would also be great to get the full standard output / error as a text file.

jon-xu commented 3 years ago

Thanks Christoph, they're attached here.

zUMIs.yaml.txt errorwithout-c.txt errorwith-c.txt

cziegenhain commented 3 years ago

When you run without the conda environment, it's visible that not all dependencies are present (e.g. the yaml R package). Of course, this is not the issue in the run log with conda. I'm wondering if the error arises because you give 1 thread and no memory limit. That is very unusual; can you set something more realistic there (e.g. 16 threads and 50 GB RAM)? Otherwise, it could also be an error caused by your input data, in case some file is broken.
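For example, the resource settings in the YAML could be raised to something like this (num_threads and mem_limit, in GB, are the key names I would expect from the zUMIs YAML template; please verify against your copy):

```yaml
num_threads: 16  # instead of 1
mem_limit: 50    # RAM limit in GB; null/0 means no limit was set
```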

jon-xu commented 3 years ago

Thanks Christoph for the advice! I changed it to 16 threads with 50 GB RAM and also checked all the input data; nothing is broken.

Weird...

jon-xu commented 3 years ago

Hi Christoph,

I re-merged the reads per the instructions. Then I ignored the "write error on (Broken pipe)" message, let the run continue, and ended up with a new series of errors:

    You provided these parameters:
    YAML file: zUMIs/zUMIs.yaml
    zUMIs directory: /scratch/90days/uqjxu8/lsc/zUMIs
    STAR executable STAR
    samtools executable samtools
    pigz executable pigz
    Rscript executable Rscript
    RAM limit: 50
    zUMIs version 2.9.5

    Fri Mar 12 11:23:26 AEST 2021
    WARNING: The STAR version used for mapping is 2.7.8a and the STAR index was created using the version 2.7.4a.
    This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.8a.
    Filtering...
    pigz: abort: write error on (Broken pipe)

    samtools view: writing to standard output failed: Broken pipe
    samtools view: error closing standard output: -1
    Fri Mar 12 11:43:10 AEST 2021
    [1] "1992 barcodes detected."
    [1] "3643534 reads were assigned to barcodes that do not correspond to intact cells."
    [1] "Found 2809 daughter barcodes that can be binned into 921 parent barcodes."
    [1] "Binned barcodes correspond to 763602 reads."
    Mapping...
    [1] "2021-03-12 11:45:20 AEST"
    Mar 12 11:45:23 ..... started STAR run
    Mar 12 11:45:23 ..... loading genome

    EXITING because of fatal PARAMETERS error: present --sjdbOverhang=119 is not equal to the value at the genome generation step =100
    SOLUTION:

    Mar 12 11:45:23 ...... FATAL ERROR, exiting
    Fri Mar 12 11:45:37 AEST 2021
    Counting...
    [1] "2021-03-12 11:45:47 AEST"
    [1] "2.25e+08 Reads per chunk"
    [1] "Loading reference annotation from:"
    [1] "/home/uqjxu8/lsc/results/GIH_LSC.final_annot.gtf"
    [main_samview] fail to read the header from "/home/uqjxu8/lsc/results/GIH_LSC.filtered.tagged.Aligned.out.bam".
    [main_samview] fail to read the header from "/home/uqjxu8/lsc/results/GIH_LSC.filtered.tagged.Aligned.out.bam".
    Error in gsub("SN:", "", chr) : object 'chr' not found
    Calls: .makeSAF ... .chromLengthFilter -> [ -> [.data.table -> eval -> eval -> gsub
    In addition: Warning message:
    In data.table::fread(bread, col.names = c("chr", "len"), header = F) :
      File '/var/tmp/pbs.195000.delta2/RtmpxFd9gH/file69de4af903b2' has size 0. Returning a NULL data.table.
    Execution halted
    Fri Mar 12 11:46:06 AEST 2021
    Loading required package: yaml
    Loading required package: Matrix
    [1] "loomR found"
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
      there is no package called ‘loomR’
    Error in gzfile(file, "rb") : cannot open the connection
    Calls: rds_to_loom -> readRDS -> gzfile
    In addition: Warning message:
    In gzfile(file, "rb") :
      cannot open compressed file '/home/uqjxu8/lsc/results/zUMIs_output/expression/GIH_LSC.dgecounts.rds', probable reason 'No such file or directory'
    Execution halted
    Fri Mar 12 11:46:08 AEST 2021
    Descriptive statistics...
    [1] "I am loading useful packages for plotting..."
    [1] "2021-03-12 11:46:08 AEST"
    Error in gzfile(file, "rb") : cannot open the connection
    Calls: readRDS -> gzfile
    In addition: Warning message:
    In gzfile(file, "rb") :
      cannot open compressed file '/home/uqjxu8/lsc/results/zUMIs_output/expression/GIH_LSC.dgecounts.rds', probable reason 'No such file or directory'
    Execution halted

cziegenhain commented 3 years ago

Definitely an odd issue. If you can upload the files somewhere I would give it a try in my system.

Regarding continuing after the broken pipe message: it seems like reads did get processed and barcodes were detected, so that looks promising. The downstream error is just due to the STAR index; you have to use one without splice junction information: https://github.com/sdparekh/zUMIs/wiki/Usage#preparing-star-index-for-mapping. Just a warning: it could be that your files did not get processed completely if you move on, so I would advise checking the number of reads in the input file against the number of reads present in the bam files zUMIs produces.

jon-xu commented 3 years ago

That would be great! I would like to send it to your Twitter account, but it didn't allow me to send a direct message... Would you mind following back so that I can direct message you, please? Mine is @xujun_jon

cziegenhain commented 3 years ago

Hi Jon,

The data you sent ran through fine for me, meaning the issue must be somewhere on your end. Happy to upload the output files somewhere if you want.

jon.log.txt jon.yaml.txt