sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
271 stars 67 forks source link

samtools in miniconda looks problematic #287

Closed roxyisat-rex closed 2 years ago

roxyisat-rex commented 2 years ago

Hi Christoph

I am writing in hopes of reopening issue #283 (Sorry about the delay in getting back, I was working on another project which was quite urgent). I re-ran with the test/ sample data provided in the usage link. Below is the error from terminal:

Loading anaconda3/personal
  Loading requirement: fix_unwritable_tmp fix_setxattr
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Error in uik(bccount$cellindex, bccount$cs/1000) : 
  Method is not applicable for such a small vector. Please give at least a 5 numbers vector
Calls: cellBC -> .cellBarcode_unknown -> .FindBCcut -> uik
Execution halted
Warning message:
NAs introduced by coercion 
Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes.txt")) : 
  File '/rds/general/user/xyz16/home/bob/zUMIs_output/SS3_human_2samples_testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/rds/general/user/xyz16/home/bob'
Execution halted
Loading required package: yaml
Loading required package: Matrix
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/rds/general/user/xyz16/home/bob/zUMIs_output/expression/SS3_human_2samples_test.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project,  : 
  File '/rds/general/user/xyz16/home/bob/zUMIs_output/SS3_human_2samples_testkept_barcodes.txt' does not exist or is non-readable. getwd()=='/rds/general/user/xyz16/home/bob'
Execution halted

This is quite similar to the issue above (when I was using my own Smartseq 3 data described in the previous issue) including

  1. the absence of the project.dgecounts.rds files.
  2. samtools view: writing to standard output failed: Broken pipe
  3. error in fread I think this error may originate from something to do with the samtools in the miniconda env.

The other part of the terminal output shows the below Also similar to when I was using my own data. Pausing at descriptive statistics.

 You provided these parameters:
 YAML file: runExample_fix.yaml
 zUMIs directory:       /rds/general/user/xyz16/home/bob/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   20
 zUMIs version 2.9.7 

Thu 28 Oct 07:51:18 BST 2021
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
Thu 28 Oct 07:51:30 BST 2021
Mapping...
[1] "2021-10-28 07:51:33 BST"
Oct 28 07:51:33 ..... started STAR run
Oct 28 07:51:34 ..... loading genome
Oct 28 07:51:34 ..... processing annotations GTF
Oct 28 07:51:35 ..... inserting junctions into the genome indices
Oct 28 07:51:38 ..... started 1st pass mapping
Oct 28 07:53:39 ..... finished 1st pass mapping
Oct 28 07:53:39 ..... inserting junctions into the genome indices
Oct 28 07:53:43 ..... started mapping
Oct 28 07:55:48 ..... finished mapping
Oct 28 07:55:48 ..... finished successfully
Thu 28 Oct 07:55:48 BST 2021
Counting...
[1] "2021-10-28 07:56:16 BST"
Thu 28 Oct 07:56:16 BST 2021
[1] "loomR found"
Thu 28 Oct 07:56:19 BST 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-10-28 07:56:19 BST"
Thu 28 Oct 07:56:30 BST 2021

To answer your question previous: When I was using my own ss3 data, I did not get the *.filtered.tagged.unmapped.bam file in the output directory. I got the following output files:

project.BCstats.txt (1.38 MB) project.filtered.tagged.Log.final.out (0KB) project.final_annot.gtf (1.2GB) project.zUMIs_runlog.txt (1KB) project.barcodes.txt (5KB) output directories include expression and stats, however both empty. when using the test data describe above, I got:

filtered.tagged.Aligned.out.bam (63.9MB) filtered.Aligned.toTranscriptome.out.bam (47MB) filtered.tagged.of.final.out (2kb) figered.tagged.log.out (22kb) fitered.tagged.log.progress.out (1kb) filtered.tagged.SJ.out.tab (263kb) t.filtered.tagged.unmapped.bam (50mb) .final_annot.gtf(24mb) .BC.stats.txt (1kb)

Here is my YAML for the sample data:

project: SS3_human_2samples_test
sequence_files:
  file1:
    name: /rds/general/user/xyz16/home/bob/barcoderead_HEK.1mio.fq.gz
    base_definition:
    - BC(1-6)
    - UMI(7-16)
  file2:
    name: /rds/general/user/xyz16/home/bob/cDNAread_HEK.1mio.fq.gz
    base_definition: 
    - cDNA(1-50)
reference:
  STAR_index: /rds/general/user/xyz16/home/bob/hg38_chr22_STAR7
  GTF_file: /rds/general/user/xyz16/home/bob/GRCh38.95.chr22.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /rds/general/user/xyz16/home/bob
num_threads: 8
mem_limit: 20
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: ~
  automatic: yes
  BarcodeBinning: 0
  nReadsperCell: 100
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 0
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
samtools_exec: samtools 
zUMIs_directory: /rds/general/user/xyz16/home/bob/zUMIs
pigz_exec: pigz
STAR_exec: STAR
Rscript_exec: Rscript
read_layout: PE

Your help and advice is much appreciated. Thank you very much!

Originally posted by @roxyisat-rex in https://github.com/sdparekh/zUMIs/issues/283#issuecomment-953735987

cziegenhain commented 2 years ago

It still seems to me that you are using your own installation of dependencies as opposed to the conda environment provided by zUMIs. (I'm inferring this from the verbose above where it says Loading anaconda3/personal). Be sure to call zUMIs.sh -c -y your_yaml.yaml

The pipeline seems to fail quite early so I'm not surprised that you do not get the expected outputs.

Best, Christoph

roxyisat-rex commented 2 years ago

It still seems to me that you are using your own installation of dependencies as opposed to the conda environment provided by zUMIs. (I'm inferring this from the verbose above where it says Loading anaconda3/personal). Be sure to call zUMIs.sh -c -y your_yaml.yaml

The pipeline seems to fail quite early so I'm not surprised that you do not get the expected outputs.

Best, Christoph

Dear Christoph

You were right about the issue! I always add the load module command because we are running on a cluster and if we don't, it doesn't have anything, so I always customarily put that line, and because it was before the minconda line, I thought it wouldn't confuse zUMIS but it it. However given zUMIs is using its own miniconda, I don't need to load anything anyways. Anyways, I did with out load anaconda/personal and I think it has ran successfully I think. The output directories looks about right (missing a couple) but I want to confirm with you because the terminal output is still includes a NA introduced by coercion . Here are the outputs of the terminal, and directories. Terminal:

Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file: runExample_fix.yaml
 zUMIs directory:       /rds/general/user/xyz16/home/bob/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   20
 zUMIs version 2.9.7 

Mon 15 Nov 17:03:17 GMT 2021
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
Mon 15 Nov 17:03:29 GMT 2021
[1] "43 barcodes detected."
[1] "193783 reads were assigned to barcodes that do not correspond to intact cells."
Mapping...
[1] "2021-11-15 17:03:37 GMT"
Nov 15 17:03:37 ..... started STAR run
Nov 15 17:03:37 ..... loading genome
Nov 15 17:03:37 ..... processing annotations GTF
Nov 15 17:03:38 ..... inserting junctions into the genome indices
Nov 15 17:03:40 ..... started 1st pass mapping
Nov 15 17:05:28 ..... finished 1st pass mapping
Nov 15 17:05:28 ..... inserting junctions into the genome indices
Nov 15 17:05:32 ..... started mapping
Nov 15 17:07:19 ..... finished mapping
Nov 15 17:07:19 ..... finished successfully
Mon 15 Nov 17:07:19 GMT 2021
Counting...
[1] "2021-11-15 17:07:33 GMT"
[1] "9e+07 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/rds/general/user/xyz16/home/bob/SS3_human_2samples_test.final_annot.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S SS3_human_2samples_test.filtered.tagged.Al ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 8                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid64477 ...         ||
||    Features : 7370                                                         ||
||    Meta-features : 1353                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file SS3_human_2samples_test.filtered.tagged.Aligned.out.b ... ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 853296                                               ||
||    Successfully assigned alignments : 20940 (2.5%)                         ||
||    Running time : 0.02 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "Assigning reads to features (in)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S SS3_human_2samples_test.filtered.tagged.Al ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 8                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid64477 ...         ||
||    Features : 4843                                                         ||
||    Meta-features : 677                                                     ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file SS3_human_2samples_test.filtered.tagged.Aligned.out.b ... ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 853296                                               ||
||    Successfully assigned alignments : 44361 (5.2%)                         ||
||    Running time : 0.02 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-11-15 17:07:45 GMT"
[1] "Coordinate sorting final bam file..."
[1] "2021-11-15 17:07:46 GMT"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 43 barcodes in this chunk..."
[1] "2021-11-15 17:07:52 GMT"
[1] "I am done!! Look what I produced.../rds/general/user/xyz16/home/bob/zUMIs_output/"
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  7323890 391.2   12490967 667.1 12490967 667.1
Vcells 13389077 102.2   42460238 324.0 40068505 305.7
Mon 15 Nov 17:07:53 GMT 2021
[1] "loomR found"
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%Mon 15 Nov 17:08:01 GMT 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-11-15 17:08:01 GMT"
[1] "9e+07 Reads per chunk"
[1] "Extracting reads from bam file(s)..."
[1] "Working on chunk 1"
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 4411222 235.6    8420480 449.8  8376218 447.4
Vcells 8120869  62.0   23065286 176.0 22913835 174.9
Mon 15 Nov 17:08:17 GMT 2021
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1
Warning message:
NAs introduced by coercion 
Warning message:
`as_quosure()` requires an explicit environment as of rlang 0.3.0.
Please supply `env`.
This warning is displayed once per session. 
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Loading required package: yaml
Loading required package: Matrix
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 

And below are the output directories:

  1. final-annot.gtf
  2. filtered.tagged.unmapped.bam
  3. tagged.SJ.out.tab
  4. filtered.tagged.Log.progress.out
  5. filtered.tagged.Log.out
  6. filtered.tagged.Log.final.out
  7. filtered.tagged.aligned.toTranscriptome.out.bam
  8. filtered.tagged.aligned.out.bam
  9. filtered.aligned.genetagged.sorted.bam.bai
  10. filtered.alighed.geneTagged.sorted.bam I did not get the .filtered.tagged.Aligned.out.bam.ex.featureCounts.bam and .filtered.tagged.Aligned.out.bam.in.featureCounts.bam though.

and within the zUMIS_outout directory: I did not get the .currentRGgroup.txt file. And within the sub-directories, Stats: the readspercell.pdf is missing? only the readspercell.txt is present. Expression: .dgecounts.rds .gene_names.txt and a bunch of .loom files, which I am too not 100% sure what to make of those, especially those .inex.x.loom files. it would be very helpful to hear your thoughts! And sorry for the bother. Thank you very much!

Best, Roxy

cziegenhain commented 2 years ago

Hi,

The .featureCounts.bam files are not expected output anymore, that relates to very old versions of zUMIs. The geneTagged.sorted.bam includes both the exon and intron tags: https://github.com/sdparekh/zUMIs/wiki/Output

the same goes for .readspercell.pdf - we have switched it off because it was hard to interpret for large datasets.

The loom files correspond to each of the quantifications that are all present in the single .dgecounts.rds output file. https://github.com/sdparekh/zUMIs/wiki/Output#structure-of-the-output-dgecounts-object-in-project_namedgecountsrds We offer loom for python users as it is easier to load than the .rds files.

Best, Christoph