sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Error with completion with bigger dataset #284

Closed mdhfz89 closed 3 years ago

mdhfz89 commented 3 years ago

Hi, I previously had a problem getting zUMIs to run, but that has been solved. Now I cannot get zUMIs to complete on a bigger dataset. In my initial testing, which led to my previous problem, I used data from 1 of the 2 flowcells downloaded from SRA. In the current attempt, I concatenated the reads from both flowcells to run them as one larger dataset. Here are the read counts for the smaller run versus the current run.

Smaller test run (1 of 2 flowcell)

Larger run (2 flowcells)

I have checked that the two files of the larger run contain the same number of lines, and therefore the same number of reads:

(base) hafiz@scelse:/datadrive03/hafiz/data/microSPLiT/03_2SRA_test$ zcat < 2SRA_1_filtered.fastq.gz | wc -l
881625504
(base) hafiz@scelse:/datadrive03/hafiz/data/microSPLiT/03_2SRA_test$ zcat < 2SRA_2_filtered.fastq.gz | wc -l
881625504
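
Because `wc -l` reports lines rather than reads, the figure above still has to be divided by four to get the read count. A minimal sketch of that arithmetic, assuming standard four-line FASTQ records with no wrapped sequence lines (the function name is illustrative and not part of zUMIs):

```python
def fastq_reads_from_lines(line_count):
    """Convert a FASTQ line count (e.g. from `zcat file | wc -l`) to a read count."""
    if line_count % 4 != 0:
        # A remainder usually points to a truncated or corrupted file.
        raise ValueError(f"{line_count} lines is not a multiple of 4")
    return line_count // 4


# Line counts reported by `wc -l` for the two concatenated files.
lines_file1 = 881625504
lines_file2 = 881625504

reads_file1 = fastq_reads_from_lines(lines_file1)
reads_file2 = fastq_reads_from_lines(lines_file2)

# Paired files must contain the same number of records.
assert reads_file1 == reads_file2
print(reads_file1)  # 220406376
```

So each file of the larger run holds 220,406,376 reads, and the two files are consistent with each other.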

I initially ran the larger set on my own workstation, where the run failed multiple times. That led me to think it might be a memory issue, since the workstation only has 12 cores and 32 GB of RAM. I therefore ran zUMIs on a server instead, but I hit the same problem multiple times there too, even after progressively increasing the available cores and memory like so:

Workstation

Server run 1

Server run 2

Below is the latest YAML file, used for server run 2:

project: microsplitTest_2SRA
sequence_files:
  file1:
    name: 2SRA_1_filtered.fastq.gz
    base_definition: 
      - cDNA(1-76)
  file2:
    name: 2SRA_2_filtered.fastq.gz
    base_definition:
      - BC(11-18,49-56,79-86)
      - UMI(1-10)
reference:
  STAR_index: /datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_bsubgenome
  GTF_file: /datadrive03/hafiz/data/microSPLiT/03_2SRA_test/bsub.gtf
  additional_STAR_params: --alignIntronMax 1 --genomeSAindexNbases 10
  additional_files: ~
out_dir: /datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI
num_threads: 12
mem_limit: 256
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 10
barcodes:
  barcode_num: null
  barcode_file: null
  barcode_sharing: null
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: 100
  demultiplex: yes
counting_opts:
  introns: no
  downsampling: 0
  strand: 0
  Ham_Dist: 0
  write_ham: no
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools
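
As a sanity check on the `base_definition` entries in the YAML above, the following sketch computes how many bases each segment covers and the last read position it needs. It assumes the ranges are 1-based and inclusive, which is how the entries read. The helper `describe_segment` is hypothetical and not part of zUMIs.

```python
import re


def describe_segment(definition):
    """Parse one base_definition entry such as 'BC(11-18,49-56,79-86)'."""
    match = re.fullmatch(r"(\w+)\(([\d,\-]+)\)", definition.strip())
    if match is None:
        raise ValueError(f"unrecognised base_definition: {definition!r}")
    name, ranges = match.groups()
    length = 0
    last_base = 0
    for part in ranges.split(","):
        start, end = (int(value) for value in part.split("-"))
        if start < 1 or end < start:
            raise ValueError(f"invalid range {part!r} in {definition!r}")
        length += end - start + 1          # inclusive range
        last_base = max(last_base, end)    # furthest position the read must reach
    return {"segment": name, "length": length, "last_base": last_base}


# Entries copied from the YAML above.
for entry in ("cDNA(1-76)", "BC(11-18,49-56,79-86)", "UMI(1-10)"):
    print(describe_segment(entry))
```

Under that assumption the barcode is 24 bases assembled from three 8-base blocks, the UMI is 10 bases, and the reads in `file2` must be at least 86 bases long for the last barcode block to exist.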

Below is the stdout from the parts where I see an error message:

Workstation runs

You provided these parameters:
 YAML file: microsplitTest_ubuntu.yaml
 zUMIs directory:       /home/hafiz/tools/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   28
 zUMIs version 2.9.7 

Tue Sep  7 13:24:32 +08 2021
WARNING: The STAR version used for mapping is 2.7.9a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.9a.
Filtering...
Tue Sep  7 13:58:27 +08 2021
[1] "16293 barcodes detected."
[1] "16665515 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 126 daughter barcodes that can be binned into 102 parent barcodes."
[1] "Binned barcodes correspond to 15883 reads."
Mapping...
[1] "2021-09-07 14:02:26 +08"
Warning message:
NAs introduced by coercion 
    STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_bsubgenome --sjdbGTFfile /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/bsub.gtf --runThreadN 10 --sjdbOverhang 73 --readFilesType SAM SE --alignIntronMax 1 --genomeSAindexNbases 10 --twopassMode Basic --readFilesIn /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaa.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAab.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAac.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAad.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAae.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaf.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAag.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAah.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAai.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaj.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAak.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAal.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAam.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAan.filtered.tagged.bam --outFileNamePrefix /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/microsplitTest_2SRA.filtered.tagged.
    STAR version: 2.7.9a   compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 07 14:02:26 ..... started STAR run
Sep 07 14:02:26 ..... loading genome
Sep 07 14:02:26 ..... processing annotations GTF
Sep 07 14:02:26 ..... inserting junctions into the genome indices
Sep 07 14:02:26 ..... started 1st pass mapping
Sep 07 14:11:29 ..... finished 1st pass mapping
Sep 07 14:11:29 ..... inserting junctions into the genome indices
Sep 07 14:11:30 ..... started mapping
Sep 07 14:24:04 ..... finished mapping
Sep 07 14:24:04 ..... finished successfully
Tue Sep  7 14:24:05 +08 2021
Counting...
[1] "2021-09-07 14:24:14 +08"
[1] "1.26e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/microsplitTest_2SRA.final_annot.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_2SRA.filtered.tagged.Aligne ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 12                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid25506 ...         ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_2SRA.filtered.tagged.Aligned.out.bam...    ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 75138166                                             ||
||    Successfully assigned alignments : 46431651 (61.8%)                     ||
||    Running time : 0.71 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-07 14:25:05 +08"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 12 files and 12 in-memory blocks...
[1] "2021-09-07 14:28:00 +08"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 16293 barcodes in this chunk..."
Error in rbindlist(rsamtools_reads, fill = TRUE, use.names = TRUE) : 
  Item 1 of input is not a data.frame, data.table or list
Calls: reads2genes_new -> rbindlist
In addition: Warning message:
In mclapply(1:nrow(idxstats), function(x) { :
  all scheduled cores encountered errors in user code
Execution halted
Tue Sep  7 14:28:01 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Tue Sep  7 14:28:03 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-07 14:28:03 +08"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Tue Sep  7 14:28:07 +08 2021

Server run 1

Rab Sep  8 11:55:23 +08 2021
Counting...
[1] "2021-09-08 11:55:31 +08"
[1] "5.76e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/microsplitTest_2SRA.final_annot.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_2SRA.filtered.tagged.Aligne ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 18                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid13090 ...         ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_2SRA.filtered.tagged.Aligned.out.bam...    ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 75138166                                             ||
||    Successfully assigned alignments : 46431651 (61.8%)                     ||
||    Running time : 0.44 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-08 11:56:04 +08"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 0 files and 18 in-memory blocks...
[1] "2021-09-08 11:58:16 +08"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 16293 barcodes in this chunk..."
Error in rbindlist(rsamtools_reads, fill = TRUE, use.names = TRUE) : 
  Item 1 of input is not a data.frame, data.table or list
Calls: reads2genes_new -> rbindlist
In addition: Warning message:
In mclapply(1:nrow(idxstats), function(x) { :
  all scheduled cores encountered errors in user code
Execution halted
Rab Sep  8 11:58:17 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Rab Sep  8 11:58:18 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-08 11:58:18 +08"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Rab Sep  8 11:58:22 +08 2021

Server run 2

Sep 08 13:02:12 ..... finished successfully
[W::bam_hdr_read] bgzf_check_EOF: Invalid argument
[E::bam_hdr_read] Invalid BAM binary header
[bam_cat] ERROR: couldn't read header for '/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/zUMIs_output/.tmpMap//tmp.microsplitTest_2SRA.10.Aligned.out.bam'.
Rab Sep  8 13:02:52 +08 2021
Counting...
[1] "2021-09-08 13:03:01 +08"
[1] "1.152e+09 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/microsplitTest_2SRA.final_annot.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_2SRA.filtered.tagged.Aligne ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 18                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid16032 ...         ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_2SRA.filtered.tagged.Aligned.out.bam...    ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 10433216                                             ||
||    Successfully assigned alignments : 5679619 (54.4%)                      ||
||    Running time : 0.06 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-08 13:03:11 +08"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 0 files and 18 in-memory blocks...
[1] "2021-09-08 13:03:33 +08"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 16293 barcodes in this chunk..."
Error in rbindlist(rsamtools_reads, fill = TRUE, use.names = TRUE) : 
  Item 1 of input is not a data.frame, data.table or list
Calls: reads2genes_new -> rbindlist
In addition: Warning message:
In mclapply(1:nrow(idxstats), function(x) { :
  all scheduled cores encountered errors in user code
Execution halted
Rab Sep  8 13:03:34 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Rab Sep  8 13:03:35 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-08 13:03:35 +08"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/datadrive03/hafiz/data/microSPLiT/03_2SRA_test/03_zUMI/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Rab Sep  8 13:03:40 +08 2021
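
In server run 2 the first reported problem is `Invalid BAM binary header` for one of the temporary per-chunk BAM files, and the alignment total then drops from 75138166 to 10433216, which would be consistent with an incomplete merged BAM. One way to check whether a temporary BAM is empty or truncated is to look for the BGZF magic bytes at the start and the 28-byte BGZF end-of-file marker at the end, both defined in the SAM/BAM specification. The sketch below is a standalone check with illustrative function names; it is not something zUMIs itself runs.

```python
import os

# First four bytes of every BGZF block (gzip magic plus the "extra field" flag).
BGZF_MAGIC = bytes.fromhex("1f8b0804")

# The empty BGZF block that terminates a complete BAM file (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200" "1b00" "0300"
    "00000000" "00000000"
)


def check_bytes(data):
    """Classify the raw bytes of a (small) file; useful for testing."""
    return {
        "has_magic": data[:4] == BGZF_MAGIC,
        "has_eof_marker": len(data) >= len(BGZF_EOF) and data[-len(BGZF_EOF):] == BGZF_EOF,
    }


def check_bam(path):
    """Read only the head and tail of a file, so large BAMs stay cheap to check."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(4)
        tail = b""
        if size >= len(BGZF_EOF):
            handle.seek(-len(BGZF_EOF), os.SEEK_END)
            tail = handle.read()
    return {"has_magic": head == BGZF_MAGIC, "has_eof_marker": tail == BGZF_EOF}
```

Calling `check_bam` on each file under `.tmpMap/` would flag any chunk that lacks the magic bytes or the end marker. A missing end marker typically means the writer was interrupted, for example by a full disk or a killed process, though the logs above do not show which of these happened.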

I would appreciate any suggestions that could help me run this to completion. Thanks again for your help.

mdhfz89 commented 3 years ago

Also, I can't seem to pull the Docker image for zUMIs, so I can't test whether this is an issue with my installation or something else. I get this error:

sudo docker pull chrzie/zumis2

Using default tag: latest
Error response from daemon: manifest for chrzie/zumis2:latest not found: manifest unknown:

mdhfz89 commented 3 years ago

I guess the problem stems from UMIstuffFUN.R and data.table, but I am really not sure how else to tackle this.
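
To narrow this down, it helps to separate the first failure in each run from the errors that follow it. In the logs above, the `rbindlist` error appears in the Counting stage, and the later `gzfile` errors complain about a `dgecounts.rds` file that Counting never wrote. The accompanying `mclapply` warning ("all scheduled cores encountered errors in user code") means every parallel worker failed, so `rbindlist` received error objects instead of tables. That suggests the root cause lies in the per-worker read loading rather than in `rbindlist` itself. The sketch below pulls the first error and its stage out of a pasted log. The stage markers are taken from the stdout shown here, and the function name is illustrative.

```python
STAGE_MARKERS = ("Filtering...", "Mapping...", "Counting...", "Descriptive statistics...")


def first_error(log_text):
    """Return the first R error in a zUMIs stdout dump together with its stage."""
    stage = None
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped in STAGE_MARKERS:
            stage = stripped.rstrip(".")
        elif stripped.startswith("Error in"):
            # R prints the error message on the line after the failing call.
            detail = lines[index + 1].strip() if index + 1 < len(lines) else ""
            return {"stage": stage, "error": stripped, "detail": detail}
    return None


# Excerpt assembled from the stdout above.
excerpt = """Counting...
[1] "Processing 16293 barcodes in this chunk..."
Error in rbindlist(rsamtools_reads, fill = TRUE, use.names = TRUE) :
  Item 1 of input is not a data.frame, data.table or list
Execution halted
Error in gzfile(file, "rb") : cannot open the connection
Descriptive statistics...
Error in gzfile(file, "rb") : cannot open the connection
"""

print(first_error(excerpt))
```

On this excerpt the function reports the Counting stage and the `rbindlist` call, with the "Item 1 of input is not a data.frame" message as the detail. The `gzfile` errors are skipped because they come later.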

mdhfz89 commented 3 years ago

Just an update: I tried to rerun this with the smaller dataset, and it no longer works either. I'm not sure what broke, but it fails at the same "Counting" step. I then tried to roll back the R packages to versions as close as possible to those listed as successfully tested in the zUMIs wiki, but still got the same error at the same "Counting" step.

Here's the latest stdout:

 You provided these parameters:
 YAML file: microsplitTest_ubuntu2.yaml
 zUMIs directory:       /home/hafiz/tools/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   28
 zUMIs version 2.9.7 

Thu Sep  9 10:00:00 +08 2021
WARNING: The STAR version used for mapping is 2.7.9a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.9a.
Filtering...
Thu Sep  9 10:34:08 +08 2021
Warning message:
replacing previous import ‘vctrs::data_frame’ by ‘tibble::data_frame’ when loading ‘dplyr’ 
[1] "16293 barcodes detected."
[1] "16665515 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 126 daughter barcodes that can be binned into 102 parent barcodes."
[1] "Binned barcodes correspond to 15883 reads."
Mapping...
[1] "2021-09-09 10:38:10 +08"
Warning message:
NAs introduced by coercion 
    STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_bsubgenome --sjdbGTFfile /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/bsub.gtf --runThreadN 10 --sjdbOverhang 73 --readFilesType SAM SE --alignIntronMax 1 --genomeSAindexNbases 10 --twopassMode Basic --readFilesIn /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaa.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAab.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAac.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAad.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAae.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaf.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAag.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAah.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAai.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAaj.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAak.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAal.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAam.filtered.tagged.bam,/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/.tmpMerge//microsplitTest_2SRA.microsplitTest_2SRAan.filtered.tagged.bam --outFileNamePrefix /home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/microsplitTest_2SRA.filtered.tagged.
    STAR version: 2.7.9a   compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Sep 09 10:38:10 ..... started STAR run
Sep 09 10:38:10 ..... loading genome
Sep 09 10:38:10 ..... processing annotations GTF
Sep 09 10:38:10 ..... inserting junctions into the genome indices
Sep 09 10:38:11 ..... started 1st pass mapping
Sep 09 10:47:17 ..... finished 1st pass mapping
Sep 09 10:47:17 ..... inserting junctions into the genome indices
Sep 09 10:47:19 ..... started mapping
Sep 09 10:59:57 ..... finished mapping
Sep 09 10:59:57 ..... finished successfully
Thu Sep  9 10:59:58 +08 2021
Counting...
Warning message:
replacing previous import ‘vctrs::data_frame’ by ‘tibble::data_frame’ when loading ‘dplyr’ 
[1] "2021-09-09 11:00:05 +08"
[1] "1.26e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/microsplitTest_2SRA.final_annot.gtf"
[1] "Annotation loaded!"
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_2SRA.filtered.tagged.Aligne ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 12                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid23271 ...         ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_2SRA.filtered.tagged.Aligned.out.bam...    ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 75138166                                             ||
||    Successfully assigned alignments : 46431651 (61.8%)                     ||
||    Running time : 0.73 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-09 11:00:58 +08"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 12 files and 12 in-memory blocks...
[1] "2021-09-09 11:03:54 +08"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 16293 barcodes in this chunk..."
Error in rbindlist(rsamtools_reads, fill = TRUE, use.names = TRUE) : 
  Item 1 of input is not a data.frame, data.table or list
Calls: reads2genes_new -> rbindlist
In addition: Warning message:
In mclapply(1:nrow(idxstats), function(x) { :
  all scheduled cores encountered errors in user code
Execution halted
Thu Sep  9 11:03:54 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Thu Sep  9 11:03:56 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-09 11:03:56 +08"
Warning message:
replacing previous import ‘vctrs::data_frame’ by ‘tibble::data_frame’ when loading ‘dplyr’ 
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/hafiz/Documents/Hafiz/microSPLiT/reads/03_2SRA_test/03_zUMI_2/zUMIs_output/expression/microsplitTest_2SRA.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Thu Sep  9 11:04:01 +08 2021
cziegenhain commented 3 years ago

Hi,

Sorry for the late answer, this issue slipped my attention. A couple of pointers: 30 GB on the workstation is potentially a bit tight. Regarding the runs on the server, I do not recommend setting the mem_limit close to the machine's physical memory limit. Just setting it to ~100 GB should do nicely even for very large datasets.
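
The advice about keeping mem_limit away from the machine's limit can be checked with a few lines of Python. This is only a rough sketch: the helper names are hypothetical, and the 20% headroom default is an illustrative assumption, not a zUMIs rule.

```python
import os


def physical_ram_gb():
    """Total physical RAM in GB, read via sysconf (Linux and most Unix systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def leaves_headroom(mem_limit_gb, physical_gb, min_free_fraction=0.2):
    """Return True if mem_limit leaves at least min_free_fraction of RAM unused.

    The 20% default is an assumption chosen for illustration only.
    """
    return mem_limit_gb <= physical_gb * (1 - min_free_fraction)


if __name__ == "__main__":
    total = physical_ram_gb()
    # 100 is the mem_limit value suggested in the comment above
    print(f"physical RAM: {total:.1f} GB, headroom ok: {leaves_headroom(100, total)}")
```

With these assumptions, a mem_limit of 100 on a 128 GB machine passes the check, while 30 on a 32 GB machine does not.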

Does the automatically selected number of barcodes make sense?

I'm guessing from the name of your project that this is bacterial scRNA-seq with microSPLIT? I do not have experience with this, but there can be a lot of unexpected things happening that we did not account for when testing zUMIs. Would you mind sharing a small dataset that reproduces this error message along with the genome reference & gtf?

Best, Christoph

mdhfz89 commented 3 years ago

Hi Christoph,

Thanks for replying. I had to be away from the lab again due to Covid restrictions, hence my late reply. I have provided two datasets here: a 1-million-read subset and the data from one flowcell. Both worked before but suddenly stopped working. I have also provided the yaml file from each of the runs that worked before, together with the reference and gtf, in this Dropbox link. I'm not sure how else best to send you these.

https://www.dropbox.com/sh/qzxz7012c6b18tr/AADuXMbOyKL50iV1trArx1yEa?dl=0

Thank you so much for checking these out. Also, do you know what is going on with the Docker image? I wanted to test it but I can't even get it downloaded.

Best regards, Hafiz

cziegenhain commented 3 years ago

Hi Hafiz,

I will take a look in the coming days.

cziegenhain commented 3 years ago

Hi Hafiz,

I just ran the datasets you had uploaded. For all tests, I used zUMIs with the -c conda option. To prepare for that, I used the STAR from the zUMIs conda environment to generate the index from your fasta file:

~/programs/zUMIs/zUMIs-env/bin/STAR --runMode genomeGenerate --runThreadN 12 --genomeDir bsubgenome_273a --genomeFastaFiles bsub.fasta --genomeSAindexNbases 10 --limitGenomeGenerateRAM 24000000000

I'm attaching the yaml files, where I stayed with the same settings you had used. In the future, I would definitely recommend that you increase the cutoffs for the filtering of BC and UMI sequences. The defaults you used are very stringent and you will lose a lot of reads. This of course always depends on the data quality of the sequencing run at hand.

This is the log for the small dataset:

~/programs/zUMIs/zUMIs.sh -c -y microsplitTest_ubuntu_1m.yaml 
Warning: YAML file doesn't include 'pigz_exec' option; setting to 'pigz'
Warning: YAML file doesn't include 'STAR_exec' option; setting to 'STAR'
Warning: YAML file doesn't include 'Rscript_exec' option; setting to 'Rscript'
Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file: microsplitTest_ubuntu_1m.yaml
 zUMIs directory:       /home/chris/programs/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   24
 zUMIs version 2.9.7 

ons 29 sep 2021 16:06:34 CEST
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
ons 29 sep 2021 16:06:39 CEST
[1] "37 barcodes detected."
[1] "6689 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 0 daughter barcodes that can be binned into 0 parent barcodes."
[1] "Binned barcodes correspond to 0 reads."
Warning message:
In min(hamming) : no non-missing arguments to min; returning Inf
Mapping...
[1] "2021-09-29 16:06:42 CEST"
Warning message:
NAs introduced by coercion 
Sep 29 16:06:42 ..... started STAR run
Sep 29 16:06:42 ..... loading genome
Sep 29 16:06:42 ..... processing annotations GTF
Sep 29 16:06:42 ..... inserting junctions into the genome indices
Sep 29 16:06:42 ..... started 1st pass mapping
Sep 29 16:06:48 ..... finished 1st pass mapping
Sep 29 16:06:48 ..... inserting junctions into the genome indices
Sep 29 16:06:49 ..... started mapping
Sep 29 16:06:57 ..... finished mapping
Sep 29 16:06:57 ..... finished successfully
ons 29 sep 2021 16:06:57 CEST
Counting...
[1] "2021-09-29 16:07:05 CEST"
[1] "1.08e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/chris/projects/zUMIs284/1M_subset/out/microsplitTest_ubuntu.final_annot.gtf"
[1] "Annotation loaded!"
Warning message:
`as_quosure()` requires an explicit environment as of rlang 0.3.0.
Please supply `env`.
This warning is displayed once per session. 
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_ubuntu.filtered.tagged.Alig ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 10                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid122526 ...        ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_ubuntu.filtered.tagged.Aligned.out.bam...  ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 449859                                               ||
||    Successfully assigned alignments : 266544 (59.3%)                       ||
||    Running time : 0.02 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-29 16:07:13 CEST"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 0 files and 10 in-memory blocks...
[1] "2021-09-29 16:07:13 CEST"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 37 barcodes in this chunk..."
[1] "Demultiplexing output bam file by cell barcode..."
[1] "Using python implementation to demultiplex."
[1] "2021-09-29 16:07:15 CEST"
[1] "Demultiplexing zUMIs bam file..."
[1] "Demultiplexing complete."
[1] "2021-09-29 16:07:17 CEST"
[1] "2021-09-29 16:07:17 CEST"
[1] "I am done!! Look what I produced.../home/chris/projects/zUMIs284/1M_subset/out/zUMIs_output/"
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  7290379 389.4   12296361 656.7  9403756 502.3
Vcells 12957074  98.9   29932066 228.4 29809857 227.5
ons 29 sep 2021 16:07:17 CEST
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
ons 29 sep 2021 16:07:20 CEST
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-29 16:07:20 CEST"
notch went outside hinges. Try setting notch=FALSE.
notch went outside hinges. Try setting notch=FALSE.
[1] "1.08e+08 Reads per chunk"
[1] "Extracting reads from bam file(s)..."
[1] "Working on chunk 1"
Warning message:
In `[.data.table`(data.table::fread(samfile, na.strings = c(""),  :
  Column 'GEin' does not exist to remove
          used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 4409029 235.5    8421118 449.8  6999004 373.8
Vcells 7628320  58.2   12531327  95.7 10376106  79.2
ons 29 sep 2021 16:07:29 CEST

So there are two warnings, one concerning the barcode error correction and one concerning the statistics at the end (lack of intron information), but it runs through fine.

This is the log for the full dataset:

~/programs/zUMIs/zUMIs.sh -c -y microsplitTest_ubuntu.yaml 
Warning: YAML file doesn't include 'pigz_exec' option; setting to 'pigz'
Warning: YAML file doesn't include 'STAR_exec' option; setting to 'STAR'
Warning: YAML file doesn't include 'Rscript_exec' option; setting to 'Rscript'
Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file: microsplitTest_ubuntu.yaml
 zUMIs directory:       /home/chris/programs/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   24
 zUMIs version 2.9.7 

ons 29 sep 2021 16:12:07 CEST
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
ons 29 sep 2021 16:23:41 CEST
[1] "14016 barcodes detected."
[1] "15138734 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 114 daughter barcodes that can be binned into 93 parent barcodes."
[1] "Binned barcodes correspond to 14502 reads."
Mapping...
[1] "2021-09-29 16:25:48 CEST"
Warning message:
NAs introduced by coercion 
Sep 29 16:25:48 ..... started STAR run
Sep 29 16:25:48 ..... loading genome
Sep 29 16:25:48 ..... processing annotations GTF
Sep 29 16:25:48 ..... inserting junctions into the genome indices
Sep 29 16:25:48 ..... started 1st pass mapping
Sep 29 16:33:11 ..... finished 1st pass mapping
Sep 29 16:33:11 ..... inserting junctions into the genome indices
Sep 29 16:33:12 ..... started mapping
Sep 29 16:42:00 ..... finished mapping
Sep 29 16:42:00 ..... finished successfully
ons 29 sep 2021 16:42:01 CEST
Counting...
[1] "2021-09-29 16:42:08 CEST"
[1] "1.08e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/chris/projects/zUMIs284/1FC_set/out/microsplitTest_ubuntu.final_annot.gtf"
[1] "Annotation loaded!"
Warning message:
`as_quosure()` requires an explicit environment as of rlang 0.3.0.
Please supply `env`.
This warning is displayed once per session. 
[1] "Assigning reads to features (ex)"

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.32.4

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           S microsplitTest_ubuntu.filtered.tagged.Alig ... ||
||                                                                            ||
||              Annotation : R data.frame                                     ||
||      Assignment details : <input_file>.featureCounts.bam                   ||
||                      (Note that files are saved to the output directory)   ||
||                                                                            ||
||      Dir for temp files : .                                                ||
||                 Threads : 10                                               ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||      Multimapping reads : counted                                          ||
||     Multiple alignments : primary alignment only                           ||
|| Multi-overlapping reads : not counted                                      ||
||   Min overlapping bases : 1                                                ||
||                                                                            ||
||          Chimeric reads : not counted                                      ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid130373 ...        ||
||    Features : 4539                                                         ||
||    Meta-features : 4536                                                    ||
||    Chromosomes/contigs : 1                                                 ||
||                                                                            ||
|| Process BAM file microsplitTest_ubuntu.filtered.tagged.Aligned.out.bam...  ||
||    Single-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 68675772                                             ||
||    Successfully assigned alignments : 40899323 (59.6%)                     ||
||    Running time : 0.55 minutes                                             ||
||                                                                            ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[1] "2021-09-29 16:42:47 CEST"
[1] "Coordinate sorting final bam file..."
[bam_sort_core] merging from 10 files and 10 in-memory blocks...
[1] "2021-09-29 16:44:35 CEST"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 14016 barcodes in this chunk..."
[1] "Demultiplexing output bam file by cell barcode..."
[1] "Using python implementation to demultiplex."
[1] "2021-09-29 17:12:45 CEST"
[1] "Breaking up demultiplexing in 16 chunks. This may be because you have >10000 cells or a too low filehandle limit (ulimit -n)."
[1] "Demultiplexing zUMIs bam file..."
[1] "Demultiplexing complete."
[1] "2021-09-29 17:39:29 CEST"
[1] "2021-09-29 17:39:29 CEST"
[1] "I am done!! Look what I produced.../home/chris/projects/zUMIs284/1FC_set/out/zUMIs_output/"
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   8346393  445.8   12296361  656.7  12296361  656.7
Vcells 132291276 1009.4  539891338 4119.1 842345158 6426.6
ons 29 sep 2021 17:39:31 CEST
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
Transposing input data: loom file will show input columns (cells) as rows and input rows (genes) as columns 
This is to maintain compatibility with other loom tools 
  |======================================================================| 100%
ons 29 sep 2021 17:39:48 CEST
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-09-29 17:39:49 CEST"
[1] "1.08e+08 Reads per chunk"
[1] "Extracting reads from bam file(s)..."
[1] "Working on chunk 1"
Warning message:
In `[.data.table`(data.table::fread(samfile, na.strings = c(""),  :
  Column 'GEin' does not exist to remove
           used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells  4441363 237.2   10963722  585.6   8680035  463.6
Vcells 34957190 266.8  542877470 4141.9 659124247 5028.8
ons 29 sep 2021 17:48:45 CEST

All the tests were run on my local workstation (48 threads / 128 GB RAM) running elementaryOS (i.e. Ubuntu). Since even the full dataset is really quite small, the RAM requirement is minimal and you shouldn't have issues with that at all.

Regarding the Docker image, I haven't checked or updated it in a while, but if you would like to use Docker, it would be as simple as creating a new Ubuntu image and running git clone https://github.com/sdparekh/zUMIs.git

Happy to upload the output if it's usable for you?

From this, I can't really see why you are getting an error; it seems like zUMIs should be doing fine.

microsplitTest_ubuntu_1m.yaml.txt full_microsplitTest_ubuntu.yaml.txt

Best, Christoph

mdhfz89 commented 3 years ago

Hi Christoph,

I'm embarrassed to say that it did not occur to me that using the "-c" option would solve the issues I'm facing. It seems like I only face this issue when using the zUMIs dependencies that I installed myself. I really wonder how different they are, given the errors I was facing. Thank you for your suggestion and help regarding this!

I understand your point regarding the cutoffs for the BC and UMIs. I will definitely do as you suggest. I'm currently just figuring out an analysis pipeline for the microSPLIT experiments that my lab is thinking of doing, so I was leaving defaults where possible. I tried running with the less stringent cutoffs you suggested previously (BC_filter: num_bases: 5, phred: 20 and UMI_filter: num_bases: 4, phred: 20) for the full published dataset on my local server and was surprised at just how many more counts I'm getting. Is there a good way to determine which cutoffs to use? Or should I just try and judge for myself?

Again, thank you for your help in testing and figuring out what worked. Really appreciate it.

Best regards, Hafiz

cziegenhain commented 3 years ago

Hi Hafiz,

No worries. Sometimes the dependencies can be a bit tricky, but I'm glad it just works with the conda environment.

I usually take a look at FastQC plots to decide on the cutoff, taking into account the length of the UMI or BC. So as you say, it's a bit of an arbitrary judgement call. The main goal is to discard clearly unusable reads, so if you are in doubt you can always be on the lenient side. I am not sure how microSPLIT works, but if there is an expectation of which BC sequences are valid, that always provides added confidence.
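
As a rough complement to eyeballing the plots, the effect of a given cutoff pair can be estimated on a sample of quality strings. The sketch below is a minimal illustration, not zUMIs code: the function names are hypothetical, Phred+33 encoding is assumed, and the rule "discard a read once num_bases or more bases in the range fall below phred" is an assumption about how the filter behaves that should be checked against the zUMIs documentation.

```python
def low_quality_count(qual, ranges, phred, offset=33):
    """Count bases below `phred` within 1-based, inclusive `ranges` of a quality string."""
    count = 0
    for start, end in ranges:
        for ch in qual[start - 1:end]:
            if ord(ch) - offset < phred:
                count += 1
    return count


def passes(qual, ranges, num_bases, phred):
    """Assumed rule: keep the read while fewer than `num_bases` bases are below `phred`."""
    return low_quality_count(qual, ranges, phred) < num_bases


def pass_fraction(quals, ranges, num_bases, phred):
    """Fraction of quality strings that survive the assumed filter."""
    if not quals:
        return 0.0
    kept = sum(passes(q, ranges, num_bases, phred) for q in quals)
    return kept / len(quals)


# Toy quality strings for a 10-base UMI: 'I' is Q40 and '#' is Q2 in Phred+33.
sample = ["IIIIIIIIII", "II#IIIIIII", "##########"]
umi_range = [(1, 10)]

strict = pass_fraction(sample, umi_range, num_bases=1, phred=20)
lenient = pass_fraction(sample, umi_range, num_bases=4, phred=20)
print(f"strict: {strict:.2f}, lenient: {lenient:.2f}")
```

Running this on a few hundred thousand real quality strings for the BC and UMI ranges would show how many reads each setting keeps, which makes the judgement call a little less arbitrary.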

Best, Christoph

mdhfz89 commented 3 years ago

Hi Christoph,

That was exactly what I did for the "dephasing" step that I described in my previous query. Before using zUMIs, I used cutadapt to discard read pairs that did not match a specific list of expected barcodes (BC1), treated as anchored 3' adapters in read2, since that is where the barcodes and UMI are located for microSPLIT. The structure of read2 for microSPLIT (from 5' to 3') is:

UMI-spacer-BC3-spacer-BC2-spacer-BC1

That was the reason why I thought I could use the more stringent default cutoffs. But you're right that I should consider less stringent cutoffs as well. Thanks again for your input and help.
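
For reference, here is how that read2 layout lines up with the base_definition in the yaml posted at the top of this issue (UMI(1-10) and BC(11-18,49-56,79-86)). This is only a sketch with hypothetical helper names and a synthetic read. Mapping the three BC ranges to BC3, BC2 and BC1 in that order is inferred from the 5' to 3' layout above, not taken from the zUMIs source.

```python
def parse_ranges(spec):
    """Turn a string like '11-18,49-56,79-86' into [(11, 18), (49, 56), (79, 86)]."""
    return [tuple(int(x) for x in part.split("-")) for part in spec.split(",")]


def extract(seq, ranges):
    """Concatenate the 1-based, inclusive ranges of `seq` in the order given."""
    return "".join(seq[start - 1:end] for start, end in ranges)


UMI_RANGES = parse_ranges("1-10")
BC_RANGES = parse_ranges("11-18,49-56,79-86")

# Synthetic read2: UMI, BC3, 30-base spacer, BC2, 22-base spacer, BC1 (86 bases in total).
read2 = "ACGTACGTAC" + "AAAAAAAA" + "N" * 30 + "CCCCCCCC" + "N" * 22 + "GGGGGGGG"

umi = extract(read2, UMI_RANGES)
barcode = extract(read2, BC_RANGES)
print(umi, barcode)
```

The concatenated barcode here is BC3 followed by BC2 and BC1, which is the order to keep in mind when comparing against a list of expected barcode combinations.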

Cheers, Hafiz