sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

BUG: unable to run zUMI's #325

Closed LJK1991 closed 2 years ago

LJK1991 commented 2 years ago

I am trying to run zUMIs on a SPLiT-seq dataset, but several errors occur and no results are produced. I cannot figure out the source.

I have created a conda environment with all the recommended settings from the Wiki. Additionally, I have tried running zUMIs in the provided conda channel with the '-c' option. Both give the same result.

The image below shows the error messages produced while running zUMIs. [image] The script continues to run but produces errors at every step, most likely because something goes wrong at the beginning.

I have traced the first error (sh: 1: Syntax error: Unterminated quoted string) to fqfilter_v2.pl, line 92: $oriBase = basename $oriF'; Removing the ' from the script gets rid of that specific error but none of the errors below it.

For the next error, I am not sure whether an earlier failure is the reason the 'XC' object is not found or whether there is something wrong with the XC object itself.

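The YAML file I use is here:

project: zUMI_benchmark
sequence_files:
  file1:
    name: /media/draco/lucask/AlgorithmTest/SRR6750041_1.fastq
    base_definition: (1-66)
  file2:
    name: /media/draco/lucask/AlgorithmTest/SRR6750041_2.fastq
    base_definition:
    - BC(11-18,48-56,87-94)
    - UMI(1-10)
reference:
  STAR_index: /media/draco/lucask/genomes/mouse/M28_ALL/M28_ALL_STAR2-5-4/
  GTF_file: /media/draco/lucask/genomes/mouse/M28_ALL/gencode.vM28.basic.annotation.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /media/draco/lucask/AlgorithmTest/zUMI/
num_threads: 8
mem_limit: 0
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: /media/draco/lucask/AlgorithmTest/zUMI/allBC_Whitelist.txt
  automatic: no
  BarcodeBinning: 2
  nReadsperCell: 100
  barcode_sharing: /media/draco/lucask/AlgorithmTest/zUMI/Shared_BC.txt
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools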

The whitelist consists of all possible barcode combinations from the SPLiT-seq protocol (~880k possibilities), and Shared_BC.txt consists of the randHex and OligodT barcodes that should be paired, formatted as described in the wiki.
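For readers following along, a whitelist of this kind can be sketched as the Cartesian product of the per-round barcode lists. This is only an illustration: the function name `build_whitelist` and the toy barcodes are hypothetical, and the concatenation order is an assumption that has to match the order of the BC ranges in `base_definition`. With 96 barcodes in each of three rounds, the product has 96**3 = 884,736 entries, which is consistent with the ~880k mentioned above.

```python
from itertools import product


def build_whitelist(round1, round2, round3):
    """Return every concatenation of one barcode from each round.

    Assumption: barcodes are joined in the same order as the BC ranges
    appear in the zUMIs base_definition; adjust the argument order if not.
    """
    return ["".join(parts) for parts in product(round1, round2, round3)]


if __name__ == "__main__":
    # Toy barcodes for illustration only, not real SPLiT-seq sequences.
    demo = build_whitelist(["AAAA", "CCCC"], ["GGGG"], ["TTTT", "ACGT"])
    for barcode in demo:
        print(barcode)
```

Writing one barcode per line to a text file would then give a file in the shape expected for `barcode_file`, under the assumptions stated above.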

I have used different STAR indexes, built with both 2.5.4 (the recommended version) and 2.7.10 (the latest).

I am not sure what I am doing wrong. Maybe it is something simple that I am just missing. Could you please help me?

Kind regards, Lucas Kuijpers

cziegenhain commented 2 years ago

Hi,

Please gzip your fastq files, plain fastq file mode is deprecated.

The rest should hopefully just be downstream errors. Best, C
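A minimal sketch of the compression step, assuming the plain FASTQ files fit the usual one-file-per-read layout. The helper name `gzip_fastq` is hypothetical; it only uses the Python standard library, and a command-line tool such as pigz or gzip would do the same job.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(fastq_path):
    """Compress a plain-text FASTQ to <name>.gz next to it and return the new path.

    The original file is left in place; delete it manually once the
    compressed copy has been verified.
    """
    src = Path(fastq_path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        # Stream in chunks so large FASTQ files are never loaded into memory.
        shutil.copyfileobj(fin, fout)
    return dst
```

After compressing, the `name:` entries under `sequence_files` in the YAML need to point at the new `.fastq.gz` paths.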


LJK1991 commented 2 years ago

Using the zipped files did remove the errors, but now I get new ones.

You provided these parameters:
 YAML file: /media/draco/lucask/AlgorithmTest/zUMI/zUMI_benchmark.yaml
 zUMIs directory:       /media/draco/lucask/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   0
 zUMIs version 2.9.7c 

vr jul 29 14:01:40 CEST 2022
WARNING: The STAR version used for mapping is 2.7.10a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.10a.
Filtering...
vr jul 29 14:22:20 CEST 2022
[1] "Warning! None of the annotated barcodes were detected."
[1] "Continuing with top 100 barcodes instead..."
[1] "14883152 reads were assigned to barcodes that do not correspond to intact cells."
[1] "Found 2771 daughter barcodes that can be binned into 99 parent barcodes."
[1] "Binned barcodes correspond to 2337690 reads."
Warning message:
In BCbin(bccount_file = paste0(opt$out_dir, "/", opt$project, ".BCstats.txt"),  :
  NAs introduced by coercion
Mapping...
[1] "2022-07-29 14:24:26 CEST"
!!!!! WARNING:  Could not ls NA

EXITING: because of fatal INPUT file error: could not open read file: NA
SOLUTION: check that this file 

Again, I am not sure what is going wrong. The above output is followed by more downstream errors. I have provided the YAML again:

project: zUMI_benchmark
sequence_files:
  file1:
    name: /media/draco/lucask/AlgorithmTest/SRR6750041_1.fastq.gz
    base_definition: 
    - cDNA(1-100)
  file2:
    name: /media/draco/lucask/AlgorithmTest/SRR6750041_2.fastq.gz
    base_definition:
    - BC(11-18,49-56,86-94)
    - UMI(1-10)
reference:
  STAR_index: /media/draco/lucask/genomes/mouse/M28_ALL/M28_ALL_zUMI/
  GTF_file: /media/draco/lucask/genomes/mouse/M28_ALL/gencode.vM28.basic.annotation.gtf
  additional_STAR_params: ''
  additional_files: ~
out_dir: /media/draco/lucask/AlgorithmTest/zUMI/
num_threads: 8
mem_limit: 0
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: /media/draco/lucask/AlgorithmTest/zUMI/allBC_Whitelist.txt
  automatic: no
  BarcodeBinning: 2
  nReadsperCell: 100
  barcode_sharing: /media/draco/lucask/AlgorithmTest/zUMI/Shared_BC.txt
counting_opts:
  introns: yes
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools
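One quick sanity check on a config like this is to compute the length of each BC segment and compare the total with the length of the entries in the whitelist. The sketch below is not part of zUMIs; `parse_base_definition` and `segment_lengths` are hypothetical helpers that only parse strings of the form shown in the YAML. For `BC(11-18,49-56,86-94)` the segments are 8, 8 and 9 bases (25 in total), whereas a definition such as `BC(11-18,48-56,87-94)` gives 8, 9 and 8.

```python
import re


def parse_base_definition(entry):
    """Split e.g. 'BC(11-18,49-56,86-94)' into ('BC', [(11, 18), (49, 56), (86, 94)])."""
    match = re.fullmatch(r"(\w+)\(([\d,\-]+)\)", entry.strip())
    if match is None:
        raise ValueError(f"unrecognised base_definition entry: {entry!r}")
    label, ranges = match.groups()
    spans = []
    for part in ranges.split(","):
        start, end = (int(x) for x in part.split("-"))
        spans.append((start, end))
    return label, spans


def segment_lengths(entry):
    """Return the number of bases covered by each range (positions are inclusive)."""
    _, spans = parse_base_definition(entry)
    return [end - start + 1 for start, end in spans]


if __name__ == "__main__":
    for definition in ("BC(11-18,49-56,86-94)", "UMI(1-10)"):
        lengths = segment_lengths(definition)
        print(definition, lengths, "total:", sum(lengths))
```

If the total does not equal the length of every line in the barcode whitelist, or the per-segment lengths do not match the lengths of the individual round barcodes, no whitelist entry can match the extracted barcodes exactly.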

The YAMLerror.log is empty:

$file1
NULL

$file2
NULL

[1] ""
[1] ""
[1] "" ""
[1] ""
[1] ""
[1] "" ""
[1] "NULL" "NULL"
$file1
NULL

$file2
NULL

$file1
NULL

$file2
NULL

[1] 0

The data set I am using is https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3017260, from the original SPLiT-seq paper. There are supposedly only 100 nuclei in it. Is it possible to use zUMIs on such a small data set? I am trying to compare the performance of several pipelines that support SPLiT-seq data.

Thank you for helping.

LJK1991 commented 2 years ago

Hi cziegenhain, I have managed to reduce the errors by not using the binning option, as it seems to be the source of the first error. However, in both cases, with or without binning, I get the same error messages. With binning: [image] Without binning: [image]

Additionally, during the mapping I get this error: [image] It says my .fasta file is wrong. I have used several other tools and aligners, which had no problem with it at all. Hopefully you can help.

Thanks in advance and kind regards, Lucas

cziegenhain commented 2 years ago

Hi,

Sorry for the slow reply. It seems to me that something is wrong in the barcode_sharing settings. It may be good to double-check the input there, or to try running without the feature set as a test.

As for the mapping error using STAR, this is a common issue with SRA data, please check here how to fastq-dump and avoid the oddly formatted fastq header lines: https://github.com/sdparekh/zUMIs/wiki/Reprocessing-of-public-data
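Before re-downloading, it can help to look at the header lines actually present in the files. The sketch below is an assumption-laden illustration and not the check that zUMIs or STAR performs: the regular expression only flags headers that start with an SRA run accession followed by a read index and whitespace (for example `@SRR6750041.1 1 length=66`), and the helper names are hypothetical. The linked wiki page remains the reference for which header format to produce.

```python
import gzip
import re

# Assumption: a default SRA-style header begins with '@', a run accession
# (SRR/ERR/DRR plus digits), a dot, the spot number, then whitespace.
SRA_HEADER = re.compile(r"^@[SED]RR\d+\.\d+\s")


def first_headers(fastq_gz_path, n=3):
    """Return the first n header lines of a gzipped FASTQ (4 lines per record)."""
    headers = []
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                headers.append(line.rstrip("\n"))
                if len(headers) == n:
                    break
    return headers


def looks_like_sra_default(header):
    """True if the header matches the SRA-style pattern assumed above."""
    return bool(SRA_HEADER.match(header))
```

Running `first_headers` on both read files and comparing the output side by side also shows whether the read names in the two files are consistent with each other.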

Best, Christoph