Processing microSPLIT data from SRA using zUMI

mdhfz89 commented 3 years ago

Hi,

I'm facing some problems with zUMI when trying to process microSPLIT reads downloaded from SRA. I'm hoping that I can get some clues on how to resolve the issue I'm facing here. microSPLIT (doi: 10.1126/science.aba5257) is the bacterial version of SPLITseq that uses combinatorial indexing. In the paper, they worked on Bacillus subtilis so the STAR commands were set up with that in mind. The fasta and gff file used are attached here.

bsub.zip

To start with, I downloaded the data using and then subsampled the reads to work with a smaller test dataset

fastq-dump -v --split-files --origfmt --defline-qual '+' --gzip SRR11940660
for x in *.fastq.gz; do gunzip -k ${x}; done
seqtk sample -s1808 SRR11940660_1.fastq 10000 > sub1.fastq
seqtk sample -s1808 SRR11940660_2.fastq 10000 > sub2.fastq

I then dephased the reads by removing any pairs that the right 8 bases of read 2 that do not match the expected barcode sequence (the expected index is also attached). Since cutadapt outputs each match into an individual file based on the name of the expected sequence in "BC1_dephasing.fasta", I then concatenated the matches. I used a wrapper script to achieve this.

BC1_dephasing.zip

yes Y | . dephasingWithCutadapt_ubuntu_nostops.sh sub_2.fastq sub_1.fastq

#variables here
OUTFOLDER="01_cutadapt_dephasing"
IN1=${1}
IN2=${2}

#to activate cutadapt conda environment
[ -z "${CONDA_EXE}" ] && echo "Error, missing \$CONDA_EXE !" && exit 1
CONDA_BASE=$(${CONDA_EXE} info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate cutadaptenv || exit 1

#dephase reads
cutadapt -a file:BC1_dephasing.fasta \
    -e 0.1 --no-indels --action=none \
    --untrimmed-output=./$OUTFOLDER/${IN1%.fastq}_nomatch.fastq --untrimmed-paired-output=./$OUTFOLDER/${IN2%.fastq}_nomatch.fastq \
    --pair-filter any --report=minimal\
    -o ./$OUTFOLDER/${IN1%.fastq}_{name}.fastq -p ./$OUTFOLDER/${IN2%.fastq}_{name}.fastq \
    ${IN1} ${IN2}
conda deactivate

#concatenate and remove demultiplexed reads
cd $OUTFOLDER
cat ${IN2%.fastq}_BC* > ${IN2%.fastq}_filtered.fastq
rm ${IN2%.fastq}_BC*
cat ${IN1%.fastq}_BC* > ${IN1%.fastq}_filtered.fastq
rm ${IN1%.fastq}_BC*
mkdir unwantedReads
mv *_nomatch.fastq ./unwantedReads
mv *_filtered.fastq ../
cd ..

The problem comes when I'm trying to run zUMI with these files and I think one error sends the whole process into being unable to complete.

The stdout that I'm seeing is:

bash /home/hafiz/tools/zUMIs/zUMIs.sh -d /home/hafiz/tools/zUMIs -y microsplitTest_ubuntu.yaml


 You provided these parameters:
 YAML file: microsplitTest_ubuntu.yaml
 zUMIs directory:       /home/hafiz/tools/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   6
 zUMIs version 2.9.7 

Tue Aug 17 09:43:57 +08 2021
WARNING: The STAR version used for mapping is 2.7.9a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.9a.
Filtering...
sh: 1: Syntax error: Unterminated quoted string
sh: 1: Syntax error: Unterminated quoted string
Tue Aug 17 09:43:58 +08 2021
Error in eval(bysub, parent.frame(), parent.frame()) : 
  object 'XC' not found
Calls: cellBC -> [ -> [.data.table -> eval -> eval
In addition: Warning message:
In data.table::fread(bccount_file, header = FALSE, col.names = c("XC",  :
  File '/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest/microsplitTest_ubuntu.BCstats.txt' has size 0. Returning a NULL data.table.
Execution halted
Mapping...
[1] "2021-08-17 09:43:59 +08"
Warning message:
NAs introduced by coercion 
Warning message:
In data.table::fread(cmd = paste(samtools, "view", filtered_bams[1],  :
  File '/tmp/Rtmp6uTshx/file2eec2c1d95e9' has size 0. Returning a NULL data.table.

Exiting because of *FATAL ERROR*: could not create FIFO file /media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest/microsplitTest_ubuntu.filtered.tagged._STARtmp/tmp.fifo.read1
SOLUTION: check the if run directory supports FIFO files.
If run partition does not support FIFO (e.g. Windows partitions FAT, NTFS), re-run on a Linux partition, or point --outTmpDir to a Linux partition.

Aug 17 09:43:59 ...... FATAL ERROR, exiting
Tue Aug 17 09:43:59 +08 2021
Counting...
[1] "2021-08-17 09:44:06 +08"
Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes_binned.txt")) : 
  File '/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/microsplitTest_ubuntukept_barcodes_binned.txt' does not exist or is non-readable. getwd()=='/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest'
Execution halted
Tue Aug 17 09:44:06 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/expression/microsplitTest_ubuntu.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Tue Aug 17 09:44:08 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-08-17 09:44:08 +08"
Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project,  : 
  File '/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/microsplitTest_ubuntukept_barcodes.txt' does not exist or is non-readable. getwd()=='/media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest'
Execution halted
Tue Aug 17 09:44:12 +08 2021

So I think that filtering step failed due to some syntax error but I tried to figure out what was going on and I can't seem to figure it out. I thought that the version of zUMI I downloaded had a problem so I re-downloaded it but am still facing the same issue.

The yaml file I used is

project: microsplitTest_ubuntu
sequence_files:
  file1:
    name: sub_1_filtered.fastq
    base_definition: 
      - cDNA(1-76)
  file2:
    name: sub_2_filtered.fastq
    base_definition:
      - BC(11-18,49-56,79-86)
      - UMI(1-10)
reference:
  STAR_index: /media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_bsubgenome
  GTF_file: /media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/bsub.gff3
  additional_STAR_params: --alignIntronMax 1 --genomeSAindexNbases 10
  additional_files: ~
out_dir: /media/hafiz/T3/microSPLiT/reads/00_pipeline_testing/03_zUMItest
num_threads: 2
mem_limit: 6
filter_cutoffs:
  BC_filter:
    num_bases: 1
    phred: 20
  UMI_filter:
    num_bases: 1
    phred: 10
barcodes:
  barcode_num: null
  barcode_file: null
  barcode_sharing: null
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: 100
  demultiplex: yes
counting_opts:
  introns: no
  downsampling: 0
  strand: 0
  Ham_Dist: 0
  write_ham: no
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools

Looking forward to any suggestions you might have. If you need any files, I'll try to provide them here.

cziegenhain commented 3 years ago

Hi,

I'm not really familiar with microSPLIT and lack the time to dive into your description of the cutadapt / dephasing steps so will not comment on that now.

I know we have had issues with fastq input that is not gzipped. That would be my first recommendation to change, please gzip the input fastq files.

Some other comments on your yaml file (not related to this particular error but might lead to issues later):

you are providing a gff3 annotation, that will not work. please provide a correctly formatted gtf file.
you are filtering BC and UMI sequences with the very strict default, given that split-seq has the BC/UMI in read2 and the sequences are very long (especially BC), you should select much more lenient cutoffs there
Depending on the number of reads & cells present in your test dataset, the selected number of CPU threads and memory might lead to quite slow runtime

Best, Christoph

mdhfz89 commented 3 years ago

Hi Christoph,

I did what you recommended and gzipped my files and I think I made the BC cutoff more lenient. Also converted the gff3 to a gtf format using agat.

agat_convert_sp_gff2gtf.pl --gff bsub.gff3 -o bsub.gtf

However I'm still facing some kind of error here.

bash /home/hafiz/tools/zUMIs/zUMIs.sh -d /home/hafiz/tools/zUMIs -y microsplitTest_ubuntu.yaml

 You provided these parameters:
 YAML file: microsplitTest_ubuntu.yaml
 zUMIs directory:       /home/hafiz/tools/zUMIs
 STAR executable        STAR
 samtools executable        samtools
 pigz executable        pigz
 Rscript executable     Rscript
 RAM limit:   24
 zUMIs version 2.9.7 

Fri Aug 20 10:39:18 +08 2021
WARNING: The STAR version used for mapping is 2.7.9a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.9a.
Filtering...
Fri Aug 20 10:39:22 +08 2021
Error in uik(bccount$cellindex, bccount$cs/1000) : 
  Method is not applicable for such a small vector. Please give at least a 5 numbers vector
Calls: cellBC -> .cellBarcode_unknown -> .FindBCcut -> uik
Execution halted
Mapping...
[1] "2021-08-20 10:39:23 +08"
Warning message:
NAs introduced by coercion 
    STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_bsubgenome --sjdbGTFfile /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/bsub.gtf --runThreadN 8 --sjdbOverhang 73 --readFilesType SAM SE --alignIntronMax 1 --genomeSAindexNbases 10 --twopassMode Basic --readFilesIn /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/.tmpMerge//microsplitTest_ubuntu.microsplitTest_ubuntuaa.filtered.tagged.bam --outFileNamePrefix /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest/microsplitTest_ubuntu.filtered.tagged.
    STAR version: 2.7.9a   compiled: 2021-05-04T09:43:56-0400 vega:/home/dobin/data/STAR/STARcode/STAR.master/source
Aug 20 10:39:24 ..... started STAR run
Aug 20 10:39:24 ..... loading genome
Aug 20 10:39:24 ..... processing annotations GTF
Aug 20 10:39:24 ..... inserting junctions into the genome indices
Aug 20 10:39:24 ..... started 1st pass mapping
Aug 20 10:39:27 ..... finished 1st pass mapping
Aug 20 10:39:27 ..... inserting junctions into the genome indices
Aug 20 10:39:27 ..... started mapping
Aug 20 10:39:30 ..... finished mapping
Aug 20 10:39:30 ..... finished successfully
Fri Aug 20 10:39:30 +08 2021
Counting...
[1] "2021-08-20 10:39:38 +08"
Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes_binned.txt")) : 
  File '/home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/microsplitTest_ubuntukept_barcodes_binned.txt' does not exist or is non-readable. getwd()=='/home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest'
Execution halted
Fri Aug 20 10:39:38 +08 2021
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '/home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/expression/microsplitTest_ubuntu.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Fri Aug 20 10:39:40 +08 2021
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2021-08-20 10:39:40 +08"
Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project,  : 
  File '/home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest/zUMIs_output/microsplitTest_ubuntukept_barcodes.txt' does not exist or is non-readable. getwd()=='/home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest'
Execution halted
Fri Aug 20 10:39:44 +08 2021

Below is the new yaml file

project: microsplitTest_ubuntu
sequence_files:
  file1:
    name: sub_1_filtered.fastq.gz
    base_definition: 
      - cDNA(1-76)
  file2:
    name: sub_2_filtered.fastq.gz
    base_definition:
      - BC(11-18,49-56,79-86)
      - UMI(1-10)
reference:
  STAR_index: /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_bsubgenome
  GTF_file: /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/bsub.gtf
  additional_STAR_params: --alignIntronMax 1 --genomeSAindexNbases 10
  additional_files: ~
out_dir: /home/hafiz/Documents/Hafiz/microSPLiT/reads/00_pipeline_testing/03_zUMItest
num_threads: 10
mem_limit: 24
filter_cutoffs:
  BC_filter:
    num_bases: 2
    phred: 20
  UMI_filter:
    num_bases: 2
    phred: 10
barcodes:
  barcode_num: null
  barcode_file: null
  barcode_sharing: null
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: 100
  demultiplex: yes
counting_opts:
  introns: no
  downsampling: 0
  strand: 0
  Ham_Dist: 0
  write_ham: no
  velocyto: no
  primaryHit: yes
  twoPass: yes
make_stats: yes
which_Stage: Filtering
Rscript_exec: Rscript
STAR_exec: STAR
pigz_exec: pigz
samtools_exec: samtools

Thank you so much for your suggestions previously.

cziegenhain commented 3 years ago

Hi,

It's the automatic cell barcode detection that fails. This happens when there is basically no reads left after filtering or not at least a few distinct barcode sequences.

Maybe you went a bit small with your test data set?

mdhfz89 commented 3 years ago

Hi Christoph,

You're right. I retried it with a bigger subset and it works. Thank you so much for your help.

Cheers

sdparekh / zUMIs

Processing microSPLIT data from SRA using zUMI #279