sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Reproducing HCA data using zUMIs #272

Closed HaniJieunKim closed 3 years ago

HaniJieunKim commented 3 years ago

Hi!

Thank you very much for creating and maintaining zUMIs. I have successfully run the test data and also the fibroblast data from the original Smart-seq3 paper. However, when I try running the HCA data, I get an error (see below for the YAML file and log). I would be very grateful for your help!

project: HCA_test
sequence_files:
  file1:
    name: /home/data/data2/Smart-seq3/raw/HCA.R1.fastq.gz
    base_definition:
    - cDNA(23-150)
    - UMI(12-19)
    find_pattern: ATTGCGCAATG
  file2:
    name: /home/data/data2/Smart-seq3/raw/HCA.R2.fastq.gz
    base_definition:
    - cDNA(1-150)
  file3:
    name: /home/data/data2/Smart-seq3/raw/HCA.I1.fastq.gz
    base_definition:
    - BC(1-8)
  file4:
    name: /home/data/data2/Smart-seq3/raw/HCA.I2.fastq.gz
    base_definition:
    - BC(1-8)
reference:
  STAR_index: /home/data/data2/genomeDir/STARgenomes/human2
  GTF_file: /home/data/data2/genomeDir/refdata-gex-GRCh38-2020-A/genes/genes.gtf
  exon_extension: no
  extension_length: 0
  scaffold_length_min: 0
  additional_STAR_params: '--clip3pAdapterSeq CTGTCTCTTATACACATCT --limitSjdbInsertNsj 2000000 --outFilterIntronMotifs --RemoveNoncanonicalUnannotated'
  additional_files: ~
out_dir: /home/data/data2/normalisation/zUMI/test4
num_threads: 50
mem_limit: 200
filter_cutoffs:
  BC_filter:
    num_bases: 3
    phred: 20
  UMI_filter:
    num_bases: 3
    phred: 20
barcodes:
  barcode_num: ~
  barcode_file: ~
  barcode_sharing: ~
  automatic: yes
  BarcodeBinning: 1
  nReadsperCell: 100
counting_opts:
  introns: yes
  intronProb: no
  downsampling: '0'
  strand: 0
  Ham_Dist: 1
  velocyto: no
  primaryHit: yes
  multi_overlap: no
  twoPass: no
  demultiplex: no
make_stats: yes
which_Stage: Filtering
samtools_exec: samtools
pigz_exec: pigz
#STAR_exec: /home/ubuntu/STAR/bin/Linux_x86_64/STAR
STAR_exec: STAR
Rscript_exec: Rscript
zUMIs_directory: /home/ubuntu/zUMIs

Using miniconda environment for zUMIs!
 note: internal executables will be used instead of those specified in the YAML file!

 You provided these parameters:
 YAML file:     script/HCA_test_v1.yaml
 zUMIs directory:               /home/ubuntu/zUMIs
 STAR executable                STAR
 samtools executable            samtools
 pigz executable                pigz
 Rscript executable             Rscript
 RAM limit:   200
 zUMIs version 2.9.6 

Fri Jul 16 04:03:11 UTC 2021
WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a.
Filtering...
pigz: skipping: /home/data/data2/Smart-seq3/raw/HCA.R2.fastq.gz: corrupted -- incomplete deflate data
pigz: abort: internal threads error
pigz: skipping: /home/data/data2/Smart-seq3/raw/HCA.R1.fastq.gz: corrupted -- incomplete deflate data
pigz: abort: internal threads error
pigz: skipping: /home/data/data2/normalisation/zUMI/test4/zUMIs_output/.tmpMerge/HCA.R2.fastqHCA_testbt.gz does not exist
@A00187:188:HM3HKDSXX:4:2657:24532:15076

ERROR! Fastq files are not in the same order.
 Make sure to provide reads in the same order.

@A00187:188:HM3HKDSXX:4:2657:24551:15076

ERROR! Fastq files are not in the same order.
 Make sure to provide reads in the same order.

These are the FASTQ files downloaded.

-rw-rw-r-- 1 ubuntu ubuntu  29660183738 May 15  2020 HCA.I1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu  30477852902 May 15  2020 HCA.I2.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 227299256599 Jul 15 07:49 HCA.R1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 227987249699 Jul 15 07:49 HCA.R2.fastq.gz

Many thanks for your time and help!

Best regards,

Hani

cziegenhain commented 3 years ago

Hi,

As you can read in the log, one of your input fastq files is incomplete/faulty.

pigz: skipping: /home/data/data2/Smart-seq3/raw/HCA.R2.fastq.gz: corrupted -- incomplete deflate data
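
One way to confirm this independently of zUMIs is to decompress each file to the end and see whether the gzip stream terminates cleanly. The following is a minimal sketch, not part of zUMIs: the function name `gzip_is_complete` is hypothetical, and it assumes Python 3.8 or later. A truncated download typically surfaces as an `EOFError` when the stream ends before its end-of-stream marker.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Hypothetical helper for illustration; it reads the whole file in
    chunks and discards the output, so memory use stays small.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read until the decompressor reports end of data.
            while handle.read(chunk_size):
                pass
    except (EOFError, zlib.error, gzip.BadGzipFile):
        # EOFError: stream ended early (typical for a truncated download).
        # zlib.error / BadGzipFile: corrupted data or not a gzip file.
        return False
    return True


# Example call (path taken from the YAML above):
# gzip_is_complete("/home/data/data2/Smart-seq3/raw/HCA.R2.fastq.gz")
```

For files of this size (over 200 GB each) a full pass in Python will be slow, so this is meant to show the idea rather than to be the fastest route.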

Best, C

HaniJieunKim commented 3 years ago

Thanks Christoph, I have downloaded the fastq files again, and I don’t get any obvious errors when downloading.

Is there a way I could troubleshoot? When I look at the fastq files, there isn’t anything that is obviously wrong (I've included images below).

Thank you very much.

Best regards,

Hani

HCA.R1.fastq.gz image

HCA.R1.fastq.gz image

HCA.I1.fastq.gz image

HCA.I2.fastq.gz image

cziegenhain commented 3 years ago

For example, you can count the number of reads in each of the files that you downloaded to confirm they all match. I'm sure ArrayExpress also has MD5 checksums to verify your download.
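
Both checks can be sketched in a few lines of Python. This is an illustration only: the helper names `count_fastq_reads` and `md5_of_file` are hypothetical, and the read count assumes standard four-line FASTQ records. The MD5 digest is computed over the compressed file as downloaded, which is what a published checksum would normally refer to.

```python
import gzip
import hashlib


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record.

    A truncated file raises EOFError here, which is itself a sign of an
    incomplete download.
    """
    n_lines = 0
    with gzip.open(path, "rb") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file exactly as stored on disk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example usage (file names as listed earlier in this thread):
# counts = {f: count_fastq_reads(f) for f in
#           ["HCA.R1.fastq.gz", "HCA.R2.fastq.gz",
#            "HCA.I1.fastq.gz", "HCA.I2.fastq.gz"]}
# All four counts should be identical for a complete download.
```

If the four read counts differ, or a digest does not match the published checksum, the download is incomplete and should be repeated before running zUMIs.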

Closing this issue as it is not related to zUMIs.

HaniJieunKim commented 3 years ago

Great!

Thank you very much.