kundajelab / atac_dnase_pipelines

ATAC-seq and DNase-seq processing pipeline
BSD 3-Clause "New" or "Revised" License
159 stars 81 forks source link

Error with adapter auto detection #103

Closed easwaranramamurthy closed 6 years ago

easwaranramamurthy commented 6 years ago

Hi,

When I run the following command:

bds atac.bds -nth 1 -type atac-seq -species hg19 -title atac_process -auto_detect_adapter -fastq1_1 file_1.fastq -fastq1_2 file_2.fastq -out_dir processed_atac

I get the following error:

Fatal error: /path/to/atac.bds, line 737, pos 6. Task/s failed.
atac.bds, line 82 :     main()
atac.bds, line 85 :     void main() { // atac pipeline starts here
atac.bds, line 95 :             do_align()
atac.bds, line 394 :    void do_align() {
atac.bds, line 420 :            for (int rep=1; rep<=get_num_rep(); rep++) {
atac.bds, line 422 :                    if ( no_par ) do_align( rep, nth_rep{rep} )
atac.bds, line 431 :    void do_align( int rep, int nth_rep ) {
atac.bds, line 433 :            if ( is_se( rep ) )     align_SE( rep, nth_rep )
atac.bds, line 434 :            else                    align_PE( rep, nth_rep )
atac.bds, line 695 :    void align_PE( int rep, int nth_rep ) {
atac.bds, line 708 :            if ( is_input_fastq( rep ) ) {
atac.bds, line 722 :                    for ( string pool_id : fastqs_pair1.keys() ) {
atac.bds, line 724 :                            if ( !(adapters_pair1.hasKey(pool_id) && adapters_pair2.hasKey(pool_id)) \
atac.bds, line 730 :                            else {
atac.bds, line 732 :                                    if ( !(adapters_pair1.hasKey(pool_id) && adapters_pair2.hasKey(pool_id)) \
atac.bds, line 733 :                                            && auto_detect_adapter ) {
atac.bds, line 737 :                                            wait [tid1, tid2]

Any help with debugging this would be appreciated

leepc12 commented 6 years ago

First of all, -type atac-seq must be replaced with -type atac. Please post a full log. I need more info for debugging.

BTW, did you install conda and dependencies correctly?

easwaranramamurthy commented 6 years ago

Thank you. The confusion arose because the documentation says that it should be atac-seq instead of atac.

$ bds atac.bds
== atac pipeline settings
        -type <string>                   : Type of the pipeline. atac-seq or dnase-seq (default: atac-seq).

Yes, conda and dependencies are installed correctly. Here is the full standard error:

Traceback (most recent call last):
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 70, in <module>
    main()
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 66, in main
    best_adapter = detect_most_likely_adapter(sys.argv[1])
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 40, in detect_most_likely_adapter
    observed_adapters, adapter_cnts, n_obs_adapters = detect_adapters_and_cnts(fname)
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 25, in detect_adapters_and_cnts
    for seq_index, line in enumerate(fp):
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 372, in readline
    return self._buffer.readline(size)
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'@S')
Traceback (most recent call last):
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 70, in <module>
    main()
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 66, in main
    best_adapter = detect_most_likely_adapter(sys.argv[1])
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 40, in detect_most_likely_adapter
    observed_adapters, adapter_cnts, n_obs_adapters = detect_adapters_and_cnts(fname)
  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 25, in detect_adapters_and_cnts
    for seq_index, line in enumerate(fp):
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 372, in readline
    return self._buffer.readline(size)
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 409, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'@S')
Task failed:
        Program & line     : '/path/to/atac/atac_dnase_pipelines/modules/align_trim_adapter.bds', line 48
        Task Name          : 'detect_adapter rep1'
        Task ID            : 'atac.bds.20180417_154414_913/task.align_trim_adapter.detect_adapter_rep1.line_48.id_10'
        Task PID           : '112435'
        Task hint          : 'python3 $(which detect_adapter.py) file_1.fastq > processed_atac/qc/rep1/file_1.adapter.txt; TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0'
        Task resources     : 'cpus: -1  mem: -1.0 B     wall-timeout: 8640000'
        State              : 'ERROR'
        Dependency state   : 'ERROR'
        Retries available  : '1'
        Input files        : '[file_1.fastq]'
        Output files       : '[processed_atac/qc/rep1/file_1.adapter.txt]'
        Script file        : '/path/to/atac.bds.20180417_154414_913/task.align_trim_adapter.detect_adapter_rep1.line_48.id_10.sh'
        Exit status        : '1'
        Program            :

                # SYS command. line 50

                 if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/path/to/atac/atac_dnase_pipelines/.:/path/to/atac/atac_dnase_pipelines/modules:/path/to/atac/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

                # SYS command. line 52

                 python3 $(which detect_adapter.py) file_1.fastq > processed_atac/qc/rep1/file_1.adapter.txt

                # SYS command. line 54

                 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
        StdErr (100000000 lines)  :
                Traceback (most recent call last):
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 70, in <module>
                    main()
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 66, in main
                    best_adapter = detect_most_likely_adapter(sys.argv[1])
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 40, in detect_most_likely_adapter
                    observed_adapters, adapter_cnts, n_obs_adapters = detect_adapters_and_cnts(fname)
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 25, in detect_adapters_and_cnts
                    for seq_index, line in enumerate(fp):
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 372, in readline
                    return self._buffer.readline(size)
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/_compression.py", line 68, in readinto
                    data = self.read(len(byte_view))
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 461, in read
                    if not self._read_gzip_header():
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 409, in _read_gzip_header
                    raise OSError('Not a gzipped file (%r)' % magic)
                OSError: Not a gzipped file (b'@S')

Task failed:
        Program & line     : '/path/to/atac/atac_dnase_pipelines/modules/align_trim_adapter.bds', line 48
        Task Name          : 'detect_adapter rep1'
        Task ID            : 'atac.bds.20180417_154414_913/task.align_trim_adapter.detect_adapter_rep1.line_48.id_11'
        Task PID           : '112441'
        Task hint          : 'python3 $(which detect_adapter.py) file_2.fastq > processed_atac/qc/rep1/file_2.adapter.txt; TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0'
        Task resources     : 'cpus: -1  mem: -1.0 B     wall-timeout: 8640000'
        State              : 'ERROR'
        Dependency state   : 'ERROR'
        Retries available  : '1'
        Input files        : '[file_2.fastq]'
        Output files       : '[processed_atac/qc/rep1/file_2.adapter.txt]'
        Script file        : '/path/to/atac.bds.20180417_154414_913/task.align_trim_adapter.detect_adapter_rep1.line_48.id_11.sh'
        Exit status        : '1'
        Program            :

                # SYS command. line 50

                 if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/path/to/atac/atac_dnase_pipelines/.:/path/to/atac/atac_dnase_pipelines/modules:/path/to/atac/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

                # SYS command. line 52

                 python3 $(which detect_adapter.py) file_2.fastq > processed_atac/qc/rep1/file_2.adapter.txt

                # SYS command. line 54

                 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
        StdErr (100000000 lines)  :
                Traceback (most recent call last):
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 70, in <module>
                    main()
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 66, in main
                    best_adapter = detect_most_likely_adapter(sys.argv[1])
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 40, in detect_most_likely_adapter
                    observed_adapters, adapter_cnts, n_obs_adapters = detect_adapters_and_cnts(fname)
                  File "/path/to/atac/atac_dnase_pipelines/utils/detect_adapter.py", line 25, in detect_adapters_and_cnts
                    for seq_index, line in enumerate(fp):
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 372, in readline
                    return self._buffer.readline(size)
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/_compression.py", line 68, in readinto
                    data = self.read(len(byte_view))
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 461, in read
                    if not self._read_gzip_header():
                  File "/path/to/miniconda3/envs/bds_atac_py3/lib/python3.5/gzip.py", line 409, in _read_gzip_header
                    raise OSError('Not a gzipped file (%r)' % magic)
                OSError: Not a gzipped file (b'@S')

Fatal error: /path/to/atac/atac_dnase_pipelines/atac.bds, line 737, pos 6. Task/s failed.
atac.bds, line 82 :     main()
atac.bds, line 85 :     void main() { // atac pipeline starts here
atac.bds, line 95 :             do_align()
atac.bds, line 394 :    void do_align() {
atac.bds, line 420 :            for (int rep=1; rep<=get_num_rep(); rep++) {
atac.bds, line 422 :                    if ( no_par ) do_align( rep, nth_rep{rep} )
atac.bds, line 431 :    void do_align( int rep, int nth_rep ) {
atac.bds, line 433 :            if ( is_se( rep ) )     align_SE( rep, nth_rep )
atac.bds, line 434 :            else                    align_PE( rep, nth_rep )
atac.bds, line 695 :    void align_PE( int rep, int nth_rep ) {
atac.bds, line 708 :            if ( is_input_fastq( rep ) ) {
atac.bds, line 722 :                    for ( string pool_id : fastqs_pair1.keys() ) {
atac.bds, line 724 :                            if ( !(adapters_pair1.hasKey(pool_id) && adapters_pair2.hasKey(pool_id)) \
atac.bds, line 730 :                            else {
atac.bds, line 732 :                                    if ( !(adapters_pair1.hasKey(pool_id) && adapters_pair2.hasKey(pool_id)) \
atac.bds, line 733 :                                            && auto_detect_adapter ) {
atac.bds, line 737 :                                            wait [tid1, tid2]

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.

The standard out is pasted below:

== git info
Latest git commit               : cb35850349faa552054ae453077881e06a87b1be (Mon Apr 16 16:58:56 2018)
        Reading parameters from section (default) in file(/path/to/atac/atac_dnase_pipelines/default.env)...

== configuration file info
Hostname                        : compute-1-13.local
Configuration file              :
Environment file                : /path/to/atac/atac_dnase_pipelines/default.env

Warning: Maximum # threads (-nth) for a pipeline is <= 1. Turning off parallelization... (-no_par)

== parallelization info
No parallel jobs                : true
Maximum # threads               : 1

== cluster/system info
Walltime (general)              : 5h50m
Max. memory (general)           : 7G
Force to use a system           : local
Process priority (niceness)     : 0
Retiral for failed tasks        : 0
Submit tasks to a cluster queue :
Unlimited cluster mem./walltime : false
Use --acount instead of SLURM partition         : false
Java temporary directory                : ${TMPDIR}

== shell environment info
Conda env.                      : bds_atac
Conda env. for python3          : bds_atac_py3
Conda bin. directory            :

Shell cmd. for init.            : if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/path/to/atac/atac_dnase_pipelines/.:/path/to/atac/atac_dnase_pipelines/modules:/path/to/atac/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for init.(py3)       : if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/path/to/atac/atac_dnase_pipelines/.:/path/to/atac/atac_dnase_pipelines/modules:/path/to/atac/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for fin.             : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0

Cluster task min. len.          : 60

Cluster task delay                      : 0

== output directory/title info
Output dir.                     : processed_atac
Title (prefix)                  : atac_process 
        Reading parameters from section (default) in file(/path/to/atac/atac_dnase_pipelines/default.env)...
        Reading parameters from section (hg19) in file(/path/to/genomes/bds_atac_species.conf)...

== species settings
Species                         : hg19
Species file                    : /path/to/genomes/bds_atac_species.conf

Species name (WashU browser)    : hg19
Ref. genome seq. fasta          : /path/to/genomes/hg19/male.hg19.fa
Chr. sizes file                 : /path/to/genomes/hg19/hg19.chrom.sizes
Black list bed                  : /path/to/genomes/hg19/wgEncodeDacMapabilityConsensusExcludable.bed.gz
Ref. genome seq. dir.           :

== ENCODE accession settings
ENCODE experiment accession     :
ENCODE award RFA                        :
ENCODE assay category           :
ENCODE assay title              :
ENCODE award                            :
ENCODE lab                              :
ENCODE assembly genome          :
ENCODE alias prefix             : KLAB_PIPELINE
ENCODE alias suffix             :

== report settings
URL root for output directory   :
Genome coord. for browser tracks        :

== align multimapping settings
# alignments reported for multimapping  : 0

== align bowtie2 settings
Bowtie2 index                   : /path/to/genomes/hg19/bowtie2_index/male.hg19.fa
Replacement --score-min for bowtie2     :
Walltime (bowtie2)              : 47h
Max. memory (bowtie2)           : 12G
Extra param. (bowtie2)          :
Disable index on memory (bowtie2)       : false

== adapter trimmer settings
Maximum allowed error rate for cutadapt : 0.10
Minimum trim. length for cutadapt -m    : 5
Walltime (adapter trimming)             : 23h
Max. memory (adapter trimming)          : 12G

== postalign bam settings
MAPQ reads rm thresh.           : 30
Rm. tag reads with str.         : chrM
No dupe removal in filtering raw bam    : false
Walltime (bam filter)           : 23h
Max. memory (bam filter)        : 12G
Dup marker                              : picard
Use sambamba markdup (instead of picard)        : false

== postalign bed/tagalign settings
Max. memory for UNIX shuf                       : 12G
No --random-source for UNIX shuf                : false

== postalign cross-corr. analysis settings
Max. memory for UNIX shuf                       : 12G
User-defined cross-corr. peak strandshift       : -1
Extra parameters for cross-corr. analysis       :
Max. memory for cross-corr. analysis            : 15G

== callpeak macs2 settings
Genome size (hs,mm)             : hs
Walltime (macs2)                : 23h
Max. memory (macs2)             : 15G
Cap number of peaks (macs2)     : 300K
Extra parameters for macs2 callpeak             :

== callpeak naiver overlap settings
Bedtools intersect -nonamecheck : false

== IDR settings
Append IDR threshold to IDR out_dir     : false

== ATAQC settings
TSS enrichment bed              : /path/to/genomes/hg19/ataqc/hg19_gencode_tss_unique.bed.gz
DNase bed for ataqc             : /path/to/genomes/hg19/ataqc/reg2map_honeybadger2_dnase_all_p10_ucsc.bed.gz
Promoter bed for ataqc          : /path/to/genomes/hg19/ataqc/reg2map_honeybadger2_dnase_prom_p2.bed.gz
Enhancer bed for ataqc          : /path/to/genomes/hg19/ataqc/reg2map_honeybadger2_dnase_enh_p2.bed.gz
Reg2map for ataqc                       : /path/to/genomes/hg19/ataqc/dnase_avgs_reg2map_p10_merged_named.pvals.gz
Reg2map_bed for ataqc           : /path/to/genomes/hg19/ataqc/reg2map_honeybadger2_dnase_all_p10_ucsc.bed.gz
Roadmap metadata for ataqc      : /path/to/genomes/hg19/ataqc/eid_to_mnemonic.txt
Max. memory for ATAQC                   : 20G
Walltime for ATAQC                      : 47h

== atac pipeline settings
Type of pipeline                        : atac-seq
Align only                              : false
# reads to subsample replicates (0 if no subsampling)   : 0
# reads to subsample for cross-corr. analysis   : 25000000
No pseudo replicates                    : false
No ATAQC (advanced QC report)           : false
No Cross-corr. analysis                 : false
Use CSEM for alignment                  : false
Smoothing window for MACS2              : 150
DNase Seq                               : false
IDR threshold                           : 0.1
Force to use ENCODE3 parameter set      : false
Force to use ENCODE parameter set       : false
Disable genome browser tracks   : false
p-val thresh. for overlapped peaks      : 0.01
MACS2 p-val thresh. for peaks   : 0.01
MACS2 p-val thresh. for BIGWIGs         : 0.01
Enable IDR on called peaks      : false
Automatically find/trim adapters        : true

== checking atac parameters ...

Checking parameters and data files for ATAQC.

== checking adapters to be trimmed ...
Rep1 R1 adapters (PE) :
        00: automatically detected and trimmed.
Rep1 R2 adapters (PE) :
        00: automatically detected and trimmed.

== checking input files ...

Rep1 fastq (PE) :
        file_1.fastq
        file_2.fastq
Distributing 1 to ...
{1=1}
leepc12 commented 6 years ago

Oops, sorry about that. Pipeline actually works with both atac or atac-seq (also for dnase and dnase-seq). So README is correct.

Fastqs must be gzipped first. Please gzip it and try again.

easwaranramamurthy commented 6 years ago

Thanks a lot! Gzipping the fastqs worked.