Data set - Githubissues

rehamFatima commented 6 years ago

Hi,

I have been trying to run your pipeline on the mm10 data set (as setup by running install_genome_data.sh ). As there were no fast files, I made a bam file from the bed files and am running the pipeline using the following command :

$ bds atac.bds -species mm10 -gensz mm -bam1 ../genomes/mm10_enh_dhs.bam -chrsz ../genomes/mm10/mm10.chrom.sizes

I am not getting the output repository structure as mentioned on your main page, nor a Report.html file. If you could help with this, I would highly appreciate it.

Many thanks. Kind Regards, Reham

leepc12 commented 6 years ago

If you don't explicitly specify output directory with -out_dir [SOMEWHERE] then out/ will be generated on the working directory where you ran the pipeline command. Also skip -gensz mm. it's redundant.

rehamFatima commented 6 years ago

out/ is generated, but it does not have all the files as listed on your main page. Could you please tell me what data have you used to generate all those files ? I am not quite sure whether the discrepancy is in the data set I am using or something is missing from the pipeline installation.

Thank you

rehamFatima commented 6 years ago

Could you also please tell me which fastqs you have used as part of running the pipeline ? as the data set coming from the install_genome_data.sh only contains fasta and bed files.

Many thanks

rehamFatima commented 6 years ago

I have tried running it with different fastqs from various sources, but that crashes down with Error (modules/align_bowtie2.bds, line 44, pos 3): Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)

leepc12 commented 6 years ago

Did you specify species (hg38, mm10, ...) by -species [SPECIES]? Did you install genome data correctly? Did you see a "success" message at the end of the genome data installation shell script?

rehamFatima commented 6 years ago

Thanks Jin, yes I have specified -species mm10 . I have reinstalled the genome data and did see the "success" message. It still does not contain any fastq files. Is it supposed to or are the fastq files to be retrieved from elsewhere ?

leepc12 commented 6 years ago

Please post a full log (including Bowtie2 index (-bwt2_idx) doesn't exists!) for debugging info. Also, what is the full command line that you used for running a pipeline?

rehamFatima commented 6 years ago

The command is e.g :

bds atac.bds -species mm10 -se -gensz mm -fastq1 ../genomes/mm10_no_alt_analysis_set_ENCODE.fastq.gz -chrsz ../genomes/mm10/mm10.chrom.sizes

and the output is :

Picked up _JAVA_OPTIONS: -Xms8G -Xmx8G -XX:ParallelGCThreads=1

== git info
Latest git commit       : d0babc6d2cc0aa8038c5ad35a41ed9876f49619d (Tue Jun 5 18:40:20 2018)
    Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...

== configuration file info
Hostname            : ebi5-164.ebi.ac.uk
Configuration file      : 
Environment file        : /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env

== parallelization info
No parallel jobs        : false
Maximum # threads       : 8

== cluster/system info
Walltime (general)      : 5h50m
Max. memory (general)       : 7G
Force to use a system       : local
Process priority (niceness) : 0
Retiral for failed tasks    : 0
Submit tasks to a cluster queue : 
Unlimited cluster mem./walltime : false
Use --acount instead of SLURM partition     : false
Java temporary directory        : ${TMPDIR}

== shell environment info
Conda env.          : bds_atac
Conda env. for python3      : bds_atac_py3
Conda bin. directory        : 

Shell cmd. for init.        : if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for init.(py3)   : if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for fin.     : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0

Cluster task min. len.      : 60

Cluster task delay          : 0

== output directory/title info
Output dir.         : /homes/reham/ATAC-Seq/atac_dnase_pipelines/out
Title (prefix)          : atac_dnase_pipelines
    Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...

== species settings
Species             : mm10
Species file            : 

Species name (WashU browser)    : mm10
Ref. genome seq. fasta      : 
Chr. sizes file         : ../genomes/mm10/mm10.chrom.sizes
Black list bed          : 
Ref. genome seq. dir.       : 

== ENCODE accession settings
ENCODE experiment accession : 
ENCODE award RFA            : 
ENCODE assay category       : 
ENCODE assay title      : 
ENCODE award                : 
ENCODE lab              : 
ENCODE assembly genome      : 
ENCODE alias prefix     : KLAB_PIPELINE
ENCODE alias suffix     : 

== report settings
URL root for output directory   : 
Genome coord. for browser tracks    : 

== align multimapping settings
# alignments reported for multimapping  : 0

== align bowtie2 settings
Bowtie2 index           : 
Replacement --score-min for bowtie2 : 
Walltime (bowtie2)      : 47h
Max. memory (bowtie2)       : 12G
Extra param. (bowtie2)      : 
Disable index on memory (bowtie2)   : false

== adapter trimmer settings
Maximum allowed error rate for cutadapt : 0.10
Minimum trim. length for cutadapt -m    : 5
Walltime (adapter trimming)     : 23h
Max. memory (adapter trimming)      : 12G

== postalign bam settings
MAPQ reads rm thresh.       : 30
Rm. tag reads with str.     : chrM
No dupe removal in filtering raw bam    : false
Walltime (bam filter)       : 23h
Max. memory (bam filter)    : 12G
Dup marker              : picard
Use sambamba markdup (instead of picard)    : false

== postalign bed/tagalign settings
Max. memory for UNIX shuf           : 12G
No --random-source for UNIX shuf        : false

== postalign cross-corr. analysis settings
Max. memory for UNIX shuf           : 12G
User-defined cross-corr. peak strandshift   : -1
Extra parameters for cross-corr. analysis   : 
Max. memory for cross-corr. analysis        : 15G
Stack size for cross-corr. analysis     :

== callpeak macs2 settings
Genome size (hs,mm)     : mm
Walltime (macs2)        : 23h
Max. memory (macs2)     : 15G
Cap number of peaks (macs2) : 300K
Extra parameters for macs2 callpeak     : 

== callpeak naiver overlap settings
Bedtools intersect -nonamecheck : false

== IDR settings
Append IDR threshold to IDR out_dir : false

== ATAQC settings
TSS enrichment bed      : 
DNase bed for ataqc     : 
Promoter bed for ataqc      : 
Enhancer bed for ataqc      : 
Reg2map for ataqc           : 
Reg2map_bed for ataqc       : 
Roadmap metadata for ataqc  : 
Max. memory for ATAQC           : 20G
Walltime for ATAQC          : 47h

== atac pipeline settings
Type of pipeline            : atac-seq
Align only              : false
# reads to subsample replicates (0 if no subsampling)   : 0
# reads to subsample for cross-corr. analysis   : 25000000
No pseudo replicates            : false
No ATAQC (advanced QC report)       : false
No Cross-corr. analysis         : false
Use CSEM for alignment          : false
Smoothing window for MACS2      : 150
DNase Seq               : false
IDR threshold               : 0.1
Force to use ENCODE3 parameter set  : false
Force to use ENCODE parameter set   : false
Disable genome browser tracks   : false
p-val thresh. for overlapped peaks  : 0.01
MACS2 p-val thresh. for peaks   : 0.01
MACS2 p-val thresh. for BIGWIGs     : 0.01
Enable IDR on called peaks  : false
Automatically find/trim adapters    : false

== checking atac parameters ...
00:00:00.925    Error (modules/align_bowtie2.bds, line 44, pos 3): Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)

leepc12 commented 6 years ago

Please do the following for more info:

$ cat /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
$ ls -l /homes/reham/ATAC-Seq/genomes/mm10
$ ls -l /homes/reham/ATAC-Seq/genomes

Jin



On Mon, Jul 2, 2018 at 8:22 AM, rehamFatima <notifications@github.com>
wrote:

> The command is e.g :
>
> bds atac.bds -species mm10 -se -gensz mm -fastq1 ../genomes/mm10_no_alt_analysis_set_ENCODE.fastq.gz
> -chrsz ../genomes/mm10/mm10.chrom.sizes
>
> and the output is :
>
> Picked up _JAVA_OPTIONS: -Xms8G -Xmx8G -XX:ParallelGCThreads=1
>
>
> == git info
> Latest git commit     : d0babc6d2cc0aa8038c5ad35a41ed9876f49619d (Tue Jun 5 18:40:20 2018)
>   Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
>
>
> == configuration file info
> Hostname          : ebi5-164.ebi.ac.uk
> Configuration file        :
> Environment file      : /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
>
>
> == parallelization info
> No parallel jobs      : false
> Maximum # threads         : 8
>
>
> == cluster/system info
> Walltime (general)        : 5h50m
> Max. memory (general)     : 7G
> Force to use a system     : local
> Process priority (niceness)   : 0
> Retiral for failed tasks  : 0
> Submit tasks to a cluster queue   :
> Unlimited cluster mem./walltime   : false
> Use --acount instead of SLURM partition       : false
> Java temporary directory      : ${TMPDIR}
>
>
> == shell environment info
> Conda env.            : bds_atac
> Conda env. for python3        : bds_atac_py3
> Conda bin. directory      :
>
> Shell cmd. for init.      : if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)
>
> Shell cmd. for init.(py3) : if [[ -f $(which conda) && $(conda env list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate bds_atac_py3; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)
>
> Shell cmd. for fin.       : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
>
> Cluster task min. len.        : 60
>
> Cluster task delay            : 0
>
>
> == output directory/title info
> Output dir.           : /homes/reham/ATAC-Seq/atac_dnase_pipelines/out
> Title (prefix)            : atac_dnase_pipelines
>   Reading parameters from section (default) in file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
>
>
> == species settings
> Species               : mm10
> Species file          :
>
> Species name (WashU browser)  : mm10
> Ref. genome seq. fasta        :
> Chr. sizes file           : ../genomes/mm10/mm10.chrom.sizes
> Black list bed            :
> Ref. genome seq. dir.     :
>
>
> == ENCODE accession settings
> ENCODE experiment accession   :
> ENCODE award RFA          :
> ENCODE assay category     :
> ENCODE assay title        :
> ENCODE award              :
> ENCODE lab                :
> ENCODE assembly genome        :
> ENCODE alias prefix       : KLAB_PIPELINE
> ENCODE alias suffix       :
>
>
> == report settings
> URL root for output directory :
> Genome coord. for browser tracks  :
>
>
> == align multimapping settings
> # alignments reported for multimapping    : 0
>
>
> == align bowtie2 settings
> Bowtie2 index         :
> Replacement --score-min for bowtie2   :
> Walltime (bowtie2)        : 47h
> Max. memory (bowtie2)     : 12G
> Extra param. (bowtie2)        :
> Disable index on memory (bowtie2) : false
>
>
> == adapter trimmer settings
> Maximum allowed error rate for cutadapt   : 0.10
> Minimum trim. length for cutadapt -m  : 5
> Walltime (adapter trimming)       : 23h
> Max. memory (adapter trimming)        : 12G
>
>
> == postalign bam settings
> MAPQ reads rm thresh.     : 30
> Rm. tag reads with str.       : chrM
> No dupe removal in filtering raw bam  : false
> Walltime (bam filter)     : 23h
> Max. memory (bam filter)  : 12G
> Dup marker                : picard
> Use sambamba markdup (instead of picard)  : false
>
>
> == postalign bed/tagalign settings
> Max. memory for UNIX shuf         : 12G
> No --random-source for UNIX shuf      : false
>
>
> == postalign cross-corr. analysis settings
> Max. memory for UNIX shuf         : 12G
> User-defined cross-corr. peak strandshift : -1
> Extra parameters for cross-corr. analysis :
> Max. memory for cross-corr. analysis      : 15G
> Stack size for cross-corr. analysis       :
>
>
> == callpeak macs2 settings
> Genome size (hs,mm)       : mm
> Walltime (macs2)      : 23h
> Max. memory (macs2)       : 15G
> Cap number of peaks (macs2)   : 300K
> Extra parameters for macs2 callpeak       :
>
>
> == callpeak naiver overlap settings
> Bedtools intersect -nonamecheck   : false
>
>
> == IDR settings
> Append IDR threshold to IDR out_dir   : false
>
>
> == ATAQC settings
> TSS enrichment bed        :
> DNase bed for ataqc       :
> Promoter bed for ataqc        :
> Enhancer bed for ataqc        :
> Reg2map for ataqc         :
> Reg2map_bed for ataqc     :
> Roadmap metadata for ataqc    :
> Max. memory for ATAQC         : 20G
> Walltime for ATAQC            : 47h
>
>
> == atac pipeline settings
> Type of pipeline          : atac-seq
> Align only                : false
> # reads to subsample replicates (0 if no subsampling) : 0
> # reads to subsample for cross-corr. analysis     : 25000000
> No pseudo replicates          : false
> No ATAQC (advanced QC report)     : false
> No Cross-corr. analysis           : false
> Use CSEM for alignment            : false
> Smoothing window for MACS2        : 150
> DNase Seq             : false
> IDR threshold             : 0.1
> Force to use ENCODE3 parameter set    : false
> Force to use ENCODE parameter set : false
> Disable genome browser tracks : false
> p-val thresh. for overlapped peaks    : 0.01
> MACS2 p-val thresh. for peaks : 0.01
> MACS2 p-val thresh. for BIGWIGs       : 0.01
> Enable IDR on called peaks    : false
> Automatically find/trim adapters  : false
>
> == checking atac parameters ...
> 00:00:00.925  Error (modules/align_bowtie2.bds, line 44, pos 3): Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)
>
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> <https://github.com/kundajelab/atac_dnase_pipelines/issues/126#issuecomment-401841414>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AIOd_GZ5SjTMj8HWBqaLWCUd0eu31tovks5uCjq6gaJpZM4U3hfe>
> .
>

rehamFatima commented 6 years ago

Thanks Jin,

should I send you the output for these ?

On 2018-07-02 16:28, Jin Lee wrote:

Please do the following for more info:

$ cat /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
$ ls -l /homes/reham/ATAC-Seq/genomes/mm10
$ ls -l /homes/reham/ATAC-Seq/genomes

Jin



On Mon, Jul 2, 2018 at 8:22 AM, rehamFatima <notifications@github.com>
wrote:

> The command is e.g :
> 
> bds atac.bds -species mm10 -se -gensz mm -fastq1
../genomes/mm10_no_alt_analysis_set_ENCODE.fastq.gz
> -chrsz ../genomes/mm10/mm10.chrom.sizes
> 
> and the output is :
> 
> Picked up _JAVA_OPTIONS: -Xms8G -Xmx8G -XX:ParallelGCThreads=1
> 
> 
> == git info
> Latest git commit : d0babc6d2cc0aa8038c5ad35a41ed9876f49619d (Tue
Jun 5 18:40:20 2018)
> Reading parameters from section (default) in
file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
> 
> 
> == configuration file info
> Hostname : ebi5-164.ebi.ac.uk
> Configuration file :
> Environment file :
/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
> 
> 
> == parallelization info
> No parallel jobs : false
> Maximum # threads : 8
> 
> 
> == cluster/system info
> Walltime (general) : 5h50m
> Max. memory (general) : 7G
> Force to use a system : local
> Process priority (niceness) : 0
> Retiral for failed tasks : 0
> Submit tasks to a cluster queue :
> Unlimited cluster mem./walltime : false
> Use --acount instead of SLURM partition : false
> Java temporary directory : ${TMPDIR}
> 
> 
> == shell environment info
> Conda env. : bds_atac
> Conda env. for python3 : bds_atac_py3
> Conda bin. directory :
> 
> Shell cmd. for init. : if [[ -f $(which conda) && $(conda env list |
grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep
5; fi; export
PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds;
set -o pipefail; STARTTIME=$(date +%s)
> 
> Shell cmd. for init.(py3) : if [[ -f $(which conda) && $(conda env
list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate
bds_atac_py3; sleep 5; fi; export
PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds;
set -o pipefail; STARTTIME=$(date +%s)
> 
> Shell cmd. for fin. : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo
"Task has finished (${TASKTIME} seconds)."; sleep 0
> 
> Cluster task min. len. : 60
> 
> Cluster task delay : 0
> 
> 
> == output directory/title info
> Output dir. : /homes/reham/ATAC-Seq/atac_dnase_pipelines/out
> Title (prefix) : atac_dnase_pipelines
> Reading parameters from section (default) in
file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
> 
> 
> == species settings
> Species : mm10
> Species file :
> 
> Species name (WashU browser) : mm10
> Ref. genome seq. fasta :
> Chr. sizes file : ../genomes/mm10/mm10.chrom.sizes
> Black list bed :
> Ref. genome seq. dir. :
> 
> 
> == ENCODE accession settings
> ENCODE experiment accession :
> ENCODE award RFA :
> ENCODE assay category :
> ENCODE assay title :
> ENCODE award :
> ENCODE lab :
> ENCODE assembly genome :
> ENCODE alias prefix : KLAB_PIPELINE
> ENCODE alias suffix :
> 
> 
> == report settings
> URL root for output directory :
> Genome coord. for browser tracks :
> 
> 
> == align multimapping settings
> # alignments reported for multimapping : 0
> 
> 
> == align bowtie2 settings
> Bowtie2 index :
> Replacement --score-min for bowtie2 :
> Walltime (bowtie2) : 47h
> Max. memory (bowtie2) : 12G
> Extra param. (bowtie2) :
> Disable index on memory (bowtie2) : false
> 
> 
> == adapter trimmer settings
> Maximum allowed error rate for cutadapt : 0.10
> Minimum trim. length for cutadapt -m : 5
> Walltime (adapter trimming) : 23h
> Max. memory (adapter trimming) : 12G
> 
> 
> == postalign bam settings
> MAPQ reads rm thresh. : 30
> Rm. tag reads with str. : chrM
> No dupe removal in filtering raw bam : false
> Walltime (bam filter) : 23h
> Max. memory (bam filter) : 12G
> Dup marker : picard
> Use sambamba markdup (instead of picard) : false
> 
> 
> == postalign bed/tagalign settings
> Max. memory for UNIX shuf : 12G
> No --random-source for UNIX shuf : false
> 
> 
> == postalign cross-corr. analysis settings
> Max. memory for UNIX shuf : 12G
> User-defined cross-corr. peak strandshift : -1
> Extra parameters for cross-corr. analysis :
> Max. memory for cross-corr. analysis : 15G
> Stack size for cross-corr. analysis :
> 
> 
> == callpeak macs2 settings
> Genome size (hs,mm) : mm
> Walltime (macs2) : 23h
> Max. memory (macs2) : 15G
> Cap number of peaks (macs2) : 300K
> Extra parameters for macs2 callpeak :
> 
> 
> == callpeak naiver overlap settings
> Bedtools intersect -nonamecheck : false
> 
> 
> == IDR settings
> Append IDR threshold to IDR out_dir : false
> 
> 
> == ATAQC settings
> TSS enrichment bed :
> DNase bed for ataqc :
> Promoter bed for ataqc :
> Enhancer bed for ataqc :
> Reg2map for ataqc :
> Reg2map_bed for ataqc :
> Roadmap metadata for ataqc :
> Max. memory for ATAQC : 20G
> Walltime for ATAQC : 47h
> 
> 
> == atac pipeline settings
> Type of pipeline : atac-seq
> Align only : false
> # reads to subsample replicates (0 if no subsampling) : 0
> # reads to subsample for cross-corr. analysis : 25000000
> No pseudo replicates : false
> No ATAQC (advanced QC report) : false
> No Cross-corr. analysis : false
> Use CSEM for alignment : false
> Smoothing window for MACS2 : 150
> DNase Seq : false
> IDR threshold : 0.1
> Force to use ENCODE3 parameter set : false
> Force to use ENCODE parameter set : false
> Disable genome browser tracks : false
> p-val thresh. for overlapped peaks : 0.01
> MACS2 p-val thresh. for peaks : 0.01
> MACS2 p-val thresh. for BIGWIGs : 0.01
> Enable IDR on called peaks : false
> Automatically find/trim adapters : false
> 
> == checking atac parameters ...
> 00:00:00.925 Error (modules/align_bowtie2.bds, line 44, pos 3):
Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)
> 
> 
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> 
<https://github.com/kundajelab/atac_dnase_pipelines/issues/126#issuecomment-401841414>,
> or mute the thread
> 
<https://github.com/notifications/unsubscribe-auth/AIOd_GZ5SjTMj8HWBqaLWCUd0eu31tovks5uCjq6gaJpZM4U3hfe>
> .
> 
--
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub [1], or mute the
thread [2].

Links:
------
[1]
https://github.com/kundajelab/atac_dnase_pipelines/issues/126#issuecomment-401843182
[2]
https://github.com/notifications/unsubscribe-auth/Aas3Z7KbU-6dm79PMiLj4ZZhu-P7kzifks5uCjwhgaJpZM4U3hfe

leepc12 commented 6 years ago

Please just post them here

On Mon, Jul 2, 2018 at 9:04 AM, rehamFatima notifications@github.com wrote:

Thanks Jin,

should I send you the output for these ?

On 2018-07-02 16:28, Jin Lee wrote:

Please do the following for more info:

$ cat /homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
$ ls -l /homes/reham/ATAC-Seq/genomes/mm10
$ ls -l /homes/reham/ATAC-Seq/genomes

Jin



On Mon, Jul 2, 2018 at 8:22 AM, rehamFatima <notifications@github.com>
wrote:

> The command is e.g :
>
> bds atac.bds -species mm10 -se -gensz mm -fastq1
../genomes/mm10_no_alt_analysis_set_ENCODE.fastq.gz
> -chrsz ../genomes/mm10/mm10.chrom.sizes
>
> and the output is :
>
> Picked up _JAVA_OPTIONS: -Xms8G -Xmx8G -XX:ParallelGCThreads=1
>
>
> == git info
> Latest git commit : d0babc6d2cc0aa8038c5ad35a41ed9876f49619d (Tue
Jun 5 18:40:20 2018)
> Reading parameters from section (default) in
file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
>
>
> == configuration file info
> Hostname : ebi5-164.ebi.ac.uk
> Configuration file :
> Environment file :
/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env
>
>
> == parallelization info
> No parallel jobs : false
> Maximum # threads : 8
>
>
> == cluster/system info
> Walltime (general) : 5h50m
> Max. memory (general) : 7G
> Force to use a system : local
> Process priority (niceness) : 0
> Retiral for failed tasks : 0
> Submit tasks to a cluster queue :
> Unlimited cluster mem./walltime : false
> Use --acount instead of SLURM partition : false
> Java temporary directory : ${TMPDIR}
>
>
> == shell environment info
> Conda env. : bds_atac
> Conda env. for python3 : bds_atac_py3
> Conda bin. directory :
>
> Shell cmd. for init. : if [[ -f $(which conda) && $(conda env list |
grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep
5; fi; export
PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/
reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/
reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/
usr/bin:/usr/local/bin:${HOME}/.bds;
set -o pipefail; STARTTIME=$(date +%s)
>
> Shell cmd. for init.(py3) : if [[ -f $(which conda) && $(conda env
list | grep bds_atac_py3 | wc -l) != "0" ]]; then source activate
bds_atac_py3; sleep 5; fi; export
PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/
reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/
reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/
usr/bin:/usr/local/bin:${HOME}/.bds;
set -o pipefail; STARTTIME=$(date +%s)
>
> Shell cmd. for fin. : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo
"Task has finished (${TASKTIME} seconds)."; sleep 0
>
> Cluster task min. len. : 60
>
> Cluster task delay : 0
>
>
> == output directory/title info
> Output dir. : /homes/reham/ATAC-Seq/atac_dnase_pipelines/out
> Title (prefix) : atac_dnase_pipelines
> Reading parameters from section (default) in
file(/homes/reham/ATAC-Seq/atac_dnase_pipelines/default.env)...
>
>
> == species settings
> Species : mm10
> Species file :
>
> Species name (WashU browser) : mm10
> Ref. genome seq. fasta :
> Chr. sizes file : ../genomes/mm10/mm10.chrom.sizes
> Black list bed :
> Ref. genome seq. dir. :
>
>
> == ENCODE accession settings
> ENCODE experiment accession :
> ENCODE award RFA :
> ENCODE assay category :
> ENCODE assay title :
> ENCODE award :
> ENCODE lab :
> ENCODE assembly genome :
> ENCODE alias prefix : KLAB_PIPELINE
> ENCODE alias suffix :
>
>
> == report settings
> URL root for output directory :
> Genome coord. for browser tracks :
>
>
> == align multimapping settings
> # alignments reported for multimapping : 0
>
>
> == align bowtie2 settings
> Bowtie2 index :
> Replacement --score-min for bowtie2 :
> Walltime (bowtie2) : 47h
> Max. memory (bowtie2) : 12G
> Extra param. (bowtie2) :
> Disable index on memory (bowtie2) : false
>
>
> == adapter trimmer settings
> Maximum allowed error rate for cutadapt : 0.10
> Minimum trim. length for cutadapt -m : 5
> Walltime (adapter trimming) : 23h
> Max. memory (adapter trimming) : 12G
>
>
> == postalign bam settings
> MAPQ reads rm thresh. : 30
> Rm. tag reads with str. : chrM
> No dupe removal in filtering raw bam : false
> Walltime (bam filter) : 23h
> Max. memory (bam filter) : 12G
> Dup marker : picard
> Use sambamba markdup (instead of picard) : false
>
>
> == postalign bed/tagalign settings
> Max. memory for UNIX shuf : 12G
> No --random-source for UNIX shuf : false
>
>
> == postalign cross-corr. analysis settings
> Max. memory for UNIX shuf : 12G
> User-defined cross-corr. peak strandshift : -1
> Extra parameters for cross-corr. analysis :
> Max. memory for cross-corr. analysis : 15G
> Stack size for cross-corr. analysis :
>
>
> == callpeak macs2 settings
> Genome size (hs,mm) : mm
> Walltime (macs2) : 23h
> Max. memory (macs2) : 15G
> Cap number of peaks (macs2) : 300K
> Extra parameters for macs2 callpeak :
>
>
> == callpeak naiver overlap settings
> Bedtools intersect -nonamecheck : false
>
>
> == IDR settings
> Append IDR threshold to IDR out_dir : false
>
>
> == ATAQC settings
> TSS enrichment bed :
> DNase bed for ataqc :
> Promoter bed for ataqc :
> Enhancer bed for ataqc :
> Reg2map for ataqc :
> Reg2map_bed for ataqc :
> Roadmap metadata for ataqc :
> Max. memory for ATAQC : 20G
> Walltime for ATAQC : 47h
>
>
> == atac pipeline settings
> Type of pipeline : atac-seq
> Align only : false
> # reads to subsample replicates (0 if no subsampling) : 0
> # reads to subsample for cross-corr. analysis : 25000000
> No pseudo replicates : false
> No ATAQC (advanced QC report) : false
> No Cross-corr. analysis : false
> Use CSEM for alignment : false
> Smoothing window for MACS2 : 150
> DNase Seq : false
> IDR threshold : 0.1
> Force to use ENCODE3 parameter set : false
> Force to use ENCODE parameter set : false
> Disable genome browser tracks : false
> p-val thresh. for overlapped peaks : 0.01
> MACS2 p-val thresh. for peaks : 0.01
> MACS2 p-val thresh. for BIGWIGs : 0.01
> Enable IDR on called peaks : false
> Automatically find/trim adapters : false
>
> == checking atac parameters ...
> 00:00:00.925 Error (modules/align_bowtie2.bds, line 44, pos 3):
Bowtie2 index (-bwt2_idx) doesn't exists! (file: .1.bt2 or .1.bt2l)
>
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
>
<https://github.com/kundajelab/atac_dnase_pipelines/issues/126#
issuecomment-401841414>,
> or mute the thread
>
<https://github.com/notifications/unsubscribe-auth/AIOd_
GZ5SjTMj8HWBqaLWCUd0eu31tovks5uCjq6gaJpZM4U3hfe>
> .
>
--
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub [1], or mute the
thread [2].

Links:
------
[1]
https://github.com/kundajelab/atac_dnase_pipelines/issues/
126#issuecomment-401843182
[2]
https://github.com/notifications/unsubscribe-auth/Aas3Z7KbU-
6dm79PMiLj4ZZhu-P7kzifks5uCjwhgaJpZM4U3hfe

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/kundajelab/atac_dnase_pipelines/issues/126#issuecomment-401853860, or mute the thread https://github.com/notifications/unsubscribe-auth/AIOd_EHy48O8DwLF7zeTrc2FUT6DHS0nks5uCkSGgaJpZM4U3hfe .

rehamFatima commented 6 years ago



## Get hostname with the following command: 
## $ hostname -f
##
## Configure environment per hostname:
## [hostname1]
## ...
##
## Use the same environment for multiple hostnames:
## [hostname2, hostname3, ...]
## ...
##
## Using group
## [hostname1, hostname2, ... : group]
## [group]
## ...
##
## Using an asterisk in hostnames (IMPORTANT: only one * is allowed in hostnames)
##
## [host*name1]
##
## [*hostname2, hostname3*]

# Stanford Kundaje group clusters (out of SGE)
[vayu, mitra, durga]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
conda_bin_dir   = /software/miniconda3/bin
species_file    = $script_dir/species/kundaje.conf
unlimited_mem_wt= true      # unlimited max. memory and walltime on Kundaje clusters
nice        = 20
nth         = 3

# Stanford Kundaje group clusters (controlled with SGE)
[nandi, kali, amold, wotan, kadru, surya, indra]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
conda_bin_dir   = /software/miniconda3/bin
species_file    = $script_dir/species/kundaje.conf
unlimited_mem_wt= true      # unlimited max. memory and walltime on Kundaje clusters
system      = sge
nice        = 20
nth         = 3

# Stanford NEW SCG
[*.scg.stanford.edu, dper730xd*, hppsl230s*, dper910*, sgiuv*, sgisummit*, smsx10srw*]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
species_file    = $script_dir/species/scg.conf
nth     = 4     # number of threads for each pipeline
system      = slurm     # force to use SLURM SCG
q_for_slurm_account = true  # use --account instead of -p (partition)
cluster_task_delay = 10     # for NFS delayed write

# Stanford OLD SCG4
[scg*.stanford.edu, scg*.local, carmack.stanford.edu, crick.stanford.edu]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
species_file    = $script_dir/species/scg.conf
nth     = 4     # number of threads for each pipeline
system      = sge       # force to use SGE (Sun Grid Engine) on SCG3/4 even though a user doesn't explicitly specify SGE on command line with 'bds -s sge atac.bds ...'
cluster_task_delay = 10     # for NFS delayed write

# Stanford Sherlock clusters
[sherlock*.stanford.edu, sh-*.local, sh-*.int, sh-ln*.stanford.edu]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
species_file    = $script_dir/species/sherlock.conf
nth     = 4     # number of threads for each pipeline
system      = slurm
cluster_task_delay = 30     # for NFS delayed write

# default
[default]
conda_env   = bds_atac
conda_env_py3   = bds_atac_py3
species_file    = /nfs/nobackup/ensembl/reham_ens/genome/bds_atac_species.conf  # DEF_SPECIES_FILE: DO NOT MODIFY THIS COMMENT (install_genome_data.sh WILL NOT WORK).```

rehamFatima commented 6 years ago

$ ls -l /nfs/nobackup/ensembl/reham_ens/genome/mm10/

total 4327335
drwxr-sr-x 2 reham ensembl        347 Jun 29 17:32 ataqc
drwxr-sr-x 2 reham ensembl        429 Jul  2 10:01 bowtie2_index
-rw-r--r-- 1 reham ensembl       1263 Oct 22  2016 mm10.blacklist.bed.gz
-rw-r--r-- 1 reham ensembl       1405 Jul  2 10:01 mm10.chrom.sizes
-rw-r--r-- 1 reham ensembl   26078281 Jul  2 14:22 mm10_enh_dhs.bam
-rw-r--r-- 1 reham ensembl 2785490220 Jul  2 09:46 mm10_no_alt_analysis_set_ENCODE.fasta
-rw-r--r-- 1 reham ensembl       2511 Jul  2 10:01 mm10_no_alt_analysis_set_ENCODE.fasta.fai
-rw-r--r-- 1 reham ensembl  870379095 Jan 23  2016 mm10_no_alt_analysis_set_ENCODE.fasta.gz
drwxr-sr-x 2 reham ensembl       2544 Jul  2 09:46 seq

( I have had to reinstall the genome data to another location )

rehamFatima commented 6 years ago

$ ls -l /nfs/nobackup/ensembl/reham_ens/genome
total 62103
-rw-r--r-- 1 reham ensembl     1124 Jul  2 10:01 bds_atac_species.conf
-rwxr--r-- 1 reham ensembl 26286699 Jun 29 17:45 ERR1659027_1.fastq
-rwxr--r-- 1 reham ensembl 26286699 Jun 29 17:45 ERR1659027_2.fastq
drwxr-sr-x 5 reham ensembl      410 Jul  2 14:22 mm10

and

ls -l /homes/reham/ATAC-Seq/genomes/
total 1901192
-rwxr--r-- 1 reham ensembl 1884648321 Jun 14 14:52 ENCFF124LBK.fastq.gz
drwxr-xr-x 3 reham ensembl       4096 Jun 29 17:32 mm10
-rw-r--r-- 1 reham ensembl   26079643 Jun 14 10:59 mm10_enh_dhs.bam
-rw-r--r-- 1 reham ensembl    2342848 Jun 29 16:30 mm10_enh_dhs.bam.bai
-rw-r--r-- 1 reham ensembl   26079835 Jun 29 16:29 mm10_enh_dhs.bam.sorted

I hope these will help. Many thanks

leepc12 commented 6 years ago

Please add -species_file /nfs/nobackup/ensembl/reham_ens/genome/bds_atac_species.conf to the command line and try again.

rehamFatima commented 6 years ago

Command :

bds atac.bds -species mm10 -species_file /nfs/../genome/bds_atac_species.conf -gensz mm -nth 1 -bam1 /nfs/../genome/mm10/mm10_enh_dhs.bam -system local -no_par

Output :

== checking input files ...

Rep1 bam (PE) :
    /nfs/../genome/mm10/mm10_enh_dhs.bam
Distributing 1 to ... 
{1=1}
awk: cmd. line:1: fatal: division by zero attempted
Task failed:
    Program & line     : 'modules/postalign_bam.bds', line 341
    Task Name          : 'dedup_bam_PE_2 rep1'
    Task ID            : 'atac.bds.20180706_093826_930/task.postalign_bam.dedup_bam_PE_2_rep1.line_341.id_10'
    Task PID           : '1925'
    Task hint          : 'samtools view -F 1804 -f 2 -b /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam; sambamba index -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam; s'
    Task resources     : 'cpus: -1  mem: -1.0 B wall-timeout: 8640000'
    State              : 'ERROR'
    Dependency state   : 'ERROR'
    Retries available  : '1'
    Input files        : '[/homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam]'
    Output files       : '[/homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam, /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.flagstat.qc, /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.pbc.qc]'
    Script file        : '/homes/reham/ATAC-Seq/atac_dnase_pipelines/atac.bds.20180706_093826_930/task.postalign_bam.dedup_bam_PE_2_rep1.line_341.id_10.sh'
    Exit status        : '1'
    Program            : 

        # SYS command. line 343

         if [[ -f $(which conda) && $(conda env list | grep bds_atac | wc -l) != "0" ]]; then source activate bds_atac; sleep 5; fi;  export PATH=/homes/reham/ATAC-Seq/atac_dnase_pipelines/.:/homes/reham/ATAC-Seq/atac_dnase_pipelines/modules:/homes/reham/ATAC-Seq/atac_dnase_pipelines/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

        # SYS command. line 350

         samtools view -F 1804 -f 2 -b /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam

        # SYS command. line 352

         sambamba index -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam

        # SYS command. line 354

         sambamba flagstat -t 1 /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.nodup.bam > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.flagstat.qc

        # SYS command. line 365

         sambamba sort -t 1 -n /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam -o /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam

        # SYS command. line 367

         bedtools bamtobed -bedpe -i /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam | \
                            awk 'BEGIN{OFS="\t"}{print $1,$2,$4,$6,$9,$10}' | \
                            grep -v 'chrM' | sort | uniq -c | \
                            awk 'BEGIN{mt=0;m0=0;m1=0;m2=0} ($1==1){m1=m1+1} ($1==2){m2=m2+1} {m0=m0+1} {mt=mt+$1} END{m1_m2=-1.0; if(m2>0) m1_m2=m1/m2; printf "%d\t%d\t%d\t%d\t%f\t%f\t%f\n",mt,m0,m1,m2,m0/mt,m1/m0,m1_m2}' > /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/qc/rep1/mm10_enh_dhs.nodup.pbc.qc

        # SYS command. line 371

         rm -f /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam

        # SYS command. line 373

         TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
    StdErr (100000000 lines)  :
        awk: cmd. line:1: fatal: division by zero attempted

Fatal error: modules/postalign_bam.bds, line 378, pos 4. Task/s failed.

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.

leepc12 commented 6 years ago

Can you share your BAM files for debugging?

original: /nfs/nobackup/ensembl/reham_ens/genome/mm10/mm10_enh_dhs.bam processed: /homes/reham/ATAC-Seq/atac_dnase_pipelines/out/align/rep1/mm10_enh_dhs.dupmark.bam.tmp.bam

Is your input BAM file filtered/deduped? any quality problem with the input BAM file?

rehamFatima commented 6 years ago

PFA

BamFiles.zip

I just used the bed files within the mm10 genome installed and generated the bam file using bedtools bedtobam. Could you tell how or what should I be looking for in terms of quality problem with the bam file? Thank you

leepc12 commented 6 years ago

That BED file in the mm10 genome data is not for testing purpose.

If you just wanted to test our pipeline. I recommend to use a new pipeline https://github.com/ENCODE-DCC/atac-seq-pipeline.

rehamFatima commented 6 years ago

Thanks Jin, if you could please tell me what data set you used to generate the files on your main page. As I have tried even ATAC seq data sets from ENA.

leepc12 commented 6 years ago

We recommend to use our new pipeline at https://github.com/ENCODE-DCC/atac-seq-pipeline/. This new pipeline supports docker, DNANexus and Google Cloud platform so that you will avoid any dependency problems.

Please read through documentation and then find an example at /examples/scg, download data from the ENCODE portal and try running pipelines.

rehamFatima commented 6 years ago

Thank you Jin. May I ask how the atac-seq uses phantompeakqualtools ? as it is seemingly tripping there. Thanks

leepc12 commented 6 years ago

@rehamFatima: Sure, run_spp.R is used for cross-correlation analysis. Actual shell command lines are here https://github.com/kundajelab/atac_dnase_pipelines/blob/master/modules/postalign_xcor.bds#L113

What kind of error did you get?

leepc12 commented 6 years ago

@rehamFatima I actually don't understand why you want to test our pipeline with a BED file (or its BAM conversion) which is not designed to be used as a valid input to the pipeline.

The error you got in the issue #128 is input data quality problem. Please try with some samples on the ENCODE portal.

Instead of using this old pipeline, can you work with our new pipeline?

A very detailed tutorial with sample data is available at https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/hotfix_PIP-357/docs/tutorial_google.md

You may need to sign up for Google Cloud Platform.

kundajelab / atac_dnase_pipelines

Data set #126