kundajelab / atac_dnase_pipelines

ATAC-seq and DNase-seq processing pipeline
BSD 3-Clause "New" or "Revised" License

Comparing Treatments with Multiple Paired End Replicates #125

Open: raboul101 opened this issue 6 years ago

raboul101 commented 6 years ago

I have an ATACseq data set that includes three different treatments, each with three biological replicates. I have paired-end fastq files for each replicate. The question is: Can the fastqs for each treatment be run through the pipeline simultaneously, or must they be run separately and then compared through post-processing?

If treatments can be run simultaneously, could you provide an example of how to properly phrase the BDS command? For more clarification, see below:

The Usage section of "https://github.com/kundajelab/atac_dnase_pipelines" states the following:

"For multiple replicates (PE), specify fastqs with -fastq[REPID][PAIRID]. Add -fastq[][] for each replicate and pair to the command line:replicates.

-fastq1_1 [READ_REP1_PAIR1] -fastq1_2 [READ_REP1_PAIR2] -fastq2_1 [READ_REP2_PAIR1] -fastq2_1 [READ_REP2_PAIR2] .."

This seems to suggest that one can only enter bioreps for one treatment, e.g. -fastq1_1 trt1_rep1_R1.fastq.gz -fastq1_2 trt1_rep1_R2.fastq.gz -fastq2_1 trt1_rep2_R1.fastq.gz and so on. I don't see any clear way to denote treatment. An example of this would be very helpful, if possible.

akundaje commented 6 years ago

You should run each treatment separately (bioreps for each treatment together), then use a differential analysis package to identify differential peaks. You can use the union of the naive-overlap peaks across all conditions as your complete peak set. Quantify read counts in each peak for every replicate and treatment, then run those counts through DESeq2, edgeR, or another differential count analysis method.
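A minimal sketch of this two-stage approach, assuming the old BDS pipeline (`bds atac.bds`) plus `bedtools` are installed; the FASTQ/BAM/peak file names here are hypothetical, and flags and output paths may differ in your installation:

```bash
# Stage 1: one pipeline run per treatment, with all of that
# treatment's bioreps in the same call (file names are illustrative).
bds atac.bds -species hg38 -out_dir trt1 \
  -fastq1_1 trt1_rep1_R1.fastq.gz -fastq1_2 trt1_rep1_R2.fastq.gz \
  -fastq2_1 trt1_rep2_R1.fastq.gz -fastq2_2 trt1_rep2_R2.fastq.gz \
  -fastq3_1 trt1_rep3_R1.fastq.gz -fastq3_2 trt1_rep3_R2.fastq.gz
# Repeat the same call for trt2 and trt3 with their own FASTQs.

# Stage 2 (post-processing): union the naive-overlap peak sets from
# the per-treatment runs (paths to these files depend on your output
# layout), then count reads per union peak per replicate BAM.
zcat trt1_naive_overlap.narrowPeak.gz \
     trt2_naive_overlap.narrowPeak.gz \
     trt3_naive_overlap.narrowPeak.gz |
  sort -k1,1 -k2,2n | bedtools merge -i - > union_peaks.bed

# BAMs must be coordinate-sorted and indexed for multicov; the
# resulting count matrix is the input to DESeq2/edgeR (run in R).
bedtools multicov -bed union_peaks.bed \
  -bams trt1_rep1.bam trt1_rep2.bam trt1_rep3.bam \
        trt2_rep1.bam trt2_rep2.bam trt2_rep3.bam \
        trt3_rep1.bam trt3_rep2.bam trt3_rep3.bam > counts.txt
```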

raboul101 commented 6 years ago

Thank you. But now I have another question: I see that the pipeline version I used (downloaded sometime in April) is now deprecated. Should I abandon my previous results and go with the new pipeline? The new version is implemented through Docker (which I have installed), but the instructions for use are somewhat bewildering. Where is the best place to look for a clear set of usage instructions? ... I am not familiar with DNAnexus, and it appears to be a fee-based service.

akundaje commented 6 years ago

No, you don't have to re-run the pipeline. It's the same pipeline, just dockerized, so it installs more easily on several platforms. We will improve the installation and usage instructions for the new version (@leepc12 note: we need to improve documentation for the new version of the pipelines). I would suggest switching to it when you can, because we will only be developing the Docker version going forward.

akundaje commented 6 years ago

@raboul101 Could you give us specifics on which parts of the installation process for the new pipeline you found confusing? We are starting to improve the documentation, so it is best to get specific feedback from users. Thanks!

leepc12 commented 6 years ago

https://encode-dcc.github.io/wdl-pipelines/install.html#local-computer-with-docker

@raboul101: We are sorry about that; we wanted unified documentation for all the pipelines, but that ended up confusing users. We will update the documentation. Until then, please let me know which step confused you. Also, please feel free to post issues on the new pipeline's GitHub repo (or here).

[MINICONDA3_INSTALL_DIR]: where you installed miniconda3
[WDL_PIPELINE_DIR]: where you installed the pipeline (the git directory)

java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run atac.wdl -i input.json -o workflow_opts/docker.json
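Concretely, assuming you run this from the pipeline's git directory and fetch the matching Cromwell release first (the jar URL below is my assumption of the standard Broad release location), a local Docker run looks like:

```bash
cd [WDL_PIPELINE_DIR]
# Fetch the Cromwell jar if it is not already present (assumed URL).
wget -N https://github.com/broadinstitute/cromwell/releases/download/30.2/cromwell-30.2.jar
java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run atac.wdl \
  -i input.json -o workflow_opts/docker.json
```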

The new pipeline takes a JSON input file instead of parameters defined as command-line arguments. The input.json format is described here: https://encode-dcc.github.io/wdl-pipelines/input_json_atac.html

You can find example input JSON files in /examples/klab/. You may need to change the genome TSV file path and the paths to your FASTQs.
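As a rough sketch, a minimal input.json could be written like this; the key names and the FASTQ array nesting are illustrative assumptions, so check the input_json_atac.html description above for the exact schema:

```bash
# Key names and nesting below are assumptions; verify against the docs.
cat > input.json <<'EOF'
{
    "atac.pipeline_type" : "atac",
    "atac.genome_tsv" : "/path/to/genome/hg38.tsv",
    "atac.paired_end" : true,
    "atac.fastqs" : [
        [["trt1_rep1_R1.fastq.gz", "trt1_rep1_R2.fastq.gz"]],
        [["trt1_rep2_R1.fastq.gz", "trt1_rep2_R2.fastq.gz"]],
        [["trt1_rep3_R1.fastq.gz", "trt1_rep3_R2.fastq.gz"]]
    ]
}
EOF
```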

raboul101 commented 6 years ago

Sorry for the late reply. Since I already have results from the old pipeline, I haven't proceeded with installing the Docker-based pipeline. What confused me, however, was the input.json file. As I understand the new process, you:

1. download the genome data (along with the associated genome TSV; this downloads with the genome, correct?),
2. add that genome TSV, along with input files and desired options, to the input.json file, and
3. run the pipeline with the command listed in the previous comment (above).

So,

Where does one obtain a template input.json, or, if it has to be created de novo, what is the proper format?

What is the backend.conf file, and where is it?

My main hang-up is where to get, or how to create, the .json; I think clearing that up will help greatly. And thank you for putting these pipelines together; they are a great resource.

leepc12 commented 6 years ago

There are many template input JSON files in /examples/ (one subdirectory per platform); for running the pipeline locally, pick any JSON in /examples/klab/.

backend.conf is in /backends/.

We strongly recommend that users run the pipeline with Docker so that annoying dependency issues do not occur.

Sorry, I am still working on the documentation and will update it soon.

raboul101 commented 6 years ago

What is the full path for those JSON examples? I don't see them on GitHub under kundajelab/atac_dnase_pipelines/examples.

leepc12 commented 6 years ago

The new pipeline repo is https://github.com/ENCODE-DCC/atac-seq-pipeline/
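Putting the pieces together against that repo, a full local run might look like the sketch below; the template filename is hypothetical (use whichever JSON sits in examples/klab/), and the Cromwell jar is assumed to be downloaded as in the earlier comment:

```bash
git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
cd atac-seq-pipeline
cp examples/klab/template.json input.json   # hypothetical filename
# Edit input.json: point the genome TSV and FASTQ paths at your files.
java -jar -Dconfig.file=backends/backend.conf cromwell-30.2.jar run atac.wdl \
  -i input.json -o workflow_opts/docker.json
```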

raboul101 commented 6 years ago

Aha! That clears it up. Thank you again.

vervacity commented 6 years ago

Hi llz-hiv, please repost this as a separate issue, as it is not related to the above thread. Also, please consider subscribing to our pipelines Google group, which may have additional useful information as you plan downstream analyses :) https://groups.google.com/forum/#!forum/klab_genomic_pipelines_discuss