hackseq / hackseq_projects_2017


Project 2: A reproducible template workflow for single-cell DNA methylation data #2

Open abaghela opened 7 years ago

abaghela commented 7 years ago

A reproducible template workflow for single-cell DNA methylation data

DNA methylation is a heritable epigenetic mark that shows a strong correlation with transcriptional activity, and may be detected by whole genome bisulfite sequencing (WGBS). Recently, WGBS has been performed successfully on single cells (SC-WGBS). The resulting data represents a fundamental shift in the capacity to measure and interpret DNA methylation, especially in rare cell types and contexts where subtle cell-to-cell heterogeneity is crucial, such as in stem cells or cancer. However, although some software tools have been published, and several existing studies have tended to use similar methods, no standardized pipeline for the analysis of SC-WGBS data yet exists.

Simultaneously, there has been a drive within bioinformatics towards improved reproducibility. Recreating the exact results of a study requires not only the exact code, but also the exact software. Common Workflow Language (CWL) provides a framework for specifying complete workflows, while Docker allows for bundling of the exact software and auxiliary data used in an analysis within a container that can be executed anywhere. Together, these have the potential to enable completely reproducible bioinformatics research.

At a previous hackathon, the first steps were taken towards developing Screw (https://github.com/Epigenomics-Screw/Screw), a collection of standard tools and workflows for analysing SC-WGBS data, wrapped in CWL and Docker. Screw will include quality control visualisation, clustering and visualisation of cells by pairwise dissimilarity measures, construction of recapitulated-bulk methylomes from single cells of the same lineage, generation of bigWig methylation tracks for downstream visualisation, and wrappers around published tools such as DeepCpG and LOLA.

This project will focus on completing Screw, while also building standardised workflows to analyse a series of public SC-WGBS data sets. This will provide both a complete resource for reproducible SC-WGBS analysis and a first meta-analysis of SC-WGBS data.

Team Lead: Kieran O'Neill | koneill@bcgsc.ca | @oneillkza | Postdoctoral Fellow | BC Genome Sciences Centre

oneillkza commented 7 years ago

So ... software: we need Docker. As far as I can see, ORCA already works by loading a Docker container. It sounds like running Docker inside Docker is possible, but not recommended. Could we get some comment from the ORCA admins on the best way to navigate this? E.g. whether we could deploy our own containers directly, or whether ORCA supports Common Workflow Language.

The hacky, roundabout solution, which defeats the whole purpose of the project, would be to run without Docker and ensure that the ORCA container has everything from our existing container. But it's also likely that we'll be updating what software we need as we go during the hackathon.

Besides that, we'd need:

lchong commented 7 years ago

@sjackman Can you comment on this? Would it be possible to load a different Docker image for Kieran's team when they log onto the ORCA machines?

sjackman commented 7 years ago

Hi, Kieran. cc @tmozgach

Yes, ORCA supports Common Workflow Language (CWL). It has cwltool installed. It'd be good to test it out to ensure that it works for your purpose. It does not have Arvados installed.

> and ensure that the ORCA container has everything from our existing container

Here's the list of software installed on ORCA: https://github.com/bcgsc/orca/blob/master/versions.tsv Can you check whether any software is missing?

> It sounds like running Docker inside Docker is possible

We'll have to discuss this and get back to you.

sjackman commented 7 years ago

@oneillkza Do you run the CWL pipeline inside a Docker container, or does your CWL pipeline launch Docker containers?

oneillkza commented 7 years ago

@sjackman it launches containers. (This is basically the default cwltool behaviour.)

In our case, it's actually one container for all of the CWL tools, hence my saying we could bundle things up in the standard ORCA container. One tricky issue is that we also bundle up the Screw codebase inside the container, so as we hack on it, we'd need to constantly update the container.
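To illustrate the point about cwltool launching containers, here is a minimal sketch of a CWL tool description that asks for a Docker image, written out by a small shell helper. The helper name `write_example_tool`, the `echo` tool, and the image name are all hypothetical placeholders; this is not taken from the Screw repository.

```shell
#!/bin/sh
# Write a minimal, hypothetical CWL tool description to the path given as $1.
# The image name is a placeholder, not the actual Screw image.
write_example_tool() {
    cat > "$1" <<'EOF'
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
hints:
  DockerRequirement:
    dockerPull: example/placeholder-image
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
EOF
}
```

Given a description like this, cwltool would by default try to run the tool inside the named image. If several tool descriptions point at the same image, updating any bundled code means rebuilding that one image.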

sjackman commented 7 years ago

As a first pass, would you try running your pipeline using cwltool inside the bcgsc/orca container, and configure cwltool not to launch any containers?

sjackman commented 7 years ago

We haven't created the ORCA accounts yet for Hackseq, but we can create yours first if you'd like to give that a go.

oneillkza commented 7 years ago

Yeah, that'd be a reasonable solution -- it's easy enough to use the --no-container flag in cwltool. We can test the Docker functionality on our local machines on toy examples, and run the pipeline in anger on ORCA but using --no-container.
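As a rough sketch of that plan, the helper below only prints the cwltool command for each setting; it does not run anything. The function name `cwl_cmd` and the file names `workflow.cwl` and `job.yml` are hypothetical placeholders, not files from the Screw repository.

```shell
#!/bin/sh
# Hypothetical helper: print the cwltool command line for a given mode.
#   "docker"      -> default cwltool behaviour (tools run in containers)
#   anything else -> add --no-container so tools run from the host PATH
cwl_cmd() {
    mode=$1
    workflow=$2
    job=$3
    if [ "$mode" = "docker" ]; then
        printf 'cwltool %s %s\n' "$workflow" "$job"
    else
        printf 'cwltool --no-container %s %s\n' "$workflow" "$job"
    fi
}

# Command intended for ORCA (file names are placeholders).
cwl_cmd orca workflow.cwl job.yml
```

The same workflow and job files would be used in both settings; only the flag changes, which keeps the local Docker tests and the ORCA runs comparable.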

Re: list of software, most of this is described in the following Dockerfiles. If you could add these to the ORCA Dockerfile, that should do it!

https://github.com/Epigenomics-Screw/Screw/blob/master/docker/base/Dockerfile
https://github.com/Epigenomics-Screw/Screw/blob/master/docker/screw/Dockerfile

Thanks!

(And yes please to getting an ORCA account for pre-testing.)

sjackman commented 7 years ago

Great. I've asked Brendan to create an ORCA account for you. In the meantime, you can test out the ORCA Docker image on your own hardware if you like: https://hub.docker.com/r/bcgsc/orca/

$ docker run -it bcgsc/orca

Note that it's a very large image, many gigabytes.

sjackman commented 7 years ago

R is installed, but the R packages are not pre-installed. You'll have to do that yourself. @tmozgach Please add methpipe to the ORCA image.

tmozgach commented 7 years ago

@sjackman Should the following software be in ORCA image for hackseq?

Install nano, vim, emacs, man-db, and methpipe

sjackman commented 7 years ago

Yes, please. Thanks, Tanya. Please also brew install less if the command less is not already in the PATH. And bzip2 and xz if they're not already in the PATH.
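A quick way to check which commands are missing from the PATH could look like the sketch below. The function name `missing_tools` is a hypothetical helper; the tool names are the ones mentioned in this thread.

```shell
#!/bin/sh
# Print the name of each given command that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools mentioned in this thread; prints nothing if all are present.
missing_tools less bzip2 xz
```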

tmozgach commented 7 years ago

@sjackman I will add these and start building a new image on the 16th of September. Before then, would it be possible to ask the team leaders exactly what software they need, or to think about what else we should add?

sjackman commented 7 years ago

The above are all installed.

$ which less bzip2 gzip xz
/usr/bin/less
/home/linuxbrew/.linuxbrew/bin/bzip2
/bin/gzip
/home/linuxbrew/.linuxbrew/bin/xz

sjackman commented 7 years ago

This issue is for Project 2. Could you please post in each of the other project issues pointing each team leader to the list of installed software, and asking if they need any software missing from that list?

lchong commented 7 years ago

Hi @tmozgach @sjackman

I've already asked all the team leaders to post a list of required software in their respective project issues. But I'll also start a new issue summarizing people's requests so that it's all centralized, and I'll also remind them to give feedback (not everyone has done so yet).

sjackman commented 7 years ago

Thanks, Lauren!

jakelever commented 6 years ago

Hey team lead (@oneillkza), we've been gathering GitHub IDs for your team members. From your description, it sounds like you plan to use the existing Screw repo for this project. If that's the case, could you please add the people below as collaborators to that project? Or if you'd prefer, we can make a repo in the hackseq organisation and sort out membership for you.

cmorganl klimstef sibylgisela jesszha jjonphl adammendoza

Once the people are added, it'd be a great idea to start a discussion on that repo with information to get your team members started (e.g. some small suggested reading, things to look up, etc). We will also be adding everyone to Slack and creating a specific channel for each project. This may be an easier way to communicate.

We'll forward on any remaining GitHub IDs through this issue.

Thanks, Jake, on behalf of the Hackseq organising committee