Project 4: Pseudo-WGS variant calling for common cell types aggregating ChIP-seq, RNA-seq and DHS from ENCODE and Roadmap Epigenomics data

ttimbers commented 8 years ago

Project: Pseudo-WGS variant calling for common cell lines mining ENCODE and Roadmap Epigenomics data

Recent progress in sequencing technologies has led to a massive production of epigenetic datasets for many cell lines, thanks mainly to big consortia like ENCODE. This resource has helped tremendously in understanding the role of non-coding variants and in prioritizing the search for putative causal SNPs through cell-type-specific functional regions such as active or repressed promoters, enhancers, insulators, and repressed or open chromatin. On the other hand, we still lack genotype information for almost all of the profiled cell types, which limits the power of allele-specific alignments and of further analyses that require the right reference genome.

To fill this gap, the goal of this project is to recover genotype information such as SNPs for common cell types by aggregating ChIP-seq, RNA-seq, DHS and other sequencing data already produced and available from the ENCODE and Roadmap Epigenomics consortia. The basic idea is that although a single assay for a given cell line doesn't have enough coverage in many regions of the genome, aggregating sequencing information from all the available assays for each cell type provides enough power to recover variants with high confidence.

Given the limited amount of time, we can establish the performance of the proposed pipeline by analyzing one common cell line (for example GM12878) for which WGS data is already available. If we have time, we can also create a simple web app to let people explore any region for SNPs of interest. If this initial phase is successful, the long-term vision is to provide VCF files for many cell types, ready for the community to use, by mining already available public data.
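The aggregation idea in the project description can be illustrated with a toy sketch. It is not part of any established pipeline: the per-site depths, the assay labels, and the minimum depth of 10 reads are all made-up assumptions, and `pooled_depth` and `callable_sites` are hypothetical helper names. The point is only that sites too shallow in every single assay can pass a depth threshold once reads are pooled.

```python
from collections import Counter


def pooled_depth(per_assay_depth):
    """Sum per-site read depth across all assays of one cell type."""
    total = Counter()
    for depth in per_assay_depth.values():
        total.update(depth)  # Counter.update adds counts site by site
    return dict(total)


def callable_sites(depth, min_depth):
    """Return the sites whose depth reaches the (assumed) calling threshold."""
    return sorted(site for site, d in depth.items() if d >= min_depth)


# Toy data (invented for illustration): depth at (chromosome, position).
assays = {
    "ChIP-seq": {("chr1", 100): 4, ("chr1", 200): 1},
    "RNA-seq": {("chr1", 100): 3, ("chr1", 300): 9},
    "DHS": {("chr1", 100): 5, ("chr1", 200): 2},
}

MIN_DEPTH = 10  # assumed threshold, not a recommendation

pooled = pooled_depth(assays)
for name, depth in assays.items():
    print(name, callable_sites(depth, MIN_DEPTH))
print("pooled", callable_sites(pooled, MIN_DEPTH))
```

In this toy input no single assay reaches the threshold anywhere, while the pooled depth at `("chr1", 100)` does. In practice the depths would come from the aligned reads themselves rather than from hand-written dictionaries.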

Project Lead: Luca Pinello / @lucapinello / Assistant Professor / Dana-Farber Cancer Institute

sjackman commented 8 years ago

We're planning to have a Docker image with a bunch of bioinformatics software preinstalled, running on machines at the BC Cancer Agency Genome Sciences Centre during the Hackathon. Which bioinformatics software do you plan to use for your project? In particular, is there any software that you plan to use that is not already listed here? http://www.bcgsc.ca/services/orca

lucapinello commented 8 years ago

Hi Shaun, thanks for asking, could you please add GATK?

Also, what is the memory limit for those images?

Thanks,

Luca

ttimbers commented 8 years ago

@lucapinello I received a question about your project from a potential hackseq participant:

Could you kindly provide some details on what the proposed pipeline for the project entails (for example, does it go from alignment through variant calling, and which software is used for SNP calling) and what kind of programming skills are required?

Any info you can provide to answer this question would be of great help to this potential participant, as well as others! Thanks!

lucapinello commented 8 years ago

Hi Tiffany, happy to provide more details about the project.

Some basic programming skills are required. Web development skills can also be useful but are not required.

We will probably start from already aligned files (.bam) from the ENCODE project and realign some data only if necessary.

As for software, we are going to use some of these tools: fastqc, samtools, bowtie2, hisat, gatk, bedtools, picard...

To validate the pipeline we will use data from Genome in a Bottle: https://sites.stanford.edu/abms/giab
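As a rough illustration of what validating against a truth set could look like, here is a minimal sketch that compares called variants to reference variants by exact match on (chromosome, position, ref, alt). The variants below are invented, and `compare_calls` is a hypothetical helper, not part of any tool listed above. A real comparison would likely also need to normalize variant representation and restrict the evaluation to regions where the truth set is considered reliable.

```python
def compare_calls(called, truth):
    """Count true/false positives and false negatives by exact variant match."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


# Toy variants (invented for illustration): (chrom, pos, ref, alt).
called = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "A"),
]
truth = [
    ("chr1", 100, "A", "G"),
    ("chr2", 50, "G", "A"),
    ("chr2", 75, "T", "C"),
    ("chr3", 10, "A", "T"),
]

stats = compare_calls(called, truth)
print(stats)
```

Because pooled ChIP-seq, RNA-seq and DHS reads cover only part of the genome, recall over the whole truth set is expected to be low; reporting it separately for sufficiently covered regions would probably be more informative.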

Since it is a hackathon, I don't have an already established pipeline. The fun is to build it together!

Thank you, and I'm looking forward to the event.

Luca

sjackman commented 8 years ago

> thanks for asking, could you please add GATK?

GATK has a non-commercial license agreement that makes it difficult to install in an automated fashion. I'll see what I can do. I believe there's a free version, but I don't believe that it includes all the tools. If it's not possible to get it in the image, we can install it at the start of the hackathon.

> Also what is memory limit for those images?

The machines have 16 CPUs, 64 GB of RAM and 1 TB of hard drive space.

lucapinello commented 8 years ago

Hi Shaun, sounds good!

Thanks,

Luca

lucapinello commented 7 years ago

Hi Shaun, were you able to install and test the latest version of GATK on the machines?

Thanks,

Luca

sjackman commented 7 years ago

Nearly. I have created a formula for GATK for Homebrew-Science at https://github.com/Homebrew/homebrew-science/pull/4426 but I'm encountering a very strange error. I'll hopefully have it sorted out before Hackseq. The workaround, installing GATK manually, is easy enough.

lucapinello commented 7 years ago

Thanks a lot!

sjackman commented 7 years ago

GATK is in Homebrew-Science now! `brew install homebrew/science/gatk`
