Project 5: A framework to evaluate profiles from DNA-binding site collections represented in peak sequences from ChIP-Seq assays

ttimbers commented 8 years ago

Project: Transcription factor binding sites are important regulatory elements found upstream or downstream of a gene's transcription start site. These DNA-binding sites are inexact, are often represented as a matrix of positional probabilities, and appear to have slightly different affinities across different ChIP-Seq assays. Here, we propose a framework to evaluate profiles from DNA-binding site collections (JASPAR, HocoMoco, UniPROBE, Jolma et al., TRANSFAC) against what is found in peaks called from ChIP-Seq assays. The input is a position-weight matrix (PWM) representing the DNA profile for a given binding site of interest. The first step would automatically query the ENCODE project's API for experiments targeting the gene that corresponds to the profile. The sequences at the respective peaks would then be extracted and scanned with the PWM. The goal is to find how well the PWM agrees with the experimental data. The output would be a summary of the profile's representation across sequences, and statistics on the number of possible matches found per sequence. Depending on which experiments are queried, further aims can include:

- Comparing the profiles from alternative databases and versions to identify the most accurate representation per experiment.
- Determining whether a database better represents a given organism's binding sites (mouse or human).
- Using the same approach, identifying profiles for binding sites not targeted by the experiment but also frequently located in the peaks.
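
As a rough illustration of the scanning and summary steps described above, here is a minimal pure-Python sketch. It is not the project's implementation and does not use MOODS or the MEME Suite: the function names (`pfm_to_pwm`, `scan`, `summarize`), the pseudocount, the uniform background, the score threshold, and the toy GATA-like count matrix are all assumptions made for the example. It converts a count matrix to log-odds scores, scans both strands of each peak sequence, and reports the number of matches per sequence.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.5, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    Assumes a uniform background; the pseudocount avoids log(0).
    """
    pwm = []
    for column in pfm:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def reverse_complement(seq):
    """Reverse-complement a DNA string; non-ACGT symbols pass through."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(seq, pwm, threshold):
    """Return (position, strand, score) for each window scoring >= threshold.

    Minus-strand positions are offsets into the reverse-complemented sequence.
    """
    hits = []
    width = len(pwm)
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            if any(base not in BASES for base in window):
                continue  # skip windows containing N or other ambiguity codes
            score = sum(pwm[j][window[j]] for j in range(width))
            if score >= threshold:
                hits.append((i, strand, score))
    return hits


def summarize(peaks, pwm, threshold):
    """Count matches per peak sequence and the fraction of peaks with a match."""
    counts = {name: len(scan(seq, pwm, threshold)) for name, seq in peaks.items()}
    with_hit = sum(1 for c in counts.values() if c > 0)
    fraction = with_hit / len(counts) if counts else 0.0
    return {"hits_per_sequence": counts, "fraction_with_hit": fraction}


# Toy data (hypothetical): a 4-column count matrix with consensus GATA
# and three short stand-ins for peak sequences.
toy_pfm = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
toy_peaks = {"peak1": "CCGATACC", "peak2": "CCCCCCCC", "peak3": "TTTATCTT"}

toy_pwm = pfm_to_pwm(toy_pfm)
summary = summarize(toy_peaks, toy_pwm, threshold=5.0)
print(summary)
```

A fixed score threshold keeps the sketch short; a real comparison across databases would more likely pick thresholds from a p-value or a background score distribution, which is what dedicated PWM scanners provide.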

Ideally, this project would involve about 1.5-2.0 days of development and 1-1.5 days of experimentation, attempting to answer questions using the framework. Useful skills for this project would include:

- Software development, scripting, object-oriented programming, REST APIs.
- Experience with transcription factor binding sites, motif discovery.
- Prior research with transcription factors and co-factor interactions.

Project Lead: Manuel Belmadani / @mbelmadani / Industry Professional / University of British Columbia

sjackman commented 8 years ago

We're planning to have a Docker image with a bunch of bioinformatics software preinstalled running on machines at the BC Cancer Agency Genome Sciences Centre during the Hackathon. Which bioinformatics software do you plan to use for your project? In particular, is there any software that you plan to use that is not already listed here? http://www.bcgsc.ca/services/orca

mbelmadani commented 8 years ago

Hi Shaun,

Most of the tools I had in mind are either straightforward to install or already in the ORCA image. But just in case, here's a few extras I was thinking of using:

MOODS - https://www.cs.helsinki.fi/group/pssmfind/ - PWM matching algorithms

Also, MOODS requires a C++ compiler and probably the package python-dev (headers needed to build Python C extensions.)

I see that MEME is listed in the ORCA software. Is this the entire MEME Suite, or just the MEME motif discovery tool? The MEME Suite also includes a bunch of relevant tools we may use, so I wouldn't mind having the whole suite installed, if possible. http://meme-suite.org/doc/download.html?man_type=web

On the same page, there's the "Motif Databases" link, which we'd probably need. They're just plain text files, but it would be great if you could also pre-download them onto the image. Let me know where they will sit on the filesystem!

That's all I can think of right now. Thanks for looking into this!

Cheers,

sjackman commented 8 years ago

Hi, Manuel. I believe MEME is the whole suite: http://meme-suite.org/meme-software/4.10.1/meme_4.10.1_3.tar.gz

It's installed using Homebrew/Linuxbrew:

brew install homebrew/science/meme

See http://brew.sh and http://linuxbrew.sh

I'll create a ticket to install MOODS. https://github.com/hackseq/October_2016/issues/41

We'll download data/databases at the start of the Hackathon, unless they're unusually large and downloading is expected to delay getting productive, in which case we can look into downloading them in advance.

mbelmadani commented 8 years ago

That's great, thanks!

It should be fine to download the databases the first day of the Hackathon.

mbelmadani commented 8 years ago

I'll start listing some papers that might provide context for this project. I'm not expecting anyone to read all of this, but definitely some of the ideas introduced in these can be relevant. I'll also update/edit this post with more content as I go.

Background:

Potentially useful ideas from the literature:
