slowkow / CENTIPEDE.tutorial

:bug: How to use CENTIPEDE to determine if a transcription factor is bound.
https://slowkow.github.io/CENTIPEDE.tutorial

About FIMO finding putative TFBS in the tutorial. #12

Open alexyfyf opened 6 years ago

alexyfyf commented 6 years ago

I read the CENTIPEDE manuscript. I think the author scanned all putative TFBS across the genome.

However, in your tutorial, you suggested using FIMO only to obtain TFBS in peak regions.

Is it proper to do that? Do you have any comments about this?

Thank you.

slowkow commented 6 years ago

Thanks for the question!

At the time of writing, I was mostly concerned with how to get the data into the right format so that we can run CENTIPEDE in the first place.

  1. It would be fantastic if you could provide a concrete example from the text that highlights the difference between the authors' manuscript and my tutorial. When I created this tutorial, it was not very clear to me exactly how the method was used. I tried my best, but it is possible that I made several mistakes.

  2. You might consider updating this tutorial to be more similar to the original manuscript, as you indicated. If you'd like to make a pull request, I'd be very happy to review it. Thanks for your consideration!

alexyfyf commented 6 years ago

Thanks for your reply.

CENTIPEDE applies a hierarchical Bayesian mixture model to infer regions of the genome that are bound by particular transcription factors. It starts by identifying a set of candidate binding sites (e.g., sites that match a certain position weight matrix (PWM)), and then aims to classify the sites according to whether each site is bound or not bound by a TF. CENTIPEDE is an unsupervised learning algorithm that discriminates between two different types of motif instances using as much relevant information as possible. In brief, the procedure is as follows:

1. Scan the genome for all approximate matches to a target PWM of interest. Each site that matches the PWM is considered a candidate binding site (Section 2.1).

We scanned the human genome sequence (hg18) for matches to each PWM using our implementation of the following commonly used formula [2]:

This is from the supplementary material of the CENTIPEDE paper. So I assume the authors scanned the whole genome rather than only the peak regions (hotspots).
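
To make "scanning for approximate PWM matches" concrete, here is a minimal Python sketch. The formula from the supplement is not reproduced in the quote above, so this uses a generic log-odds PWM score against a uniform background, which is an assumption and not necessarily the authors' exact formula. The function names `log_odds_pwm` and `scan`, the pseudocount, and the threshold are all hypothetical, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=None, pseudocount=1.0):
    """Convert per-position base counts into a log2-odds matrix.

    Assumption: uniform background (0.25 per base) unless one is given.
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def scan(sequence, pwm, threshold):
    """Return (start, score) for every forward-strand window with score >= threshold."""
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing N or other non-ACGT characters.
        if any(base not in BASES for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Toy motif that strongly prefers "ACG" (hypothetical counts).
counts = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
pwm = log_odds_pwm(counts)
hits = scan("TTACGTT", pwm, threshold=3.0)
```

Running the same scan over whole chromosomes instead of peak sequences is what distinguishes the genome-wide approach from a peak-restricted one; the scoring itself does not change.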

But I'm not sure how much difference it will make. I haven't tested on any data yet.

slowkow commented 6 years ago

After reading the supplement again, I think you're right. Thanks for pointing out the difference.

It seems the authors consider all sites that have an approximate PWM match, regardless of other evidence such as ChIP-seq data.

In contrast, my tutorial only considers sites that have strong evidence of a DNase-seq peak.

Looking back on this, I probably found it a bit odd to consider that a site can be classified as "bound by a TF" even though it does not have any DNase-seq data. That might be the reason that I decided to run the analysis only on genomic loci with DNase-seq peaks.
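
As a rough illustration of the difference between the two setups, the sketch below starts from genome-wide motif matches and keeps only those that fall inside a DNase-seq peak. It assumes half-open `(chrom, start, end)` coordinates, as in BED files, and requires a site to lie entirely within a peak. The function names `merge_peaks` and `sites_in_peaks` and the example coordinates are hypothetical and are not part of the tutorial or of CENTIPEDE.

```python
from bisect import bisect_right
from collections import defaultdict

def merge_peaks(peaks):
    """Merge overlapping or touching peaks per chromosome; returns {chrom: [[start, end], ...]}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(peaks):
        merged = by_chrom[chrom]
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return by_chrom

def sites_in_peaks(sites, peaks):
    """Keep only the sites that lie entirely inside some (merged) peak."""
    merged = merge_peaks(peaks)
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in merged.items()}
    kept = []
    for chrom, start, end in sites:
        if chrom not in merged:
            continue
        # Index of the last merged peak whose start is <= the site start.
        i = bisect_right(starts[chrom], start) - 1
        if i >= 0 and end <= merged[chrom][i][1]:
            kept.append((chrom, start, end))
    return kept

# Hypothetical coordinates for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
sites = [
    ("chr1", 120, 130),  # inside a peak
    ("chr1", 290, 310),  # runs past the peak end
    ("chr1", 250, 260),  # inside the merged peak
    ("chr3", 1, 5),      # chromosome without peaks
    ("chr2", 40, 55),    # starts before the peak
]
kept = sites_in_peaks(sites, peaks)
```

Passing all of `sites` to the model corresponds to the genome-wide setup described in the supplement, while passing only `kept` corresponds to the peak-restricted setup in the tutorial.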

I think it might be interesting to see how you decide to set up your own analysis. Please feel free to share your findings!