rajanil / msCentipede

A hierarchical multiscale model for inferring transcription factor binding from chromatin accessibility data.
MIT License
25 stars 6 forks source link

General advices : msCENTIPEDE on ATAC-seq #11

Open PFRoux opened 8 years ago

PFRoux commented 8 years ago

Hi !

I am trying tu use msCENTIPEDE to generate TF footprints for several ATAC-seq data sets i've produced.

I wanted to have general advices on how to achieve the most accurate workflow.

So far, what I am doing is the following :

1) Call ATAC-peaks with MACS2 2) Scan the entire genome for a given PWM with STEME with a probability cut-off at 1/1000 3) Select only motifs falling inside ATAC peaks 4) Train msCENTIPEDE based on the top 10000 motifs falling inside ATAC peaks 5) Infer on the whole motif falling inside ATAC peaks

Could you tell me please if I am on the right track ? Or is my approach somehow biased ?

I was then wondering how to define a cut-off using the statistics reported by msCENTIPEDE to focus only on the most likely bound sites ? Could you please provide in the README a short description of the output of msCENTIPEDE and how to use LogPosOdds, LogPriorOdds, MultLikeRatio and NegBinLikeRatio ?

Thanks a lot,

Cheers,

Pef

rajanil commented 8 years ago

Hi @PFRoux, For step (4), I would recommend training the model using the top 10000 motifs from the entire genome (with top motifs selected based on a PWM score). In some previous analyses, I had noticed that training the model using only motifs within ATAC-seq or DNase-seq peaks can lead to odd shapes for the cleavage profiles, and poor accuracy. The inference step (5), however, can be restricted to motifs within ATAC peaks, as you do already.

One suggested set of conditions to select the most likely bound sites is LogPosOdds > 2 AND MultLikeRatio > 1 AND NegBinLikeRatio > 1 .

LogPosOdds is the log of the posterior odds that the TF is bound. (LogPosOdds > 2 is equivalent to the binding posterior > 0.99) MultLikeRatio is the log likelihood ratio for the cleavage profile model. NegBinLikeRatio is the log likelihood ratio for the total chromatin accessibility model. (all logs are base 10)

Hope this helps!

I will add to the README a suggested workflow and descriptions of these quantities. Thanks for the suggestion.

cheers Anil

PFRoux commented 8 years ago

Hi @rajanil,

Thx a lot for your suggestions and for explaining the output of msCENTIPEDE. I'am now trying to train the model using the top 10000 motifs from the entire genome but it seems that the algorithm failed to define initial parameters. Using exactly the same inputs (except the motif set) I get the following output :

call_binding.py --task learn --protocol ATAC_seq ./pwmscan_hg19_22208_31666.bed.gz ../1-NOBLACKLIST_Bam/D0ATAC_Rep1_DEDUP_NOBLACKLIST_SORT.bam ../1-NOBLACKLIST_Bam/D0ATAC_Rep2_DEDUP_NOBLACKLIST_SORT.bam

loading motifs ... num of motif sites = 10000 loading read counts ... transforming data into multiscale representation ... starting model estimation (restart 1) Inf in LogLike initial log likelihood = -inf Nan in Omega Traceback (most recent call last): File "/mount/gensoft2/exe/msCentipede/1.0/bin/call_binding.py", line 345, in main() File "/mount/gensoft2/exe/msCentipede/1.0/bin/call_binding.py", line 338, in main learn_model(options) File "/mount/gensoft2/exe/msCentipede/1.0/bin/call_binding.py", line 104, in learn_model background_counts, options.model, options.log_file, options.restarts, options.mintol) File "mscentipede.pyx", line 1337, in mscentipede.estimate_optimal_model (mscentipede.c:31882) square_EM(data, scores, zeta, pi, tau, \ File "mscentipede.pyx", line 1134, in mscentipede.square_EM (mscentipede.c:28300) EM(data, scores, zeta, pi, tau, alpha, beta, omega, pi_null, tau_null, model) File "mscentipede.pyx", line 1062, in mscentipede.EM (mscentipede.c:27829) omega.update(zeta, alpha) File "mscentipede.pyx", line 682, in mscentipede.Omega.update (mscentipede.c:20451) raise ValueError ValueError

Could you help me please to solve this problem ?

Thanks a lot,

Cheers,

Pef