loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

Minimum size of input peak regions #121

Closed Erhei130 closed 1 year ago

Erhei130 commented 2 years ago

Hello,

Thanks so much for developing TOBIAS, it's an amazing tool! We have applied TOBIAS to several of our research projects and the results seem to make very good biological sense. We have been applying TOBIAS to large sets of peaks that we called with MACS2 and filtered a bit (n = ~100,000 peaks). But we recently noticed that there are published papers using TOBIAS on only ~2000 peaks. So I wonder whether there is a minimum size for the input peak set? For example, if we identified a set of 2000 peaks as having significantly gained accessibility in one condition versus the other, can we use TOBIAS to figure out which TFs bind differentially in those 2000 peaks?

And I think if the input set is only ~2000 peaks, there will be some TFs that have only 5 or 10 motif binding sites, which could potentially result in a big differential binding score. Should these TFs be filtered out?

Thanks in advance!

msbentsen commented 2 years ago

Hi,

Thank you, I am happy that you find value in TOBIAS!

Regarding the number of peaks, this is a very valid concern. TOBIAS will warn if fewer than 1000 values are collected from the background (~1 value collected per 200 bp of input regions), but the results will be more accurate with more peaks. The peaks are needed to correctly estimate the global unbiased background between conditions, so it is better to use the full peak set like you do - but I have not investigated a hard cut-off for the minimum size of the input peak set.
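As a rough illustration of the "~1 value per 200 bp" figure above, the sketch below estimates how many background values a peak set might yield by dividing the total peak length by 200. The function name `estimate_background_values` and the plain division are assumptions made for illustration; TOBIAS's actual sampling may differ, so treat the result as a ballpark only.

```python
def estimate_background_values(bed_lines, bp_per_value=200):
    """Ballpark estimate: total peak length in bp divided by bp_per_value.

    Assumption: mirrors the '~1 value per 200 bp' rule of thumb from the
    thread, not TOBIAS's real sampling code.
    """
    total_bp = 0
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header/comment lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        start, end = int(fields[1]), int(fields[2])
        total_bp += max(0, end - start)
    return total_bp // bp_per_value


# Toy peaks (made up for illustration): 500 + 400 + 600 = 1500 bp in total
peaks = ["chr1\t1000\t1500", "chr1\t3000\t3400", "chr2\t200\t800"]
n_values = estimate_background_values(peaks)
print(n_values, "estimated background values;",
      "below" if n_values < 1000 else "at or above", "the 1000-value warning level")
```

To run it on a real peak file, pass the open file handle instead of the toy list, e.g. `estimate_background_values(open("peaks.bed"))`, where `peaks.bed` is a placeholder path.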

If you want to find differential binding in a subset, you can use the option --output-peaks in TOBIAS BINDetect. This will use the input --peaks for normalizing between conditions, but only the --output-peaks for estimating the binding scores. For this solution, yes, you would have to adjust for low-count TFs and potentially remove those from the results. If you only select peaks up-regulated in one condition, you will also see a very unbalanced volcano-plot (which is expected), but it might be beneficial to choose regions from both directions.
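Removing low-count TFs from the results could look roughly like the sketch below. It works on rows already read into dictionaries (for example with `csv.DictReader`). The column name `total_tfbs`, the `change` column, and the cut-off of 50 sites are all assumptions chosen for illustration; check the header of your own BINDetect results file and pick a threshold that suits your data.

```python
def filter_low_count_tfs(rows, count_key="total_tfbs", min_sites=50):
    """Split result rows into (kept, dropped) by number of binding sites.

    Assumptions: `count_key` names the per-TF site-count column and
    `min_sites` is an arbitrary example threshold, not a TOBIAS default.
    """
    kept, dropped = [], []
    for row in rows:
        if int(row[count_key]) >= min_sites:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


# Toy rows (made up for illustration, not real TOBIAS output)
results = [
    {"name": "TF_A", "total_tfbs": "1200", "change": "0.15"},
    {"name": "TF_B", "total_tfbs": "7", "change": "0.90"},
    {"name": "TF_C", "total_tfbs": "340", "change": "-0.22"},
]
kept, dropped = filter_low_count_tfs(results)
print("kept:", [r["name"] for r in kept])
print("dropped:", [r["name"] for r in dropped])
```

In the toy data, `TF_B` has the largest change but only 7 sites, which is the kind of entry the question above was worried about.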

Hope this answers the question!

Erhei130 commented 2 years ago

Hi,

Thanks for your reply, it's very helpful and makes a lot of sense to me. I just wonder what the effects are of only using peaks in one direction. Is this going to disrupt the model? And another quick question: do you think it is necessary to keep the width of peaks within the range of 200-500 bp, as some motif analysis tools suggest, or does the width of peaks not affect TOBIAS results much?

Thanks again!

msbentsen commented 2 years ago

Good questions! TOBIAS assumes that the background accessibility between the two conditions is the same within the peaks (since these should be the full peaks of both conditions), and thus tries to normalize the footprint scores to the same range (shown in an image attached to the original comment). So if the peaks only contain regions from one direction, the other direction will be "force" normalized to be in the same range, which might affect the output: you might see artificially high scores for some TFs.
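The "force" normalization effect can be seen in a toy example. The sketch below uses a plain quantile normalization as a stand-in, which is an assumption made for illustration: the comment above only says that scores are brought to the same range, and this is not presented as TOBIAS's actual code. The scores are made up so that every region is genuinely higher in condition A.

```python
def quantile_normalize(cond_a, cond_b):
    """Map two equally long score lists onto a shared reference distribution.

    The reference at each rank is the mean of the two sorted lists.
    Stand-in for illustration only, not TOBIAS's implementation.
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions need the same number of scores")
    order_a = sorted(range(len(cond_a)), key=lambda i: cond_a[i])
    order_b = sorted(range(len(cond_b)), key=lambda i: cond_b[i])
    reference = [(cond_a[i] + cond_b[j]) / 2 for i, j in zip(order_a, order_b)]
    norm_a = [0.0] * len(cond_a)
    norm_b = [0.0] * len(cond_b)
    for rank, i in enumerate(order_a):
        norm_a[i] = reference[rank]
    for rank, i in enumerate(order_b):
        norm_b[i] = reference[rank]
    return norm_a, norm_b


# Toy scores: only "up in A" regions were selected
cond_a = [2.0, 3.0, 4.0, 5.0]
cond_b = [1.0, 1.5, 2.0, 2.5]
norm_a, norm_b = quantile_normalize(cond_a, cond_b)

mean_diff_before = sum(cond_a) / len(cond_a) - sum(cond_b) / len(cond_b)
mean_diff_after = sum(norm_a) / len(norm_a) - sum(norm_b) / len(norm_b)
print("mean difference before:", mean_diff_before)
print("mean difference after: ", mean_diff_after)
```

Because both conditions are pushed onto the same distribution, the real global difference between A and B disappears after normalization. This is why a peak set drawn from only one direction can distort the downstream scores.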

For the peak length, I don't think it has a huge influence, but I would maybe have a look at the location of any really large peaks - just to make sure that these are not blacklisted regions or similar. The analysis of global changes of TFs considers the mean change of each TF per peak (the mean changed footprint), so the influence can be that large peaks will "average" out some local effects. But local footprints (given in the _overview-files per TF) are not affected, so I would say it is better to keep the peak length to retain the sites.
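The "averaging out" in large peaks can be shown with a small toy calculation. The sketch below assumes that the per-peak value is a simple arithmetic mean of the footprint changes at a TF's sites within that peak; this reading of the comment above, the function name, and the numbers are all illustrative assumptions.

```python
from statistics import mean


def mean_change_per_peak(site_changes_by_peak):
    """Average the per-site footprint changes of one TF within each peak.

    Assumption: a plain arithmetic mean per peak; peaks without sites
    are skipped.
    """
    return {
        peak: mean(changes)
        for peak, changes in site_changes_by_peak.items()
        if changes
    }


# Toy data: one strongly changed site in each peak, plus three
# unchanged sites in the wide peak
toy_changes = {
    "narrow_peak": [1.0],
    "wide_peak": [1.0, 0.0, 0.0, 0.0],
}
per_peak = mean_change_per_peak(toy_changes)
print(per_peak)
```

The same local change of 1.0 shows up at full strength in the narrow peak but is diluted in the wide one, while the individual site values themselves stay untouched.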