loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

TFBScan to scan the whole genome #86

Closed by elia427 3 years ago

elia427 commented 3 years ago

I would like to use TFBScan with default settings to scan the whole genome and compare the result with the BINDetect output, to show how many motif sites BINDetect filters out. However, some motif sites that BINDetect reports in a specific region are missing from the TFBScan output. How can this happen? The default p-value for motif matches is 0.0001. Does BINDetect use a different threshold? I thought BINDetect uses TFBScan with default parameters. Did I miss something?

Thanks.

msbentsen commented 3 years ago

This can happen due to the GC content used for the genome background. In both BINDetect and TFBScan, the GC content is estimated from the input sequences. If you use the whole genome for TFBScan, the estimated GC content will differ from the one estimated in BINDetect, and the scan can therefore return slightly different motif positions.

What you can do is set --gc in TFBScan to the same value that BINDetect used, or use the same input --regions. I expect that should solve the issue, but please let me know if it doesn't, thanks!
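To illustrate why the background matters, here is a minimal sketch that assumes a plain log-odds position weight matrix score and a background defined only by the GC fraction. The toy motif and all function names are hypothetical; this is not TOBIAS's own scanning code, which may differ in detail.

```python
import math
from itertools import product

def gc_fraction(seqs):
    """Fraction of G/C among the unambiguous bases of all input sequences."""
    gc = at = 0
    for seq in seqs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return gc / (gc + at)

def background(gc):
    """Per-base background probabilities implied by a GC fraction."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

def log_odds_score(site, pfm, gc):
    """Sum of log2(motif probability / background probability) over positions."""
    bg = background(gc)
    return sum(math.log2(col[base] / bg[base]) for base, col in zip(site, pfm))

def site_pvalue(site, pfm, gc):
    """Background probability of scoring at least as high as `site`.

    Brute force over all 4^L sequences, so only suitable for short toy motifs.
    """
    bg = background(gc)
    observed = log_odds_score(site, pfm, gc)
    total = 0.0
    for other in product("ACGT", repeat=len(pfm)):
        if log_odds_score(other, pfm, gc) >= observed:
            total += math.prod(bg[base] for base in other)
    return total

# Hypothetical 4 bp motif with consensus GCGC (probabilities per position).
TOY_PFM = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

if __name__ == "__main__":
    # Same site, same motif, two different background GC fractions.
    for gc in (0.38044, 0.4765):
        score = log_odds_score("GCGC", TOY_PFM, gc)
        pval = site_pvalue("GCGC", TOY_PFM, gc)
        print(f"gc={gc}: score={score:.3f}, p-value={pval:.6f}")
```

Because both the score and its p-value depend on the background, a site close to the p-value cutoff can pass under one GC fraction and fail under another.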

elia427 commented 3 years ago

Thanks. The estimated GC content was 47.65% for the peaks and 38.04% (0.38044) for the whole genome. motif_pvalue is set to 0.0001 (default) in BINDetect. Specifying --regions is not an option in my case because it would limit the regions that are scanned. The only difference is the roughly 10 percentage point gap in GC content. I tested TFBScan on one missed motif with the additional --gc 0.4765 parameter, and it is detected now. Should I re-scan the genome with the peak GC content, or leave it as it is and explain that GC content makes a significant difference in which motifs are detected with and without peaks? Thanks again.

msbentsen commented 3 years ago

I think it depends on your application - if you need to compare directly between TFBScan and BINDetect, I would set the GC content to the same value. But if you need to use the TFBS in a more general way, I think it would be okay to explain which GC content was used. It shouldn't make a huge difference in large-scale analysis.
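For the direct comparison, a small sketch like the following could count how many sites the two outputs share. It assumes both tools' results are available as tab-separated BED-like lines with chrom, start, end in columns 1-3 and strand in column 6; that column layout, the function names, and the toy lines are assumptions for illustration, so check them against the actual files.

```python
def read_sites(lines):
    """Collect (chrom, start, end, strand) keys from BED-like lines.

    Assumes at least six tab-separated columns with strand in column 6.
    """
    sites = set()
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        sites.add((cols[0], int(cols[1]), int(cols[2]), cols[5]))
    return sites

def compare_sites(scan_lines, detect_lines):
    """Count sites shared by, or unique to, the two site sets."""
    scan = read_sites(scan_lines)
    detect = read_sites(detect_lines)
    return {
        "shared": len(scan & detect),
        "only_scan": len(scan - detect),
        "only_detect": len(detect - scan),
    }

if __name__ == "__main__":
    # Toy lines standing in for the two output files (hypothetical content).
    scan_lines = [
        "chr1\t100\t110\tMOTIF_A\t8.1\t+\n",
        "chr1\t500\t510\tMOTIF_A\t7.4\t-\n",
    ]
    detect_lines = [
        "chr1\t100\t110\tMOTIF_A\t8.1\t+\n",
        "chr1\t900\t910\tMOTIF_A\t7.0\t+\n",
    ]
    print(compare_sites(scan_lines, detect_lines))
```

A non-zero `only_detect` count is the situation described in this thread and suggests that the two scans did not use the same settings, such as the GC content.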