loosolab / TOBIAS

Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
MIT License

Overrepresentation of TFs with homopolymeric binding motifs #56

Closed pblpez closed 3 years ago

pblpez commented 3 years ago

Hello msbentsen and thank you for creating TOBIAS. I have some questions coming from a problem:

I have run all TOBIAS steps starting from ATACorrect, created a few networks without any problem, and visualized them in Cytoscape. When I rank the nodes (TFs) of my network by different metrics (degree, betweenness, closeness), I observe a clear overrepresentation of TFs with homopolymeric binding motifs (especially G/C-polymeric motifs). As a result, the transcription factors with the highest number of footprints are ones with low expression levels according to the expression experiments performed in our lab. What they have in common is that their binding motifs are homopolymeric.

I was wondering whether this is a typical problem and whether there is a way to establish a metric to handle those noisy results, which leads me to my second question.

I have seen in the BINDetect results a metric called "TFBS_score". You defined it in a previous issue as follows: "The TFBS_score is the score of the motif match to the underlying sequence, and is therefore independent of the footprinting. The higher the score, the better the sequence matched to the input motif." I thought it could be the metric I need to filter the noisy results, but I could not find in the documentation how it is calculated. Moreover, I do not fully understand its values, as some TFBS_score values are negative. How is the score calculated? What does a negative value mean? Could I establish a threshold on this score to help me filter those overrepresented TFs, or is there already an internal threshold in BINDetect that makes every match a significant one?

Thank you for your help!

msbentsen commented 3 years ago

Hi!

I think there are a couple of questions here, which I will try to answer:

The thing is that when I rank the nodes (TF) of my network following different metrics (degree, betweenness, closeness) I observe a clear overrepresentation of TF with homopolymeric binding motifs (especially those G/C polymeric motifs). In this manner, the top transcription factors with the highest number of footprints correspond to TF with low expression levels following the expression experiments performed in our lab.

I am not exactly sure what is meant by homopolymeric binding motifs, but the enrichment of GC-rich motifs can arise from the sequence background of promoters/enhancers, which (at least in human/mouse) have a higher GC content than the rest of the genome. This means that GC-rich motifs will have more binding sites in these regions than the motifs of other TFs. Part of this enrichment likely reflects real biology rather than an artifact, so the effect is difficult to correct for.
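To make the background effect concrete, here is a minimal sketch that estimates how often a consensus sequence would match by chance in regions with different GC content. It assumes independent bases, exact consensus matching, and illustrative GC fractions (0.6 for promoter-like regions, 0.4 genome-wide); the function name `match_probability` and both example sequences are hypothetical, and this is not how TOBIAS or MOODS scan for motifs.

```python
def match_probability(consensus, gc_fraction):
    """Chance that a random position matches `consensus` exactly,
    assuming independent bases with the given GC fraction."""
    p_gc = gc_fraction / 2        # probability of G (same for C)
    p_at = (1 - gc_fraction) / 2  # probability of A (same for T)
    prob = 1.0
    for base in consensus.upper():
        prob *= p_gc if base in "GC" else p_at
    return prob


# Illustrative values only (assumptions, not measured data)
PROMOTER_GC = 0.6
GENOME_GC = 0.4

for consensus in ("GGGCGG", "TATAAA"):
    ratio = (match_probability(consensus, PROMOTER_GC)
             / match_probability(consensus, GENOME_GC))
    print(f"{consensus}: {ratio:.2f}x chance matches in GC-rich regions")
```

Under these assumptions the GC-only consensus gets many more chance matches in the GC-rich background than in the genome-wide one, while the AT-only consensus gets fewer, which is the same direction as the enrichment described above.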

Regarding expression, this is unfortunately a common problem caused by high similarity between TF motifs. Within transcription factor families (such as homeodomains, bZIP etc.), it is sometimes hard to distinguish between the individual TFs in the footprinting analysis, so a footprint assigned to one TF might actually be driven by another TF with a similar motif. For that reason, footprints are not always easy to match with expression.


I have seen in the BINDetect results a metric called "TFBS_score". You defined it in a previous issue as follows: "The TFBS_score is the score of the motif match to the underlying sequence, and is therefore independent of the footprinting. The higher the score, the better the sequence matched to the input motif."

Could I establish a threshold for this score to help me filter those overrepresented TF, or is there already an internal threshold when performing BINDetect making every match a significant one?

This score is calculated from the match of the motif PSSM to the sequence. The actual scoring is performed by an external Python module (MOODS; scanner setup in the code here), so I would refer you to the MOODS source code for the exact score. I don't know about the negative scores; I have never seen that before. I guess it would happen if the sequence fits the motif poorly, but the score was still high enough to reach the threshold.
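As a rough illustration of how a PSSM match score can end up negative, here is a minimal sketch of generic log-odds scoring. It is not the MOODS implementation: the uniform background, the pseudocount, the log base, and the toy motif are all assumptions, and `pssm_from_ppm` and `score_site` are hypothetical helper names.

```python
import math

# Assumed uniform background; real scanners may use other frequencies
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def pssm_from_ppm(ppm, background=BACKGROUND, pseudocount=0.01):
    """Convert per-position base probabilities into log2-odds scores."""
    pssm = []
    for pos in ppm:
        total = sum(pos.values()) + 4 * pseudocount
        pssm.append({
            base: math.log2(((pos[base] + pseudocount) / total) / background[base])
            for base in "ACGT"
        })
    return pssm


def score_site(pssm, site):
    """Sum the log-odds of the observed base at every motif position."""
    if len(site) != len(pssm):
        raise ValueError("site and motif must have the same length")
    return sum(col[base] for col, base in zip(pssm, site.upper()))


# Hypothetical G-rich toy motif, not taken from any database
toy_ppm = [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 4
toy_pssm = pssm_from_ppm(toy_ppm)

print(score_site(toy_pssm, "GGGG"))  # every base favoured by the motif
print(score_site(toy_pssm, "ATAT"))  # every base disfavoured by the motif
```

In a log-odds scheme like this, each base that is less likely under the motif than under the background contributes a negative term, so a site with enough mismatches gets a negative total. Whether such a site is still reported depends on where the p-value threshold falls for that motif.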

For the threshold, there is already an internal threshold built in, which limits the potential binding sites for each TF (controlled by the option --motif-pvalue in TOBIAS BINDetect). However, if you are finding a lot of spurious binding sites, you could apply your own threshold after the run to filter your sites. You could also filter on the raw footprint score to ensure that the binding sites actually have signal. Unfortunately, I can't tell you exactly what will work, but I hope you can solve your issue with a bit of filtering!
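A post-run filter along these lines could look like the sketch below. It assumes a tab-separated table with one row per binding site; only the `TFBS_score` column name comes from this thread, while `TFBS_name`, `footprint_score`, the toy rows, and both cutoffs are hypothetical placeholders that would need to be adapted to the actual BINDetect output.

```python
import csv
import io

# Toy table standing in for a per-site results file (hypothetical values)
TOY_TABLE = (
    "TFBS_name\tTFBS_score\tfootprint_score\n"
    "TF_A\t12.4\t0.80\n"
    "TF_A\t-1.3\t0.75\n"
    "TF_B\t9.1\t0.05\n"
    "TF_B\t10.2\t0.60\n"
)


def filter_sites(handle, min_tfbs_score, min_footprint_score):
    """Keep sites that pass both the motif-match and footprint cutoffs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if float(row["TFBS_score"]) >= min_tfbs_score
        and float(row["footprint_score"]) >= min_footprint_score
    ]


# Cutoffs are arbitrary examples; suitable values depend on the data
kept = filter_sites(io.StringIO(TOY_TABLE),
                    min_tfbs_score=8.0,
                    min_footprint_score=0.5)
for row in kept:
    print(row["TFBS_name"], row["TFBS_score"], row["footprint_score"])
```

For a real file, the `io.StringIO(...)` argument would be replaced by an opened file handle. Because motif scores scale with motif length and information content, a per-TF cutoff may be more appropriate than one global value.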

Best regards, Mette