Possible to select one motifs cluster per TF?

vierstralab / motif-clustering

Clustering motif models to remove redundancy


Possible to select one motifs cluster per TF? #5

Open vitkl opened 2 years ago

vitkl commented 2 years ago

I see many cases where a TF has motifs falling into distinct clusters. Is there a preferred strategy for selecting the likely correct WT motif for each TF and assigning each TF to a single motif cluster?

jvierstra commented 2 years ago

The link between TF genes and motifs is a complicated one. Any motif can be recognized by many TFs -- in fact, Tim Hughes and Matt Weirauch have shown that if the amino acid sequences of two DNA-binding domains are sufficiently similar, the binding sites are identical. I would focus on the compatible TFs for each motif; CIS-BP (http://cisbp.ccbr.utoronto.ca/) does a pretty good job of this. I have the human metadata files for CIS-BP here. The file 'TF_information*.txt' will have that information. The 'DBID' column is the ENSEMBL gene id. See the 'README.txt' file for the definitions of the columns (and additional files).

I hope you find this helpful.
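To illustrate the "compatible TFs for each motif" idea, here is a minimal sketch that groups gene IDs by motif from a CIS-BP-style table. Only the 'DBID' column is named in this thread; the 'Motif_ID' and 'TF_Name' column names, the '.' placeholder for missing motifs, and all rows are assumptions made for the example, so check 'README.txt' before relying on them.

```python
import io
import pandas as pd

# Toy stand-in for a CIS-BP 'TF_information*.txt' file (tab-separated).
# Only 'DBID' is named in the thread; 'Motif_ID' and 'TF_Name' are assumed
# column names, and every row here is made up for illustration.
toy_tf_information = io.StringIO(
    "Motif_ID\tDBID\tTF_Name\n"
    "M0001\tENSG_A\tTF_A\n"
    "M0001\tENSG_B\tTF_B\n"
    "M0002\tENSG_A\tTF_A\n"
    ".\tENSG_C\tTF_C\n"
)

tf_info = pd.read_csv(toy_tf_information, sep="\t")

# Assumption: rows without an assigned motif carry a placeholder such as '.'.
tf_info = tf_info[tf_info["Motif_ID"] != "."]

# For each motif, collect the sorted set of gene IDs listed alongside it.
motif_to_genes = (
    tf_info.groupby("Motif_ID")["DBID"]
    .apply(lambda ids: sorted(set(ids)))
    .to_dict()
)
print(motif_to_genes)
```

With a real file, the `io.StringIO` object would be replaced by the path to the downloaded table.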

vitkl commented 2 years ago

I understand that multiple TFs can recognise the same DNA sequence motif. Here, I am asking about the opposite case, where one TF appears to recognise multiple motifs according to your database. Several motifs associated with one TF can be so different that they fall into different motif clusters. I find this result surprising, so it would be good to understand which motif cluster corresponds to the WT TF (I saw that some studies in your database report very different motifs for mutated versions of a TF protein). It would be good to hear what you think about this problem (one TF -> many distinct motifs).

Thanks for the reference to the CIS-BP mapping - this is very useful.
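The "one TF -> many distinct motifs" case can be quantified with a short sketch like the one below. The table and its column names ('motif_id', 'tf_name', 'cluster') are hypothetical, and the majority-vote assignment is only one possible heuristic, not a strategy endorsed anywhere in this thread.

```python
import pandas as pd

# Made-up motif metadata; 'motif_id', 'tf_name' and 'cluster' are
# hypothetical column names, not taken from the repository's files.
motifs = pd.DataFrame(
    {
        "motif_id": ["m1", "m2", "m3", "m4", "m5"],
        "tf_name": ["TF_A", "TF_A", "TF_A", "TF_B", "TF_B"],
        "cluster": ["C1", "C1", "C2", "C3", "C3"],
    }
)

# Number of distinct clusters each TF's motifs fall into.
clusters_per_tf = motifs.groupby("tf_name")["cluster"].nunique()

# TFs whose motifs are spread over more than one cluster.
multi_cluster_tfs = clusters_per_tf[clusters_per_tf > 1].index.tolist()

# One simple heuristic (an assumption, not a recommendation from the thread):
# assign each TF to the cluster that holds most of its motifs.
majority_cluster = (
    motifs.groupby("tf_name")["cluster"]
    .agg(lambda s: s.value_counts().idxmax())
    .to_dict()
)
```

A majority vote says nothing about which motif is biologically correct, and ties are broken arbitrarily, so the list of multi-cluster TFs is probably the more useful output for manual review.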

jvierstra commented 2 years ago

The annotations that connect TF genes to motifs are messy (I largely lifted those annotations from CIS-BP). The gene names in the "tf_name" field of the metadata file correspond to the TF protein that was used to determine specificity (SELEX, PBM, ChIP-seq, etc.). Note that for some of these methods (e.g., ChIP-seq) the identity of the TF is not certain and the motifs can vary. The mutant TF motifs are from a 2016 paper from Martha Bulyk's group that looked at human SNVs in TF binding domains (Science mag, if I remember correctly...). You could filter those out using the PMID field corresponding to that paper.

Typically I try not to rely on these annotations for analysis because of the problem you laid out. I would be interested in coming up with a better motif-to-TF-gene annotation.

Does that help?
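The PMID-based filtering mentioned above could look roughly like this. The column name 'pmid' and every value in the table are placeholders; in particular, '00000000' is not the real PMID of the 2016 paper and would need to be looked up.

```python
import pandas as pd

# Made-up metadata rows; 'pmid' is a hypothetical column name and
# '00000000' is a placeholder, not the real PMID of the 2016 paper.
metadata = pd.DataFrame(
    {
        "motif_id": ["m1", "m2", "m3"],
        "tf_name": ["TF_A", "TF_A", "TF_B"],
        "pmid": ["11111111", "00000000", "22222222"],
    }
)

MUTANT_STUDY_PMID = "00000000"  # placeholder: look up the actual PMID

# Keep only motifs that do not come from the mutant-TF study.
wild_type_only = metadata[metadata["pmid"] != MUTANT_STUDY_PMID]
print(wild_type_only)
```

PMIDs are kept as strings here so that the comparison does not depend on how the column is parsed.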