stjude / cis-x

Search for activating regulatory variants in the tumor genome
https://www.stjuderesearch.org/site/lab/zhang/cis-x
Apache License 2.0

Flawed Transcription Factor Binding Analysis #7

Open DarioS opened 4 years ago

DarioS commented 4 years ago

... transcription factor motif analysis was carried out with the FIMO package, with a p-value threshold of 0.001. A total of 614 human transcription factor binding motifs from the HOCOMOCO database were included. Only mutations that could introduce a transcription factor binding motif that was absent from the reference sequence were kept for downstream analysis.

The two most famous non-coding variants are at chr5 1295113 and chr5 1295135 in hg38. It has been experimentally determined that the TERT promoter is bound by GABPA. I took the reference genome sequence including them and 20 bases on either side and found that the reference genome has a FIMO p-value below 0.001 for one of them:

[Image: refGenomeTERT]

I also used FIMO with the hotspot mutation changes incorporated into the sequence.

[Image: TERTmutant]

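Both hotspot mutations have a FIMO p-value below 0.001 in the mutated sequence. Because the reference sequence already has a match below that threshold at one of the two sites, the "absent from the reference sequence" filter discards that mutation, so cis-X throws away one of the two TERT promoter hotspot mutations in every analysis. Has the software been tested to check that it produces sensible results? I don't see any unit tests.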
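To make the failure mode concrete, here is a minimal sketch of the "present in the mutant, absent from the reference" filter as the quoted methods text describes it. This is not the cis-X code: the function names, the mutation labels, and the p-values are hypothetical and chosen only to mirror the situation above, where one reference site already passes the 0.001 threshold.

```python
# Sketch of the filter described in the methods text, NOT the cis-X source.
# All p-values below are made-up illustrations, not real FIMO output.
P_THRESHOLD = 0.001


def has_hit(p_value, threshold=P_THRESHOLD):
    """Return True when a motif match passes the p-value threshold."""
    return p_value < threshold


def keep_mutation(ref_p, alt_p, threshold=P_THRESHOLD):
    """Keep a mutation only if the motif hits in alt and not in ref."""
    return has_hit(alt_p, threshold) and not has_hit(ref_p, threshold)


# Hypothetical labels and values for the two hotspot sites.
mutations = {
    "hotspot_A": {"ref_p": 5e-4, "alt_p": 1e-5},  # reference already passes
    "hotspot_B": {"ref_p": 5e-3, "alt_p": 1e-5},  # reference does not pass
}

kept = {name: keep_mutation(**p) for name, p in mutations.items()}
print(kept)
```

With a hard yes/no cut on the reference, a mutation that strengthens an existing weak match is dropped even when the mutant match is far stronger than the reference one.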

The transcription factor binding analysis also has another statistical flaw.

Keep in mind that if using FIMO to scan a database containing hundreds of motifs, you are going to be facing a multiple testing problem. The q-values reported by FIMO correct for the size of the sequence database, but they do not correct for the number of motifs tested. You might want to apply a Bonferroni correction to the q-values, setting the threshold for statistical significance perhaps to 0.01 divided by number-of-motifs.
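As a rough worked example of that suggestion, the sketch below computes the Bonferroni-adjusted cutoff for the 614 HOCOMOCO motifs mentioned in the methods text, using the 0.01 level from the quote. The function name and the example q-values are hypothetical; the cutoff works out to roughly 1.6e-05.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


N_MOTIFS = 614  # number of HOCOMOCO motifs stated in the methods text
ALPHA = 0.01    # level suggested in the quoted advice

threshold = bonferroni_threshold(ALPHA, N_MOTIFS)
print(f"Adjusted cutoff: {threshold:.3g}")

# Hypothetical q-values, only to show how the cutoff would be applied.
example_q_values = [2e-6, 4e-5, 3e-3]
significant = [q for q in example_q_values if q < threshold]
print(significant)
```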

yliu2014 commented 4 years ago

Hi DarioS,

We have reproduced the problem you presented here. We are implementing a signal-to-noise type of approach, which should solve problems of this kind. We will update as soon as possible. Thank you for bringing this to our attention.

In the current version of the code, we have integrated other criteria into the transcription factor motif prediction besides the statistical evidence. These include the expression level of the predicted transcription factor and the absence of a prediction in the reference sequence; the latter will be updated to consider the signal-to-noise ratio between the ref and alt sequences. We applied this combined approach to control the false discovery rate in the motif analysis.
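The thread does not say how the signal-to-noise ratio will be computed, so the following is only one possible reading, not the planned cis-X implementation. It compares the match strength in the alt and ref sequences on a -log10 p-value scale and keeps a mutation when the alt match is both significant and sufficiently stronger than the ref match. The function names, the `min_log_ratio` cutoff, and the example p-values are all assumptions.

```python
import math


def log_ratio(ref_p, alt_p):
    """Gain in -log10(p) from ref to alt; positive means alt matches better."""
    return math.log10(ref_p) - math.log10(alt_p)


def gains_motif(ref_p, alt_p, alt_threshold=0.001, min_log_ratio=1.0):
    """Keep a mutation if alt is significant and clearly stronger than ref.

    min_log_ratio=1.0 means the alt p-value must be at least 10 times
    smaller than the ref p-value (an arbitrary illustrative cutoff).
    """
    return alt_p < alt_threshold and log_ratio(ref_p, alt_p) >= min_log_ratio


# Hypothetical values: ref already passes 0.001, alt is much stronger.
strengthened = gains_motif(ref_p=5e-4, alt_p=1e-5)
# Hypothetical values: ref and alt are nearly the same strength.
unchanged = gains_motif(ref_p=5e-4, alt_p=4e-4)
print(strengthened, unchanged)
```

Under this kind of rule, a mutation that strengthens an existing weak match is retained, while a site whose match is essentially unchanged by the mutation is still filtered out.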

DarioS commented 4 years ago

Glad it could be reproduced and will be improved. Looking forward to the software update, but first I'm going on holiday for a week.