I've described the project to a few people now, and one question that keeps coming up is as follows:
There is little dispute that higher-order models for predicting TF-DNA binding are more capable than simple PWM models, and there is no shortage of models that provide greater modeling capacity (e.g., DeepBind, DeepSEA, DeeperBind, Basset, DanQ, ...).
Yet there was little immediate interest in another such model whose main advance was better interpretation.
Existing models (AFAIK) each target somewhat different predictions. DeepBind was interested in predicting bound-vs-unbound sequences, and also in determining whether SNPs would change this status (especially with respect to splicing). Basset also ran an in-silico mutagenesis experiment to identify substitutions that would change the predicted binding. DeepSEA did a similar experiment. DeepLIFT does not run such an experiment, but it does compute per-base importance scores.
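To make the in-silico mutagenesis idea concrete for myself, here is a minimal sketch: substitute every base at every position, rescore, and record the change. The scoring function `toy_score` and the toy PWM for "TGA" are made up for illustration; they stand in for whatever trained model is being probed and are not how Basset or DeepSEA actually implement it.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot array."""
    idx = [BASES.index(b) for b in seq]
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), idx] = 1.0
    return out

def toy_score(x, pwm):
    """Hypothetical stand-in for a trained model: best PWM match over all windows."""
    w = pwm.shape[0]
    return max(float(np.sum(x[i:i + w] * pwm)) for i in range(x.shape[0] - w + 1))

def in_silico_mutagenesis(seq, score_fn):
    """Return a (len, 4) array of score changes for every single-base substitution."""
    x = one_hot(seq)
    ref = score_fn(x)
    delta = np.zeros((len(seq), 4))
    for pos in range(len(seq)):
        for b in range(4):
            if x[pos, b] == 1.0:
                continue  # reference base: change is zero by definition
            mut = x.copy()
            mut[pos] = 0.0
            mut[pos, b] = 1.0
            delta[pos, b] = score_fn(mut) - ref
    return delta

# Made-up PWM (columns A, C, G, T) that favours the motif "TGA".
pwm = np.array([
    [-1.0, -1.0, -1.0, 1.0],  # T
    [-1.0, -1.0, 1.0, -1.0],  # G
    [1.0, -1.0, -1.0, -1.0],  # A
])
seq = "CCTGACC"
delta = in_silico_mutagenesis(seq, lambda x: toy_score(x, pwm))

# One simple per-base importance: the largest score drop any substitution causes.
importance = -delta.min(axis=1)
```

In this toy case only the three motif positions get non-zero importance. With a real model, `score_fn` would be the model's forward pass on the one-hot input.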
So why do we need another model which takes a variable length sequence input and tries to predict binding information?
My currently most plausible answer:
Predicting bound versus unbound is relatively well solved, as each of the comparators I mentioned will attest.
But it is usually not the most important problem for people with sequencing data in hand.
Usually, they want to know what their sequencing data means. What is binding? What is changing at a given site between experimental conditions? What parts of a given sequence are most important in determining binding versus not binding? And can we interpret what those more important elements tell us about gene regulation?
This suggests I will want to incorporate a figure that shows that my method provides a rich and interpretable view of sequence decoding, one that is simpler and more precise than FIMO or other comparators.
One simple experiment would be to interpret ChIP-seq data: sample peaks and flanks, and show that I can reliably rank the factor of interest as most likely to be bound in peaks, ahead of other factors, and not in flanks. This can be compared to FIMO.
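A rough sketch of how the peak-versus-flank ranking could be scored. Everything here is a placeholder: the factor names `TF_A` and `TF_B`, their consensus motifs, the sequences, and the `min_score` cutoff are all invented, and the consensus-PWM scanner is only a stand-in for my model (or for a FIMO run); it is not FIMO's algorithm.

```python
import numpy as np

BASES = "ACGT"

def consensus_pwm(consensus, match=1.0, mismatch=-1.0):
    """Build a toy PWM that rewards the consensus base at each position."""
    pwm = np.full((len(consensus), 4), mismatch)
    for j, b in enumerate(consensus):
        pwm[j, BASES.index(b)] = match
    return pwm

def best_match(seq, pwm):
    """Best PWM score over all windows of `seq`."""
    idx = [BASES.index(b) for b in seq]
    w = pwm.shape[0]
    return max(
        sum(pwm[j, idx[i + j]] for j in range(w))
        for i in range(len(seq) - w + 1)
    )

def top1_rate(seqs, scorers, target, min_score):
    """Fraction of sequences where `target` is the sole top-ranked factor
    and its score also clears `min_score`."""
    hits = 0
    for s in seqs:
        scores = {name: fn(s) for name, fn in scorers.items()}
        best = max(scores.values())
        winners = [n for n, v in scores.items() if v == best]
        if winners == [target] and best >= min_score:
            hits += 1
    return hits / len(seqs)

# Hypothetical factors, equal-length motifs so raw scores are comparable.
scorers = {
    "TF_A": lambda s: best_match(s, consensus_pwm("TGACTC")),
    "TF_B": lambda s: best_match(s, consensus_pwm("CACGTG")),
}

# Invented sequences: "peaks" carry the TF_A motif, "flanks" do not.
peaks = ["AATGACTCAA", "GGTGACTCGG"]
flanks = ["AAAAAAAAAA", "GGGGGGGGGG"]

peak_rate = top1_rate(peaks, scorers, "TF_A", min_score=4.0)
flank_rate = top1_rate(flanks, scorers, "TF_A", min_score=4.0)
```

The figure would then compare `peak_rate` (want high) and `flank_rate` (want low) between my method and FIMO on the same sampled regions. The score cutoff matters because a pure ranking always puts some factor first, even in flanks.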
Christina added that we need to figure out better ways to differentiate my own project from Han's work and Meghana's work.
The problem to focus on is attribution of information content to input sequence regions directly from the model, not by looking at bags of k-mers or the like. Use the model to identify which regions are driving decisions, and use in vitro data to offer probable explanations.
Currently she suggests training a language model on k-mers, using this to encode DNA regions, then building a CNN-style model on top of that to decide open vs. closed in different cell types.
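As a first step toward that pipeline, a sketch of the k-mer tokenization that would feed the language model. The choices of `k=3`, stride 1, and a fixed 4^k vocabulary are my assumptions, not something Christina specified, and ambiguous bases such as N are not handled.

```python
from itertools import product

def build_vocab(k):
    """Map every possible k-mer over ACGT to an integer id (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def tokenize(seq, k, stride=1):
    """Split a sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def encode(seq, vocab, k, stride=1):
    """Turn a sequence into the list of integer ids a language model would consume."""
    return [vocab[t] for t in tokenize(seq, k, stride)]

vocab = build_vocab(3)
tokens = tokenize("ACGTAC", 3)
ids = encode("ACGTAC", vocab, 3)
```

The integer ids would go into an embedding layer, and the CNN would sit on top of the resulting per-token vectors.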