greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

Imputation for transcription factor binding predictions based on deep learning #258

Open agitter opened 7 years ago

agitter commented 7 years ago

http://doi.org/10.1371/journal.pcbi.1005403

Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered as the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding.
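The core idea in the abstract — a shared model that borrows information across TFs and cell lines so unassayed combinations can still be scored — can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual TFImpute architecture: the dimensions, the elementwise combination of embeddings, and the scoring function are all assumptions.

```python
import numpy as np

# Hypothetical sketch of a multi-task imputation setup: a shared
# sequence representation is combined with learned TF and cell-line
# embeddings, so a score can be produced even for (TF, cell line)
# pairs that were never assayed together. All dimensions and the
# scoring function are illustrative, not TFImpute's real design.

rng = np.random.default_rng(0)

n_tfs, n_cell_lines, emb_dim, seq_dim = 5, 3, 8, 16

tf_emb = rng.normal(size=(n_tfs, emb_dim))           # one vector per TF
cell_emb = rng.normal(size=(n_cell_lines, emb_dim))  # one vector per cell line
W_seq = rng.normal(size=(seq_dim, emb_dim))          # shared sequence projection

def predict_binding(seq_features, tf_idx, cell_idx):
    """Score one sequence for one (TF, cell line) combination."""
    h = np.tanh(seq_features @ W_seq)                # shared sequence features
    context = tf_emb[tf_idx] * cell_emb[cell_idx]    # combine TF and cell line
    logit = h @ context
    return 1.0 / (1.0 + np.exp(-logit))              # binding probability

seq = rng.normal(size=seq_dim)
# The same sequence can be scored for a TF/cell-line pair that has no
# ChIP-seq data, because both factors have learned embeddings:
p = predict_binding(seq, tf_idx=2, cell_idx=1)
print(round(float(p), 3))
```

Because the embeddings are shared across all tasks, training on the ~4% of observed TF-cell line combinations can, in principle, constrain the scores for the unobserved ones.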

@jacklanchantin can you please look at this for #236?

One minor comment is that they say

Because the released DeepBind software package does not contain the training step, we could not train it on our dataset.

but I believe the DeepBind (#11) code here does include the training step. It's just hard to find from the main page.

jacklanchantin commented 7 years ago

@agitter thanks for posting this. I actually don't think the code they provide for DeepBind includes the model construction. They include pre-trained models, which you can use to score new sequences. But the DeepBind model is very straightforward; I don't know why they didn't just code it themselves.
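For context on "very straightforward": the published DeepBind design is essentially a convolutional motif scan, rectification, max pooling over positions, and a small output network. A minimal numpy sketch of that forward pass might look like the following — the filter weights here are random stand-ins, whereas a real model learns them from ChIP-seq peaks, and the exact layer details are simplified.

```python
import numpy as np

# Minimal sketch of a DeepBind-style forward pass: one-hot encode a
# DNA sequence, scan it with a few convolutional filters ("motif
# detectors"), rectify, max-pool over positions, and apply a linear
# output layer. Weights are random placeholders, not trained values.

rng = np.random.default_rng(0)
BASES = "ACGT"

def one_hot(seq):
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def deepbind_score(seq, filters, w, b):
    x = one_hot(seq)                       # (length, 4)
    n_filters, width, _ = filters.shape
    acts = np.zeros((len(seq) - width + 1, n_filters))
    for f in range(n_filters):             # convolution: slide each motif
        for pos in range(acts.shape[0]):
            acts[pos, f] = np.sum(x[pos:pos + width] * filters[f])
    acts = np.maximum(acts, 0.0)           # rectification (ReLU)
    pooled = acts.max(axis=0)              # max over positions, per filter
    return float(pooled @ w + b)           # linear output layer

filters = rng.normal(size=(4, 6, 4))       # 4 motif detectors of width 6
w, b = rng.normal(size=4), 0.0
print(deepbind_score("ACGTACGTACGTACGT", filters, w, b))
```

The forward pass really is this simple; as the discussion below notes, the hard-to-reproduce part is the training and hyperparameter-tuning procedure, not the architecture.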

agitter commented 7 years ago

It's not important enough for me to download and check their code, but I saw text in the DeepBind README like

Train TF models on ENCODE ChIP-seq peaks, then test on held-out subset of peaks. See supplementary information if training/testing set is not clear from descriptions below. Use top 500 even to train, top 500 odd to test: python deepbind_train_encode.py top calib,train,test,report

I believe there are two versions of the code, only one of which supports training.

knowledgefold commented 7 years ago

I'm the corresponding author of this work. Your responses are really fast. For DeepBind, we may have missed the training code. We actually implemented a simple version of DeepBind ourselves; however, the tricky part is the hyperparameter tuning. We could not guarantee that we would reproduce the whole process reported in the original DeepBind paper. Therefore, we decided to run TFImpute directly on the data used by DeepBind. We believe that is a fair comparison.
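To illustrate why tuning, rather than the model itself, is the hard part to reproduce: even a simple random search over a few hyperparameters defines a large space, and a paper's text rarely pins down the exact search procedure. The sketch below is purely illustrative — the search space and the validation objective are invented stand-ins, not anything from the DeepBind or TFImpute papers.

```python
import random

# Illustrative random hyperparameter search. The search space and the
# validation_score function are hypothetical stand-ins; a real run
# would train a model for each configuration and score it on a
# held-out set of ChIP-seq peaks.

random.seed(0)

search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "motif_width": [8, 12, 16, 24],
    "n_filters": [16, 32, 64],
}

def validation_score(cfg):
    # Stand-in for "train with cfg, then evaluate on held-out peaks".
    return -abs(cfg["learning_rate"] - 1e-3) - abs(cfg["motif_width"] - 16) / 100

best_cfg, best_score = None, float("-inf")
for _ in range(20):
    cfg = {key: random.choice(vals) for key, vals in search_space.items()}
    score = validation_score(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg)
```

Two groups running the "same" search with different spaces, budgets, or seeds can land on different models, which is why comparing on identical data, as done here, sidesteps the issue.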

agitter commented 7 years ago

@knowledgefold Thanks for clarifying. I think that makes a lot of sense. I also found it hard to find the DeepBind training code. Last year I posted that it wasn't available, and one of the authors corrected me by providing the link above.

We'll (eventually) discuss your paper here for our review, so please come back with comments if you have anything else to add.