Open elenichri opened 7 years ago
Hi Eleni, based on internal evaluation, the DeepBind models in most cases do not work at all on a genome-wide scale. The negative set used to train these models consisted of shuffled versions of the sequences at ChIP-seq peaks (shuffled so as to preserve dinucleotide frequencies), which is not representative of the ~99.9% of the genome that would make up the negative set for a given TF in a genome-wide prediction. The actual negative set is much larger and more heterogeneous, so models have to take that into account during training to work genome-wide.
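To make the class-imbalance point concrete, here is a rough back-of-the-envelope sketch in Python. The sensitivity and false positive rate below are made-up illustrative numbers, not measurements of DeepBind; only the ~0.1% prevalence comes from the ~99.9% figure above.

```python
def expected_fdr(prevalence, sensitivity, false_positive_rate):
    """Expected false discovery rate, FP / (FP + TP), for a binary classifier.

    prevalence: fraction of candidate sites that are truly bound
    sensitivity: fraction of bound sites the model calls bound
    false_positive_rate: fraction of unbound sites the model calls bound
    """
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    total_calls = true_pos + false_pos
    if total_calls == 0:
        return 0.0  # the model makes no positive calls at all
    return false_pos / total_calls


# Hypothetical model: 80% sensitivity, 1% false positive rate (assumed values).
balanced = expected_fdr(prevalence=0.5, sensitivity=0.8, false_positive_rate=0.01)
genome_wide = expected_fdr(prevalence=0.001, sensitivity=0.8, false_positive_rate=0.01)

print(f"FDR on a balanced test set: {balanced:.3f}")
print(f"FDR genome-wide:            {genome_wide:.3f}")
```

With these assumed numbers, the same model goes from an FDR of roughly 1% on a balanced test set to more than 90% genome-wide, simply because true binding sites are so rare.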
I could recommend some alternatives if you could tell me a little more about the setup of the prediction:
1) What data can you use to impute TFBS? Do you have DNase-seq data in the target cell type, or do you need to predict binding sites from DNA sequence only?
2) Is there a particular false discovery rate (FDR) the predictions need to satisfy, i.e. do they need to limit FDR to something like 10%? 50%?
You might also want to contact the original authors of the DeepBind paper (http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html). I made this repo a while ago when I identified a bug in the DeepBind code, and I put my fixed version of the code here. Since then, I have been in touch with the authors of the paper, and they have incorporated my fix into their codebase.
Thank you both for your replies. It seems I have misunderstood some parts of the algorithm. We only have DNA seq data and no requirement yet for FDR. I will think about it and maybe come back here soon. Best, Eleni
You can use this code to predict whether a TF of interest will bind to a DNA sequence of interest. Let me know if you want help figuring out how to do it.
I think that you can find a more recent version of the code here: http://tools.genes.toronto.edu/deepbind/#
Dear Irene, Thank you for your response. I already got the newest version of the code, but I still have some questions: 1) I have some input sequences that are > 50 bp long. The tool recommends lengths up to 50 bp, but it still accepts 1,000 bp long sequences. Are the predictions reliable for 1,000 bp sequences?
2) Can I give a longer sequence as input, e.g. 2,000 bp, by modifying the C code? Will the prediction make sense then?
Thank you very much for your help.
Best regards, Eleni
Hi Eleni,
Here are my thoughts:
I have not thoroughly tested deepbind on longer sequences. I would recommend asking the authors this question. Depending on how the code is set up, it is possible that it might not take a longer sequence as input.
I am not sure if there is a way to modify the code to take a 2,000bp sequence.
In general, convolutional neural networks assume a consistent input size. If the code can handle sequences of different sizes, it might be worth checking whether it trims all the sequences down to the same size.
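If the model does turn out to need inputs of a fixed size, one common workaround (not something taken from the DeepBind code) is to slide a fixed-length window along the long sequence, score each window, and keep the best score. Here is a minimal sketch; `score_fn` is a hypothetical stand-in for whatever scoring call you end up using, and `gc_fraction` is only a placeholder so that the example runs.

```python
def sliding_windows(sequence, window=50, step=10):
    """Yield (start, subsequence) pairs, each subsequence `window` bases long."""
    if len(sequence) <= window:
        yield 0, sequence  # short input: score it as a single window
        return
    last_start = len(sequence) - window
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + window]
    if last_start % step != 0:
        # The regular steps missed the tail, so add one final window.
        yield last_start, sequence[last_start:]


def score_long_sequence(sequence, score_fn, window=50, step=10):
    """Return (best_score, best_start) over all windows of the sequence."""
    return max(
        (score_fn(sub), start)
        for start, sub in sliding_windows(sequence, window, step)
    )


def gc_fraction(seq):
    """Placeholder scorer (NOT DeepBind): fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)


# Toy 250 bp sequence with a G-rich stretch in the middle.
seq = "A" * 100 + "G" * 50 + "A" * 100
best_score, best_start = score_long_sequence(seq, gc_fraction)
print(best_score, best_start)
```

Taking the maximum over windows is only one way to combine the scores; whether it is appropriate for a given model is a question for the authors.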
Irene
Thank you very much, Irene! I already checked; the code does not cut all the sequences down to the same size. So, I will try contacting the authors about whether the predictions on larger genes are reliable.
Best regards, Eleni
Hi all,
Thanks for this great tool! I have a question about its use, though. Could a user examine the binding specificity of the motifs in your models by querying a sequence longer than 10,000 bp, after changing the upper bound in the script? Will the algorithm work properly in this case?
In general, I was wondering if you have applied DeepBind on a genome-wide scale.
Thank you very much, Eleni