Open TommyJW opened 6 years ago
Hi Tommy,
Could you send me a small sample dataset that reproduces the errors? I can then try it out on Google Colab and track down the source of the issue.
Attached is a notebook that goes through the whole workflow, and a small sample dataset.
In the Build Keras Model cell, try switching between the different models. Which ones work or fail with TF-MoDISco has varied for me depending on the system environment, the dataset used, and how the data is subset. I've also built a randomizing function in another notebook that passes different parameters to the main MoDISco call, and I'm currently analyzing that output for any pattern.
Hi Tommy,
Here's a notebook where I was able to run TF-MoDISco using the model that it was supposed to fail with. The key thing I did was to trim out the zeros from sequences that were shorter than the maximum length (TF-MoDISco can handle variable length sequences): https://gist.github.com/AvantiShri/6428ca274e55c8d242f3429ee9ca42be
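The zero-trimming step described above can be sketched roughly as follows. This is a minimal sketch, assuming the inputs are per-sequence (length, 4) one-hot arrays with matching contribution-score arrays, and that padding appears as all-zero rows; `trim_zero_padding` is a hypothetical helper name, not a function from the linked gist or from TF-MoDISco:

```python
import numpy as np

def trim_zero_padding(onehot_seqs, scores):
    """Trim trailing all-zero (padding) rows from each one-hot sequence
    and its matching importance-score track, so that TF-MoDISco sees the
    true variable-length sequences instead of the padded ones."""
    trimmed_seqs, trimmed_scores = [], []
    for seq, score in zip(onehot_seqs, scores):
        # Positions holding an actual base have a nonzero channel.
        real = np.any(seq != 0, axis=1)
        # End of the real sequence (0 if the whole track is padding).
        last = np.max(np.nonzero(real)[0]) + 1 if real.any() else 0
        trimmed_seqs.append(seq[:last])
        trimmed_scores.append(score[:last])
    return trimmed_seqs, trimmed_scores
```

Because TF-MoDISco accepts variable-length inputs, the trimmed lists can be passed directly without re-padding.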
I also made some tweaks to the parameters so that it produced motifs for both metacluster 0 (negative activity) and metacluster 1 (positive activity). Maybe the results will make more sense to you since you are more familiar with the biology of the problem. The main pattern that jumps out, which you also seemed to find based on your visualizations, is that different segments of the sequence have different GC-content preferences. Beyond that, it's hard to tell what may be real at such a small number of sequences. In general, the patterns that have more seqlets mapping to them are more likely to be real.
Let me know if you have more questions.
Also, it sounds like you are studying lncRNAs. This is obviously a very different kind of dataset from the TF-binding datasets I developed TF-MoDISco on, so if TF-MoDISco makes assumptions that don't apply in these other contexts, I'd be interested to hear about them. (I may not have the bandwidth to work on other applications at this stage, but I'd be happy to give advice on how the algorithm could be tweaked for different purposes.)
I've repeatedly encountered motif discovery failing in Round 2 with a NaN or value-too-high exception.
To isolate the problem, I've tried several models with different layers and layer structures, fitted and scored with DeepLIFT. I've also tried adjusting the default parameters (as noted in the notebook) as well as the parameters used by the example notebook. I have yet to find a pattern in the failures.
Additionally, I've tried different subsets of the same data, the complete dataset, and alternate sequence datasets. The only thing I've noticed is that the smaller subset tends to produce the error less often, but this doesn't hold for the alternate datasets, which are inherently small.
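One sanity check that can be run before the MoDISco call is to scan the DeepLIFT contribution-score tracks for non-finite or extreme values, since those would propagate into the clustering rounds. This is only an illustrative sketch: `check_scores` is a hypothetical helper, and the magnitude threshold is arbitrary rather than anything defined by TF-MoDISco:

```python
import numpy as np

def check_scores(score_tracks, magnitude_limit=1e6):
    """Flag score tracks containing NaN/inf or suspiciously large values.

    score_tracks: iterable of per-sequence contribution-score arrays.
    Returns a list of (index, reason) pairs for tracks that look bad.
    """
    bad = []
    for i, track in enumerate(score_tracks):
        arr = np.asarray(track, dtype=float)
        if not np.isfinite(arr).all():
            bad.append((i, "non-finite values"))
        elif np.abs(arr).max() > magnitude_limit:  # arbitrary cutoff
            bad.append((i, "suspiciously large magnitude"))
    return bad
```

If this reports problems, the issue would lie upstream in scoring rather than in the MoDISco parameters themselves.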
For comparison: I have a HOTAIR dataset, "GSE31332_hotair_oe_peaks", of 832 sequences, which we'll call the full set (longest sequence 2551, shortest sequence 756, padded with 0s). I have subsetted it to 155 sequences, which we'll call the 'small' subset.
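The 0-padding step described here can be sketched as below, assuming the sequences are one-hot (length, 4) arrays; `pad_to_max` is a hypothetical helper name for illustration, not the actual preprocessing code:

```python
import numpy as np

def pad_to_max(onehot_seqs):
    """Right-pad each (length, 4) one-hot array with all-zero rows up to
    the longest sequence in the batch, producing a single dense array
    of shape (n_seqs, max_length, 4)."""
    max_len = max(s.shape[0] for s in onehot_seqs)
    return np.stack([
        np.vstack([s, np.zeros((max_len - s.shape[0], 4))])
        for s in onehot_seqs
    ])
```

Note that this is exactly the padding that had to be trimmed back out before TF-MoDISco, since the all-zero rows are not real bases.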
I'm not sure what information would help identify what I can improve, in preprocessing or in the parameters passed, to avoid the exception.
If it will help I can also bundle a notebook and dataset that includes the whole workflow from classification to motif discovery. I could also provide the raw output from the motif discovery calls.
Thanks