greenelab / deep-review

A collaboratively written review paper on deep learning, genomics, and precision medicine
https://greenelab.github.io/deep-review/

DeepSimulator: a deep simulator for Nanopore sequencing #744

Open agitter opened 6 years ago

agitter commented 6 years ago

https://doi.org/10.1101/238683

Motivation: Oxford Nanopore sequencing is a rapidly developed sequencing technology in recent years. To keep pace with the explosion of the downstream data analytical tools, a versatile Nanopore sequencing simulator is needed to complement the experimental data as well as to benchmark those newly developed tools. However, all the currently available simulators are based on simple statistics of the produced reads, which have difficulty in capturing the complex nature of the Nanopore sequencing procedure, the main task of which is the generation of raw electrical current signals. Results: Here we propose a deep learning based simulator, DeepSimulator, to mimic the entire pipeline of Nanopore sequencing. Starting from a given reference genome or assembled contigs, we simulate the electrical current signals by a context-dependent deep learning model, followed by a base-calling procedure to yield simulated reads. This workflow mimics the sequencing procedure more naturally. The thorough experiments performed across four species show that the signals generated by our context-dependent model are more similar to the experimentally obtained signals than the ones generated by the official context-independent pore model. In terms of the simulated reads, we provide a parameter interface to users so that they can obtain the reads with different accuracies ranging from 83% to 97%. The reads generated by the default parameter have almost the same properties as the real data. Two case studies demonstrate the application of DeepSimulator to benefit the development of tools in de novo assembly and in low coverage SNP detection. Availability: The software can be accessed freely at: https://github.com/lykaust15/deep_simulator.

@lykaust15 is the repository still private? I wasn't able to access it at the link above.

liyu95 commented 6 years ago

Thank you very much for finding the problem! Please refer to: https://github.com/lykaust15/DeepSimulator

I will update the link on bioRxiv as well.

souravsingh commented 6 years ago

@agitter I can work on adding this to the paper.

liyu95 commented 6 years ago

@agitter @souravsingh There are two other research papers from our group that use deep learning methods to solve bioinformatics problems. In case you are interested, I have put the links and short introductions here.

DEEPre: sequence-based enzyme EC number prediction by deep learning: https://www.ncbi.nlm.nih.gov/pubmed/29069344

MOTIVATION: Annotation of enzyme function has a broad range of applications, such as metagenomics, industrial biotechnology, and diagnosis of enzyme deficiency-caused diseases. However, the time and resource required make it prohibitively expensive to experimentally determine the function of every enzyme. Therefore, computational enzyme function prediction has become increasingly important. In this paper, we develop such an approach, determining the enzyme function by predicting the Enzyme Commission number. RESULTS: We propose an end-to-end feature selection and classification model training approach, as well as an automatic and robust feature dimensionality uniformization method, DEEPre, in the field of enzyme function prediction. Instead of extracting manually crafted features from enzyme sequences, our model takes the raw sequence encoding as inputs, extracting convolutional and sequential features from the raw encoding based on the classification result to directly improve the prediction performance. The thorough cross-fold validation experiments conducted on two large-scale datasets show that DEEPre improves the prediction performance over the previous state-of-the-art methods. In addition, our server outperforms five other servers in determining the main class of enzymes on a separate low-homology dataset. Two case studies demonstrate DEEPre's ability to capture the functional difference of enzyme isoforms. AVAILABILITY: The server could be accessed freely at http://www.cbrc.kaust.edu.sa/DEEPre.

Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape: https://www.ncbi.nlm.nih.gov/pubmed/28961686

MOTIVATION: An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem. RESULTS: Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods. AVAILABILITY AND IMPLEMENTATION: Our program is freely available at https://github.com/ramzan1990/sequence2vec.

agitter commented 6 years ago

@souravsingh we are focusing on addressing the reviews this week so we can finalize a new version of the paper. We will not have time to review and merge new pull requests that aren't directly related to those. There are still several areas in #678 where we need help if you'd like to join in. I'm creating these new issues for future discussion more so than the next version of the manuscript.

Thanks @lykaust15, those also look interesting. We've been creating issues for each individual paper. If you'd like to discuss those, can you please create a new issue for each with the paper title as the issue title?