hasindu2008 / sigtk

A simple toolkit for manipulating nanopore signal data
MIT License

Scrappie nanopore - squiggle function #3

Closed nguyennhuttin closed 3 months ago

nguyennhuttin commented 2 years ago

Can I ask about the event segmentation from Scrappie that you implemented? What algorithm is it, and can you direct me to any material to understand it? I don't think you have the predict squiggle function I am looking for. That function takes in a sequence of bases and outputs the segment properties for each base (mean, time, variation), whereas event segmentation takes in a raw signal and outputs those properties. Correct me if I am wrong. Any help with understanding how the predict squiggle function works would be appreciated.

hasindu2008 commented 2 years ago

Hi, the event segmentation algorithm simply breaks the signal into segments that are approximately indicative of the bases - some details are there in the f5c publication [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03697-x] and its supplementary [https://ndownloader.figstatic.com/files/24160715].
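To make the idea concrete, here is a rough Python sketch of segmenting a signal by looking for jumps between neighbouring windows. It is a simplified, single-window t-statistic detector written only for illustration; it is not the code used in sigtk or f5c (see the supplementary linked above for the actual details). The function names, the window size and the threshold are all assumptions made up for this example.

```python
import math
import statistics


def tstat_boundaries(signal, window=5, threshold=4.0):
    """Return sample indices where the level appears to change.

    Illustrative only: compares the mean of the `window` samples before
    index i with the mean of the `window` samples from i onwards.
    """
    n = len(signal)
    scores = [0.0] * n
    for i in range(window, n - window + 1):
        left = signal[i - window:i]
        right = signal[i:i + window]
        pooled = (statistics.pvariance(left) + statistics.pvariance(right)) / window
        pooled = max(pooled, 1e-9)  # avoid division by zero on flat windows
        diff = abs(statistics.fmean(right) - statistics.fmean(left))
        scores[i] = diff / math.sqrt(pooled)
    # Keep only local maxima of the score that exceed the threshold.
    return [
        i for i in range(1, n - 1)
        if scores[i] > threshold
        and scores[i] >= scores[i - 1]
        and scores[i] > scores[i + 1]
    ]


def segment_events(signal, window=5, threshold=4.0):
    """Split the signal at detected boundaries and summarise each segment."""
    if not signal:
        return []
    cuts = [0] + tstat_boundaries(signal, window, threshold) + [len(signal)]
    events = []
    for start, end in zip(cuts, cuts[1:]):
        chunk = signal[start:end]
        events.append({
            "start": start,
            "length": end - start,
            "mean": statistics.fmean(chunk),
            "stdv": statistics.pstdev(chunk),
        })
    return events


# Made-up, noise-free signal with three current levels.
toy_signal = [80.0] * 10 + [100.0] * 10 + [90.0] * 10
for event in segment_events(toy_signal):
    print(event)
```

Each event carries a start, a length, a mean and a standard deviation, which is the kind of per-segment summary the question above refers to. On real, noisy data a single fixed threshold like this would likely need tuning.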

The closest that I have to that squiggle prediction is sref. If you provide a sequence of bases in fasta format and call sigtk sref seq.fa, it will predict the mean for each k-mer/base [see https://github.com/hasindu2008/sigtk#synthetic-reference-sref]. I can add the standard deviation easily. The prediction of time is something I haven't looked into much. Is that also necessary?
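As a rough illustration of what a per-k-mer mean prediction looks like, here is a minimal Python sketch. It is not sigtk's code: the table below is a made-up 2-mer model with invented current values (a real pore model uses longer k-mers and trained values), and `kmer_means` is a hypothetical helper name.

```python
from itertools import product

# Hypothetical toy model: every 2-mer mapped to an invented mean current.
TOY_MODEL = {
    "".join(kmer): 70.0 + 2.0 * index
    for index, kmer in enumerate(product("ACGT", repeat=2))
}


def kmer_means(sequence, model, k):
    """Slide a window of length k along the sequence and look up each mean."""
    sequence = sequence.upper()
    return [model[sequence[i:i + k]] for i in range(len(sequence) - k + 1)]


print(kmer_means("ACGT", TOY_MODEL, 2))
```

A sequence of length n gives n - k + 1 values, one per k-mer, and a sequence shorter than k gives an empty list.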

By the way, what is the bigger picture of why you are looking for a signal prediction? If I have a bit more info on this, I might be able to point to some specific information.

nguyennhuttin commented 2 years ago

I took a look at the sref DNA model. It is a table of 6-mers (6-base sequences), each with its own properties (mean, std, time). Could I ask how you created this?

Correct me if I am wrong, but I heard that Scrappie is a machine learning model trained on sequences of up to 49 bases (so the squiggle prediction, when taking in a DNA fasta file, generates properties for each base).

The bigger picture is that I want a machine learning model to generate a squiggle from a given DNA sequence (nucleotide bases). I have been using Scrappie to generate data for my work, so I also want to understand how the Scrappie squiggle prediction function works (how different pore types affect it, and how accurate we can expect the generated squiggle to be).

hasindu2008 commented 2 years ago

These models are from Nanopolish. Nanopolish does HMM-based training to get this pore model, which is compatible with R9.4 pores. sigtk sref just uses this model as a lookup table to give you an ideal model signal (without any noise) for a nucleotide sequence. If you want a signal with noise, I have another tool currently in development that I could share. It uses a classical Gaussian statistical model for the simulation (not a fancy machine learning-based method), but we could successfully basecall the simulated signals, map them, and even call variants. If you are interested, I can look into getting you access to an early alpha release.

I am not entirely sure how scrappie does the prediction. @Psy-Fer has used this feature and could possibly add something about it.
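To give a feel for what a Gaussian simulation on top of a k-mer table could look like, here is a small Python sketch. It is only an assumption-laden illustration, not the tool mentioned above: the 2-mer model, its values, the fixed dwell of a few samples per k-mer and the name `simulate_signal` are all invented for this example.

```python
import random
from itertools import product

# Hypothetical toy model: 2-mer -> (mean, stdv), with invented values.
TOY_MODEL = {
    "".join(kmer): (70.0 + 2.0 * index, 1.5)
    for index, kmer in enumerate(product("ACGT", repeat=2))
}


def simulate_signal(sequence, model, k, dwell=10, seed=None):
    """Emit `dwell` samples per k-mer, each drawn from Normal(mean, stdv)."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    sequence = sequence.upper()
    samples = []
    for i in range(len(sequence) - k + 1):
        mean, stdv = model[sequence[i:i + k]]
        samples.extend(rng.gauss(mean, stdv) for _ in range(dwell))
    return samples


noisy = simulate_signal("ACGTACGT", TOY_MODEL, 2, dwell=5, seed=42)
print(len(noisy), [round(x, 1) for x in noisy[:5]])
```

Setting every stdv to zero reduces this to the noise-free lookup signal. A fixed dwell is a simplification, since real reads spend a variable number of samples on each k-mer.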

Psy-Fer commented 2 years ago

Hello, yes, scrappie uses nanonet models for the squiggle predictions from a fasta input.

You can see some basic info and hints for that here: https://github.com/nanoporetech/scrappie#gotyas-and-notes

Then you can see the models for the different pore types here: https://github.com/nanoporetech/scrappie/tree/master/src/models

So when using scrappie to create a signal, you give it the fasta to predict for and the model to use for the prediction.

You can see how I've done this in the past using the Python API for scrappie inside my MotifSeq.py tool in SquiggleKit. See this function here: https://github.com/Psy-Fer/SquiggleKit/blob/master/MotifSeq.py#L382

Scrappie isn't really working much these days, so we are moving away from it and building our own methods.

James.

nguyennhuttin commented 2 years ago

Hi, thanks for the help. I am not sure about squiggle prediction, but in terms of base-calling Scrappie is completely different from nanonet (they use different architectures: nanonet has 2 LSTMs, Scrappie has 5 GRUs).

How do you know that Scrappie predicts squiggles using the nanonet model? It is only mentioned that "The squiggle prediction is based on Laplace distributed errors."

Could you share any approach for moving away from Scrappie? Thanks.

Psy-Fer commented 2 years ago

Hello,

If you look at the Python API for using scrappie to predict squiggles, you will see it has a set of valid models it can use.

If you look inside one of the few models available for squiggle prediction in the folder of models I provided above, you will see they are nanonet models by looking at the 2nd line.

We have our own squiggle simulator that @hasindu2008 has been developing. But as he says, it's still in early development and we are still testing/benchmarking it.

hasindu2008 commented 1 year ago

@nguyennhuttin I forgot to reply to this, but the tool is https://github.com/hasindu2008/squigulator, which you may find useful for your use case.