hasindu2008 / squigulator

a tool for simulating nanopore raw signal data
https://hasindu2008.github.io/squigulator
MIT License
58 stars 3 forks source link

Questions about event #17

Closed peiyihe closed 2 months ago

peiyihe commented 2 months ago

Hi,

I'm wondering can I get the event value from this tool? It seems it outputs the raw signal directly.

Thank you very much!

hasindu2008 commented 2 months ago

You could run my sigtk event command on the BLOW5 file generated from the squigulator.

peiyihe commented 2 months ago

Thanks! I would try it!

hasindu2008 commented 2 months ago

Let me know if you need some help with it, happy to help @Happy0430room.

peiyihe commented 2 months ago

By the way, is this event detector suitable for both R9.4 and R10.4?

hasindu2008 commented 2 months ago

Depends on the actual application of the event detection output. In my f5c alignment tool for instance, it is the same event detector with same parameters used for both types of data (equivalent to the one in sigtk I just shared, as far as I remember). What is your intended use?

peiyihe commented 2 months ago

I'm mainly trying to map the raw signal to the reference, just similar to the UNCALLED, mainly for hardware acceleration. For R9.4 chemistry, in my new method (for hardware), If one event value is 86.486336 (for example), which is also the value of k-mer 'AAAAAA' , I would regard this event is similar to k-mer 'AAAAAA'. It works well when I'm trying R9.4 dataset.

But when I'm trying R10.4 model,the k-mer model seems have been normalized, mean value of kmer model is 0, std is 1. It works not very well. Maybe the event detecor is not suitable for R10.4 I guess.

hasindu2008 commented 2 months ago

Actually, it is not the problem with the event detector, it is because a different normalisation scheme has been used in ONT's new model. I have converted this model to be close to the pA range when I use it in squigulator as well as in f5c and sigfish for alignment. f5c uses adaptive banded event alignment accelerated on GPU [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03697-x] where as sigfish uses dtw which we accelerated using an FPGA sometime ago [https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/giad046/7217084]. The built-in R10 models are hardcoded in the model.h, I can find the tsv version of this (have to search through my files) and give it to you tomorrow if you want.

peiyihe commented 2 months ago

Sorry to bother you for these simple questions, I'm still not very clear about how to convert the y axis value in the picture to pA, using the ONT api directly? And also I notice in f5c's R10 models, the std for every k-mer is relatively larger than R9, I'm a little bit confused, because I think R10 should be more precise than R9, so the std for every k-mer should be smaller? But it seems squigulator's R10 model has smaller std. These two R10 model seems very different. Is it influenced by the frequency?

Thank you very much! image

hasindu2008 commented 2 months ago

OK, here is the model in text format, the one I use in f5c (as of version 1.4) rna004_130bps.u_to_t_rna.9mer.template.model.txt

I will find the one I use in squigulator and attach in the next message - still searching for it.

Those pictures have the raw sample values which you should convert using the relavent fields in the signal file (e.g. BLOW5 file) as explained in here - see topic Primary fields, but I highly recommend you go through the whole SLOW5 specification as it explains a lot of stuff about the raw signals. Which language are you going to use - C or Python? When I use C, I simply use slow5lib like this https://github.com/hasindu2008/slow5lib/blob/e0d0d0f3da18374519b60924850c9e7f900a6bb3/examples/sequential_read.c#L28. If it is Python, you can use use the pA=true argument in relavent functions, see https://hasindu2008.github.io/slow5lib/pyslow5_api/pyslow5.html. I have not recently used the ONT's API, so I am not sure. We developed our own ecosystem called SLOW5 as ONT kept chnaging the structure of their file formats and we wanted something that is more backward compatible and few dependencies. Since SLOW5, I have not used ONT's formats, I simply convert them to SLOW5 and use that.

That r10 having a higher std than r9 is indeed a very good observation - the explanation is a bit long and hard to explain, I will answer it later when I get some time to write it down properly.

peiyihe commented 2 months ago

Thanks for your careful explaination, I would all the files you recommend. I observe that in the old r10 model, the std is large, but the newlest model in R10 with 5khz, it seems small, just like R9.4. I'm wondering how these models are generated, and why this two models are so different. Thanks!

hasindu2008 commented 2 months ago

OK here are the models:

f5c and squigulator DNA R9.4.1 model: https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_450bps.nucleotide.6mer.template.model

f5c and squigulator RNA R9.4.1 model: https://github.com/hasindu2008/f5c/blob/master/test/r9-models/r9.4_70bps.u_to_t_rna.5mer.template.model

f5c DNA R10.4.1 model f5c1.2_r10.4_400bps.nucleotide.9mer.template.model.txt

squigulator DNA R10.4.1 model sq0.3.0_r10.4_400bps.nucleotide.9mer.template.model.txt

f5c RNA004 model f5c1.4_rna004_130bps.u_to_t_rna.9mer.template.model.txt

squigulator RNA004 model sq0.3.0_rna004_130bps.u_to_t_rna.9mer.template.model.txt

DNA and RNA R9 models are directly from nanopolish, nanopolish adapted them from ONT's k-mer models I think.

How I got the f5c R10.4.1 model is explained in detail https://hasindu2008.github.io/f5c/docs/r10train. Could you read and see if you understand? In summary, I trained for 15 rounds and got the last one as the number of alignments was the best. You would observe that the standard deviation in the first rounds is low and then increase in the later rounds. I didn't look too much into the algorithm behind the nanopolish train and did not have a lot of time to spend. So there should be a lot of room for model improvement. And this was trained using R10.4.1 4KHz data - we did not sequence control samples for 5KHz. Nevertheless, the model trained on 4KHz still seems to be able to align 5KHz well, even though the results may not be optimal.

But the best model for f5c alignment did not yield the best results for signal simulation in squigulator. So I created a custom model by combining the mean values from the original pore model and the standard deviation from the 1st round of training. And again, model is based on 4KHz data, but used for 5KHz as well. So there is lots of room for improvement. @hiruna72 has been experimenting with his own model training methods which he could perhaps explain.

About your question - I observe that in the old r10 model, the std is large, but in the newest model in R10 with 5khz, it seems small, just like R9.4. I think you mistakenly thought f5c model is old and squigulator model is new. Above explanation hopefully clears it.

Why f5c standard deviations are large - the jumps in the signal when going from one event to another are included in the std computation as far as I remember (need to go into the code and double-check as it has been a while ago). This naturally overestimates the standard deviation for noise, which is not ideal for squigulator.

RNA004 was obtained in a similar fashion to R10.4.1 explained above.

It is a bit hard to explain using words, I know it wouldn't be 100% clear. But if you ask more questions I can try to clarify.

peiyihe commented 2 months ago

Thank you very much for your careful explaination, I think I can understant most of these information. I would read more carefully about your recommended files! Thanks!