hasindu2008 / sigtk

A simple toolkit for manipulating nanopore signal data
MIT License

Parameter about event detection #10

Closed peiyihe closed 1 month ago

peiyihe commented 3 months ago

Hi, it seems that this event detection is similar to scrappie. I'm a little bit confused about these parameters. It seems that the length of event is much longer than basecalled nucleotide reads? Thank you very much!

typedef struct {
    size_t window_length1;
    size_t window_length2;
    float threshold1;
    float threshold2;
    float peak_height;
} detector_param;

static detector_param const event_detection_defaults = {
    .window_length1 = 3,
    .window_length2 = 6,
    .threshold1 = 1.4f,
    .threshold2 = 9.0f,
    .peak_height = 0.2f
};
hasindu2008 commented 3 months ago

Hello,

Yes, this event detection is indeed from scrappie.

It was not clear to me what you meant by "It seems that the length of event is much longer than basecalled nucleotide reads?" When you say the length of the event, are you referring to window_length1 and window_length2?

peiyihe commented 3 months ago

Thanks for your quick reply!

I'm sorry that I didn't express myself clearly. For one raw signal read, the number of events generated by scrappie seems to be about twice the actual nucleotide length of the read. For example, a read may have 4000 events while its actual length is 2000 bases. I guess the event detector prefers to keep more "stay" errors? I want to generate events that balance stay and skip errors; ideally, the number of events would be similar to the actual length of the read.

But I'm not sure how I should modify those parameters from scrappie. Thanks for your patience!

hasindu2008 commented 3 months ago

I now get what you mean. The parameters of this sigtk event detector are the same as those used in f5c for aligning events to a reference. For this alignment application in f5c, it is preferable to over-segment, as stays can simply be merged after alignment, but skips cannot be. In f5c, a single base corresponds to ~2.5 events on average. So a read that is 2000 bases long would have about 5000 events on average (this can vary from read to read). The f5c publication https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03697-x and, most importantly, the supplementary material https://hasindu2008.github.io/f5c/nanopore_signal_alignment_supplementary_material.pdf have a bit of information about these events.
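As a rough illustration of that arithmetic, here is a minimal sketch in C. The function names and the `EVENTS_PER_BASE_AVG` macro are hypothetical and are not part of sigtk or f5c; the 2.5 figure is only the average quoted above and will differ from read to read.

```c
#include <assert.h>
#include <stddef.h>

/* Average events per base quoted above for f5c; an approximation, not a constant. */
#define EVENTS_PER_BASE_AVG 2.5

/* Rough expected number of events for a read of n_bases (rounded to nearest). */
size_t expected_events(size_t n_bases) {
    return (size_t)((double)n_bases * EVENTS_PER_BASE_AVG + 0.5);
}

/* Observed events-per-base ratio for one read; above 1 suggests over-segmentation. */
double events_per_base(size_t n_events, size_t n_bases) {
    assert(n_bases > 0);
    return (double)n_events / (double)n_bases;
}
```

For the example in this thread, `events_per_base(4000, 2000)` is 2.0, which is in the same range as the ~2.5 average.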

If you want to balance stays and skips, you could have a look at what the sigmap authors are doing: https://academic.oup.com/bioinformatics/article/37/Supplement_1/i477/6319675. I do not remember the exact details now, but from reading the paper and going through their code, I have a vague memory that they mentioned trying to balance stays and skips. I might be wrong though, so could you have a look? Or perhaps @haowenz, who is the author of sigmap, could comment on this.

A few other tools that potentially use segmentation are:

  1. https://github.com/skovaka/uncalled4
  2. https://github.com/CMU-SAFARI/RawHash

You can check if they have done any parameter adjustments.
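For intuition about what a window length and a threshold control, below is a deliberately simplified sketch in C. It is not the scrappie or sigtk implementation: it assumes a single window length, ignores `peak_height`, and simply counts positions where a two-window t-statistic is a local maximum above the threshold. All function names are hypothetical. Under these assumptions, raising the threshold can only keep or reduce the number of boundaries, and therefore the number of events.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of w samples starting at x[start]. */
static float window_mean(const float *x, size_t start, size_t w) {
    float s = 0.0f;
    for (size_t i = 0; i < w; i++) s += x[start + i];
    return s / (float)w;
}

/* Population variance of w samples starting at x[start]. */
static float window_var(const float *x, size_t start, size_t w, float mean) {
    float s = 0.0f;
    for (size_t i = 0; i < w; i++) {
        float d = x[start + i] - mean;
        s += d * d;
    }
    return s / (float)w;
}

/* Squared t-statistic between x[i-w..i-1] and x[i..i+w-1].
 * The square is used so that no square root (and no libm) is needed. */
static float tstat_sq_at(const float *x, size_t i, size_t w) {
    float m1 = window_mean(x, i - w, w);
    float m2 = window_mean(x, i, w);
    float v1 = window_var(x, i - w, w, m1);
    float v2 = window_var(x, i, w, m2);
    float pooled = (v1 + v2) / (float)w;
    if (pooled < 1e-6f) pooled = 1e-6f; /* floor to avoid division by zero */
    return (m1 - m2) * (m1 - m2) / pooled;
}

/* Count positions where the t-statistic is a local maximum above threshold. */
size_t count_boundaries(const float *x, size_t n, size_t w, float threshold) {
    assert(w > 0);
    if (n < 2 * w + 2) return 0; /* too short to test any position */
    float thr_sq = threshold * threshold;
    size_t count = 0;
    for (size_t i = w + 1; i + w < n; i++) {
        float t = tstat_sq_at(x, i, w);
        float prev = tstat_sq_at(x, i - 1, w);
        float next = tstat_sq_at(x, i + 1, w);
        if (t > thr_sq && t >= prev && t > next) count++;
    }
    return count;
}
```

Running `count_boundaries` on the same signal with increasing thresholds and comparing the resulting counts against the basecalled read length is one way to explore the stay/skip trade-off before touching the real detector's parameters.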

peiyihe commented 3 months ago

Thank you very much for your careful explanation. I learned a lot from this!