hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

Default parameters used in Event detection function #130

Closed loganylchen closed 5 months ago

loganylchen commented 1 year ago

Hi there,

I have some questions about the event detection, which I knew was adopted from nanopolish/scrappie. I found the default parameters were different between RNA and DNA event detection.

static detector_param const event_detection_defaults = {.window_length1 = 3,
                                                        .window_length2 = 6,
                                                        .threshold1 = 1.4f,
                                                        .threshold2 = 9.0f,
                                                        .peak_height = 0.2f};

static detector_param const event_detection_rna = {.window_length1 = 7,
                                                   .window_length2 = 14,
                                                   .threshold1 = 2.5f,
                                                   .threshold2 = 9.0f,
                                                   .peak_height = 1.0f};

I could quite understand the reasons: 1. RNA molecules translocate much slower than the DNA (larger window length), 2. RNA signals were more fluctuation (higher threshold and peak_height). Correct me if I were wrong.

I think the settings were quite reasonable. I raised the issue here just to ask for a feature in the "f5c eventalign" function, which I could specify these default parameters. I knew I could fork the repo and modify the default numbers and then compile to binary f5c to do the test, but I think exposing these numbers to the command line args will help me a lot, reducing manual works.

Best, Logan

hasindu2008 commented 1 year ago

Hi

Yeh, these default parameters are coming from nanopolish/scrappie. Do you require these options to be exposed for exploratory purposes or is it for use in a routine pipeline? If it is just for exploratory purposes, I can do a quick hacky implementation to expose those parameters in a separate branch for now. If it is for routine use, will have to spend some time to properly do it and some testing.

These defaults for R9.4 are perhaps not the best for R10.4 data as the signal is slightly different to R9.4. So tuning these parameters in theory could improve the alignments for R10.4 - I cannot find time to optimise these. Are you working on R10.4?

loganylchen commented 1 year ago

Hi @hasindu2008 ,

For now, it is just for some exploratory purposes. It would be very grateful if you help me expose the parameters in a separate branch.

I am now not working on the R10.4, but I will work on it if I find tuning these numbers could help alignments.

Best,

loganylchen commented 1 year ago

I think I will catch you up if I find specifying these parameters is necessary for some different scenarios.

hasindu2008 commented 1 year ago

Will something like _--edparam window_length1,window_length2,threshold1,threshold2,peakheight would do?

loganylchen commented 1 year ago

Sure. It should support both RNA and DNA, right?

hasindu2008 commented 1 year ago

Yes, if you pass --edparam option to eventalign, the defaults will be overridden by what the user provides. I have quickly implemented here. Have not tested so let me know if something is off. https://github.com/hasindu2008/f5c/tree/edparam

Usage: --edparam comma separated list of parameters (window_length1,window_length2,threshold1,threshold2,peak_height) for event detection

loganylchen commented 1 year ago

Hi @hasindu2008 ,

I've tried the branch, and it works well. However, I still encountered an issue, and need your help. As I specified the window_length1 to a smaller number, 3 or 5 for example, many reads will be labeled as qc failed. I knew it was caused by the events_per_base, which indicated the quality of the alignment. But as I set a smaller window size, the event should be over-segmented, and will be more events that represented one base.

So I am asking if I could also specify this 5.0 in the arguments?

In the f5c.c

    if (db->events_per_base[i] > 5.0) {
        //     events[0].clear();
        //     events[1].clear();
        //free(db->event_align_pairs[i]);
        db->read_stat_flag[i] |= FAILED_QUALITY_CHK;
        return;
    }

Best

hasindu2008 commented 1 year ago

Hi, try the newly added --max-events-base

hasindu2008 commented 5 months ago

I will close this issue for now. Feel free to reopen if you have any more questions/requests.