hasindu2008 / f5c

Ultra-fast methylation calling and event alignment tool for nanopore sequencing data (supports CUDA acceleration)
https://hasindu2008.github.io/f5c/docs/overview
MIT License
144 stars 26 forks source link

[Request] An option to filter out signal spikes #114

Closed patbohn closed 5 months ago

patbohn commented 2 years ago

Hi Hasindu,

thank you for developing f5c and the detailed documentation with it, it is incredibly fast. I am using it to detect RNA modifications, and in one part of our studies we want to use PORE-cupine Aw et al., which takes Nanopolish eventalign output as input for their SVMs, so f5c already is compatible.

However, there is one step that currently is different from the way f5c behaves, and that is that they modified nanopolish to include a filtering step for raw signal (pA) values below 0 or above 200 pA, the code of which is found in nanopolish_fast5_io.cpp, below line 211 in your copy of it: https://github.com/hasindu2008/f5c/blob/96f7f1c98f0cb48f836142b7c60c89833bc49a25/src/nanopolish_fast5_io.c#L212

The additional code is this

    tmp_nsample=nsample; //added

    // convert to pA
    rawtbl = (raw_table) { nsample, 0, nsample, rawptr };
    raw_unit = scaling.range / scaling.digitisation;
    for (size_t i = 0; i < nsample; i++) {
        rawptr[i] = (rawptr[i] + scaling.offset) * raw_unit;
    }

// filter the outliners in pA
// this is edited by awjga

for (size_t i = nsample-1; i >0 ; i--) {
    //filter sample > 200 pA and less than 0 pA 
    if(rawptr[i]>200 || rawptr[i]<0){
        tmp_nsample=tmp_nsample-1;
        for (size_t dd=i;dd<nsample;dd++){
            rawptr[dd]=rawptr[dd+1];
        }
    }
}
//to update the value of nsample 
nsample=tmp_nsample;
rawtbl = (raw_table) { nsample, 0, nsample, rawptr };

Which as far as I understand removes any value below 0 or above 200. I noticed that the function fast5_get_raw_samples was commented out in your code and you replaced it with a different function, but I have yet been unable to implement this filtering in your code (as I have essentially no experience with C/C++).

Could you help me with implementing this feature either hardcoded (I'm able to compile it easily thanks to your documentation) or as a command line option?

Thanks in advance and best wishes, Patrick

hasindu2008 commented 2 years ago

Hi Patrick,

Thanks for the warm compliments. In the dev branch of f5c, I recently cleaned up the code so it would be a bit cleaner.

That part in f5c where the signal is scaled, happens here, https://github.com/hasindu2008/f5c/blob/3adf279717235ec963e8ee5e48816941bbd8fff4/src/f5c.c#L522. Do you want to try patching the code or I could quickly implement this in a separate branch so that you could test it?

patbohn commented 2 years ago

Thank you for pointing me to the right place, I'm currently trying to implement it myself in this fork:

https://github.com/hasindu2008/f5c/compare/master...patbohn:f5c_filter_signal:master

hasindu2008 commented 2 years ago

Great. Let me know if you are facing any difficulties. I think your approach of replacing the signal is better than skipping it completely; the signal indexes are now consistent with the real raw signal.

One suggestion is, it might be better to implement this as a separate loop after the pA conversion. It is true that then we will be iterating through the signal twice, but then the branch condition if (core->opt.flag & FILTERSIGNAL) can be done outside the loop rather than per-each element. Then pA conversion is likely to be vectorised by the compiler.

Also if the first few elements are >200 or <0 replacing the first element with the next won't do much difference - but it is extremely rare I think.

patbohn commented 2 years ago

Thank you for your input, I see your points and implemented it as a separate loop now. Also I think I fixed the rare issue with not filtering when having multiple events starting from 0 out of signal range by then looking for the first signal that is not out of range.

On a side note, I hope I implemented the parameter option in the header file and argument parser correctly, not quite sure how exactly the definitions in the header file work.

hasindu2008 commented 2 years ago

It seems you have implemented the option flag correctly - I just skimmed through it. Options are simply stored in a 64-bit integer where each bit is a flag. Bit manipulation is done by bitwise and/or operations - I learnt this technique from minimap2's implementation.

hasindu2008 commented 5 months ago

I will close this issue for now. Feel free to reopen if you have any more questions.