skovaka / uncalled4

MIT License
43 stars 3 forks source link

Event Detector Parameters #25

Closed peiyihe closed 5 months ago

peiyihe commented 5 months ago

Hi Sam,

I go through your code in uncalled4, I'm not very clear about your parameter choose. Because I'm trying to balance the skip errors and stay errors. It seems in current parameter, there are much more skip errors in the event. And I still cannot understand too much about the first five parameters in this EventDetector. Thanks.

EventDetector::PRMS_450BPS = {
        window_length1 : 3,
        window_length2 : 6,
        threshold1     : 1.4,
        threshold2     : 9.0,
        peak_height    : 0.2,
        min_mean       : -200,
        max_mean       : 200
    }
skovaka commented 5 months ago

The event detection algorithm and those parameters were taken from Scrappie. It uses two sets of rolling window t-tests and places event boundaries when either t-statistic reaches a peak above the threshold, so "window_length1" corrasponds to "threshold1". "peak_height" is used by the peak finding algorithm for both tests. I'll include a figure I made to illustrate this below.

For slower sequencing speeds, I just increased the window lengths and thresholds ~proportionally to the increase in expected dwell time. I also measured skip/stay rates on real data and tried to make them approximately equal for different sequencing speeds. You're right that skips are still somewhat frequent. You could try decreasing the thresholds or window lengths to generate more stays and fewer skips, but I haven't developed a systematic way to adjust them.

image

peiyihe commented 5 months ago

Thanks! By the way, may I ask where can I learn this event segmentation carefully like the figure you used? Because it seems that there isn't a paper about scrappie and its event detect.

skovaka commented 5 months ago

I made that figure myself by inserting print statements into the event detection C++ code, which isn't very reproducible and I didn't explore it much further, so that's why it isn't published. I've also experimented with a python rolling t-test implementation, which I'll include below. It only returns t-scores and peaks from one windowed t-test, which isn't as accurate as the consensus between two windows, but it's easier to understand and make plots. The names of "height" and "threshold" parameters are swapped to match scipy's "find_peaks" function, which isn't exactly the same as Scrappie's implementation but produces similar results

import numpy as np
import bottleneck as bn
from scipy.signal import find_peaks

def roll_ttest(sig, win, height=1.4, threshold=0.2, abs=True):                  
    mean1 = np.roll(bn.move_mean(sig, win),-win+1)                                
    mean2 = np.roll(mean1,-win)                                                   
    var1 =  np.roll(bn.move_var(sig, win),-win+1)                                 
    var2 = np.roll(var1, -win)                                                    
    tscore = np.roll((mean1 - mean2) / np.sqrt((var1+var2)/win),win)              
    tscore[:win] = np.nan                                                         
    tscore[-win:] = np.nan                                                        
    if abs:                                                                       
        tscore = np.abs(tscore)                                                   

    peaks,prop = find_peaks(tscore,threshold=threshold,height=height,distance=win)

    return tscore,peaks                                                           
peiyihe commented 5 months ago

Thanks, now it's very clear!