lina-usc / pylossless

🧠 EEG Processing pipeline that annotates continuous data
https://pylossless.readthedocs.io/en/latest/
MIT License

Bug in _get_outliers_quantile #95

Closed scott-huberty closed 1 year ago

scott-huberty commented 1 year ago

@christian-oreilly In _get_outliers_quantile (which was split off from the giant marks_array2flags), we subtract the IQR from the median to get l_out and add the IQR to the median to get u_out:

https://github.com/lina-usc/pylossless/blob/2095f13eff76c5abd7729340642b57424511a207/pylossless/pipeline.py#L417-L418

I think we actually need to subtract the lower quantile from the median for l_out, and subtract the median from the upper quantile for u_out, per the original marks_array2flags:

https://github.com/BUCANL/Vised-Marks/blob/3a788555bdc7e2e44810e6c5ae893e70aa01d77c/marks/measure_func/marks_array2flags.m#L140-L141

CC'ing @jadesjardins

Edit: I also think this is more in line with the description provided in the lossless paper, which we can review.

Edit 2: "lower/upper adjacent" was the wrong term, so I changed it.

So something like:

l_dist = mid_val - lower_val
u_dist = upper_val - mid_val
l_out = mid_val - l_dist
u_out = mid_val + u_dist
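
To make the difference concrete, here is a rough sketch comparing the two rules on a skewed toy sample. The function name quantile_bounds, the quantile levels (0.25, 0.5, 0.75), and the scaling factor k are assumptions for illustration only; this is not the pipeline's actual code or signature.

```python
import numpy as np


def quantile_bounds(values, lower=0.25, mid=0.5, upper=0.75, k=1.0):
    """Return (symmetric, asymmetric) outlier bounds for a 1-D array.

    Hypothetical helper: the quantile levels and `k` are illustrative
    assumptions, not the values used in pylossless.
    """
    lower_val, mid_val, upper_val = np.quantile(values, [lower, mid, upper])

    # Symmetric rule: the same spread (the IQR) on both sides of the median.
    iqr = upper_val - lower_val
    symmetric = (mid_val - k * iqr, mid_val + k * iqr)

    # Asymmetric rule: each side uses its own quantile-to-median distance.
    l_dist = mid_val - lower_val
    u_dist = upper_val - mid_val
    asymmetric = (mid_val - k * l_dist, mid_val + k * u_dist)
    return symmetric, asymmetric


values = np.array([1.0, 2.0, 3.0, 10.0, 20.0])  # right-skewed toy sample
symmetric, asymmetric = quantile_bounds(values)
print(symmetric)   # bounds from median -/+ k * IQR
print(asymmetric)  # bounds from median - k * l_dist, median + k * u_dist
```

With k = 1 the asymmetric bounds reduce to the lower and upper quantiles themselves, so the rule only extends past them when a factor greater than 1 is applied. On skewed data the symmetric rule can place the lower bound far below any observed value.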

christian-oreilly commented 1 year ago

The classic non-parametric way to detect outliers (Tukey's method; https://en.wikipedia.org/wiki/Outlier) is Q2 +/- k*(Q3-Q1). I agree that the current code deviates from the original lossless implementation. However, before blindly reproducing what was used in the MATLAB version, I'd like to 1) understand the rationale for using this approach, 2) know whether it is based on a validated approach for outlier detection, and 3) know whether there is an established reference that we can use to support it.
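
For reference, a minimal sketch of the formula as written in this comment, Q2 +/- k*(Q3-Q1), applied as a boolean outlier mask. The function name flag_outliers_median_iqr and the default k = 1.5 are assumptions chosen for illustration, not values taken from pylossless or the MATLAB code.

```python
import numpy as np


def flag_outliers_median_iqr(values, k=1.5):
    """Flag values outside Q2 +/- k * (Q3 - Q1).

    Hypothetical helper; `k=1.5` is an illustrative default only.
    """
    q1, q2, q3 = np.quantile(values, [0.25, 0.5, 0.75])
    spread = k * (q3 - q1)
    # True where a value falls below the lower bound or above the upper bound.
    return (values < q2 - spread) | (values > q2 + spread)


values = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
mask = flag_outliers_median_iqr(values)
print(values[mask])  # values flagged as outliers under this rule
```

Because both bounds are centered on the median and use the full IQR, the rule treats the two tails identically regardless of skew, which is the behavior the questions above ask to justify or replace.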