Closed scott-huberty closed 1 year ago
The classic non-parametric way to detect outliers (Tukey's method; https://en.wikipedia.org/wiki/Outlier) is Q2 +/- k*(Q3-Q1). I agree that the current code deviates from the original lossless implementation. However, before blindly reproducing what was used in the Matlab version, I'd like to 1) understand the rationale for using this approach, 2) know whether it is based on a validated approach for outlier detection, and 3) know whether there is an established reference that we can use to support it.
@christian-oreilly In `_get_outliers_quantile` (which was split off from the giant `marks_array2flags`), we subtract or add the IQR from the median to get `l_out` and `u_out`: https://github.com/lina-usc/pylossless/blob/2095f13eff76c5abd7729340642b57424511a207/pylossless/pipeline.py#L417-L418
I think we actually need to be subtracting the lower quantile from the median for `l_out`, and subtracting the median from the upper quantile for `u_out`, per the original `marks_array2flags`: https://github.com/BUCANL/Vised-Marks/blob/3a788555bdc7e2e44810e6c5ae893e70aa01d77c/marks/measure_func/marks_array2flags.m#L140-L141
CC'ing @jadesjardins
Edit: I also think this is more in line with the description provided in the lossless paper, which we can review. Edit 2: lower/upper adjacent was the wrong term, so I changed it.
So something like:
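A minimal sketch of the two variants discussed here, written with NumPy rather than taken from pylossless or the Matlab code. The function name `quantile_bounds` and the parameters `k`, `lower`, `upper`, and `asymmetric` are hypothetical and chosen only for illustration. It also assumes that each distance is scaled by `k` and then offset from the median to give `l_out` and `u_out`, which is my reading of the proposal and may not match the original exactly.

```python
import numpy as np


def quantile_bounds(data, k=3.0, lower=0.25, upper=0.75, asymmetric=True):
    """Return (l_out, u_out) outlier bounds around the median.

    Hypothetical helper for illustration only; not the pylossless API.
    """
    data = np.asarray(data, dtype=float)
    q_low, q_mid, q_up = np.quantile(data, [lower, 0.5, upper])

    if asymmetric:
        # Proposed variant: separate distances on each side of the median.
        lower_dist = q_mid - q_low
        upper_dist = q_up - q_mid
    else:
        # Variant described as the current behaviour: the full IQR on both sides.
        lower_dist = upper_dist = q_up - q_low

    # Assumption: distances are scaled by k and offset from the median.
    l_out = q_mid - k * lower_dist
    u_out = q_mid + k * upper_dist
    return l_out, u_out
```

On skewed data the two variants can differ noticeably: for `[1, 2, 3, 10, 20]` with `k=1`, the asymmetric bounds follow the longer upper tail, while the symmetric version widens both sides by the same amount.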