Open rempsyc opened 7 months ago
This issue is not seen with Mahalanobis:
library(Routliers)
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()
outliers_mahalanobis(df[1:2])$nb
#> total
#> 1
outliers_mahalanobis(df[1:3])$nb
#> total
#> 1
outliers_mahalanobis(df)$nb
#> total
#> 0
Created on 2024-02-02 with reprex v2.0.2
With only two variables, the MCD method seems fine: few outliers are detected (1%). With just one more variable (3), however, it flags a whopping 14% of observations as outliers, even though every variable is well behaved (normally distributed). With a more realistic number of variables for real datasets (25), approximately half the sample (47%) is flagged. Doesn't that seem high to you? Do you have any explanation or comments?
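For anyone who wants to cross-check these rates without Routliers, here is a rough sketch. It is not taken from the package internals. It assumes a chi-square cutoff on squared Mahalanobis distances with `alpha = 0.01`, and it uses `MASS::cov.mcd()` with its default settings as a stand-in MCD estimator. The helper `flag_rate()` is hypothetical and only for illustration, so the numbers it returns may differ from what `outliers_mcd()` reports.

```r
# Sketch: share of rows flagged by a chi-square cutoff on squared
# Mahalanobis distances, using classical vs. MCD location and scatter.
# Assumptions: alpha = 0.01 and MASS::cov.mcd() defaults; Routliers may
# use different settings internally.
library(MASS)

flag_rate <- function(x, alpha = 0.01, robust = FALSE) {
  x <- as.matrix(x)
  if (robust) {
    est <- MASS::cov.mcd(x)      # MCD-based location and scatter
    center <- est$center
    scatter <- est$cov
  } else {
    center <- colMeans(x)        # classical mean and covariance
    scatter <- stats::cov(x)
  }
  d2 <- stats::mahalanobis(x, center, scatter)
  cutoff <- stats::qchisq(1 - alpha, df = ncol(x))
  mean(d2 > cutoff)              # proportion of rows above the cutoff
}

set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

# Flag rates for the first 2, 3, and 25 columns
rates <- sapply(c(2, 3, 25), function(p) {
  c(classical = flag_rate(df[1:p]),
    mcd = flag_rate(df[1:p], robust = TRUE))
})
colnames(rates) <- c("p = 2", "p = 3", "p = 25")
round(rates, 2)
```

If the `mcd` row climbs with the number of columns while the `classical` row stays near `alpha`, that would suggest the pattern comes from how the MCD estimate behaves at this sample size rather than from anything specific to this dataset.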