Extremely high outlier detection rate in `outliers_mcd()` with large datasets

With only two variables, `outliers_mcd()` seems fine and flags few outliers (1 of 100 observations, or 1%). However, with just one more variable (three in total), it flags a whopping 14% of the sample as outliers, even though the variables are well-behaved (normally distributed). With a number of variables more typical of real datasets (25), approximately half the sample (47%) is flagged. Doesn’t that seem high to you? Do you have any explanation or comments on this?

library(Routliers)

set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

rempsyc::nice_normality(df, "V1")


set.seed(42)
outliers_mcd(df[1:2])$nb
#> total 
#>     1

set.seed(42)
outliers_mcd(df[1:3])$nb
#> total 
#>    14

set.seed(42)
outliers_mcd(df)$nb
#> total 
#>    47

Created on 2024-02-02 with reprex v2.0.2
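
To put the counts above in context, one possible baseline is to flag observations with classical (non-robust) Mahalanobis distances against a chi-square cutoff on the same simulated data. This is only a sketch in base R: the helper `count_classical_outliers()` is a hypothetical name, and `alpha = 0.01` is an assumed cutoff that may not match what `outliers_mcd()` applies internally.

```r
# Baseline sketch: classical Mahalanobis distances with a chi-square cutoff.
# Assumption: alpha = 0.01 is illustrative and may differ from the
# threshold used inside outliers_mcd().
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

# Hypothetical helper: counts rows whose squared Mahalanobis distance
# (based on the ordinary sample mean and covariance) exceeds the cutoff.
count_classical_outliers <- function(data, alpha = 0.01) {
  x <- as.matrix(data)
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  cutoff <- qchisq(1 - alpha, df = ncol(x))
  sum(d2 > cutoff)
}

# Number of flagged rows using the first 2, 3, and all 25 variables
sapply(c(2, 3, 25), function(p) count_classical_outliers(df[seq_len(p)]))
```

If the classical counts stay close to the nominal rate while the MCD counts grow with the number of variables, that would suggest the inflation comes from the robust estimation step rather than from the data.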

mdelacre/Routliers, issue #8