mdelacre / Routliers


Extremely high outlier detection rate in `outliers_mcd()` with large datasets #8

Open rempsyc opened 7 months ago

rempsyc commented 7 months ago

With only two variables, the MCD method seems fine: it flags few outliers (1%). With just one more variable (three in total), however, it flags a whopping 14% of observations as outliers, even though the variables are well-behaved (normally distributed). With a number of variables more typical of real datasets (25), approximately half the sample (47%) is flagged. Doesn't that seem high to you? Do you have any explanation or comments on this?

library(Routliers)

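# Simulate 100 observations of 25 independent normal variables (mean 100, SD 10)
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

# Visual normality check for the first variable
rempsyc::nice_normality(df, "V1")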

# Seed is reset before each call so every result is reproducible on its own
set.seed(42)
outliers_mcd(df[1:2])$nb
#> total 
#>     1

set.seed(42)
outliers_mcd(df[1:3])$nb
#> total 
#>    14

set.seed(42)
outliers_mcd(df)$nb
#> total 
#>    47

Created on 2024-02-02 with reprex v2.0.2
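
To see how the flagged count changes between 3 and 25 variables, the three calls above can be generalised into a loop. This is only a minimal sketch: it reuses `df` and `outliers_mcd()` exactly as in the reprex, while the helper name `count_mcd_outliers` and the grid of column counts are illustrative assumptions. It also assumes that `$nb` keeps a `total` element, as the printed output above suggests.

```r
library(Routliers)

# Same simulated data as in the reprex above
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

# Hypothetical helper: number of rows flagged by outliers_mcd()
# when only the first k columns are used. The seed is reset on
# every call, mirroring the reprex.
count_mcd_outliers <- function(data, k, seed = 42) {
  set.seed(seed)
  unname(outliers_mcd(data[seq_len(k)])$nb["total"])
}

# Illustrative grid of variable counts (an assumption, not from the issue)
ks <- c(2, 3, 5, 10, 15, 20, 25)
flagged <- vapply(ks, function(k) count_mcd_outliers(df, k), numeric(1))

# One row per number of variables, with the share of the sample flagged
data.frame(
  n_variables = ks,
  n_flagged   = flagged,
  pct_flagged = 100 * flagged / nrow(df)
)
```

The resulting table would show whether the flagged share rises steadily with the number of variables or jumps at a particular point.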

rempsyc commented 7 months ago

This issue is not seen with Mahalanobis:

library(Routliers)

# Same simulated data as in the previous comment
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

outliers_mahalanobis(df[1:2])$nb
#> total 
#>     1

outliers_mahalanobis(df[1:3])$nb
#> total 
#>     1

outliers_mahalanobis(df)$nb
#> total 
#>     0

Created on 2024-02-02 with reprex v2.0.2
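
For reference, the classical Mahalanobis rule can be sketched in base R. This is not the code inside `outliers_mahalanobis()`. It is a minimal illustration that assumes the common convention of comparing squared distances with a chi-square quantile, and `alpha = 0.01` is chosen here purely for illustration. The helper name `flag_mahalanobis` is hypothetical.

```r
# Same simulated data as in the reprex above
set.seed(42)
df <- rnorm(n = 100, mean = 100, sd = 10) |>
  replicate(n = 25) |>
  as.data.frame()

# Hypothetical helper: flag rows whose squared Mahalanobis distance
# (based on the classical mean and covariance) exceeds the chi-square
# quantile with p degrees of freedom. alpha is an assumed threshold.
flag_mahalanobis <- function(data, alpha = 0.01) {
  x <- as.matrix(data)
  d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
  cutoff <- qchisq(1 - alpha, df = ncol(x))
  d2 > cutoff
}

# Number of flagged rows for 2, 3, and 25 variables
sapply(c(2, 3, 25), function(k) sum(flag_mahalanobis(df[seq_len(k)])))
```

Because the centre and covariance here are estimated from the full sample, this rule gives a baseline to set beside the robust MCD counts. Its cutoff may differ from the one Routliers uses.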