sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/

Avoiding oversuppression for small problems in kAnon #354

Open matthias-da opened 1 month ago

matthias-da commented 1 month ago

This example, from MonteroSerrano, Javier, shows the oversuppression problem in practice.

Overprotection in 6x2 example

# Example with 6x2 data frame where kAnon (k = 3) makes 3 suppressions, 
# while 1 suppression would have been enough.
# (Note: 3 suppressions would be needed with alpha = 0, but not with alpha = 1).

library(sdcMicro)

# Create data
data_3 <- data.frame(
    gender = c("male", "male", "male", "male", "male", "male"),
    education = c("no education", "primary", "primary", "primary", "secondary", "secondary"))

# Create sdc object
sdc_data_3 <- createSdcObj(data_3, keyVars = c("gender", "education"), alpha = 1)

# kAnon with k = 3 makes 3 suppressions, but 1 suppression would have been enough.
sdc_data_kAnon <- kAnon(sdc_data_3, k = 3)
extractManipData(sdc_data_kAnon)
print(sdc_data_kAnon, "kAnon")

# Manually forcing 1 suppression generates data that already comply with 3-anonymity: 
data_3_edited         <- data_3
data_3_edited[1,2]    <- NA_character_
sdc_data_kAnon_manual <- createSdcObj(data_3_edited, keyVars = c("gender", "education"), alpha = 1)
print(data_3_edited)
print(sdc_data_kAnon_manual, "kAnon")

The reason is that kAnon uses a heuristic algorithm, which can lead to oversuppression.

Idea for an extension: implement a mixed-integer linear programming solution that finds an optimal suppression pattern for small problems. Guidance is given in Ton de Waal's book, Handbook of Statistical Data Editing and Imputation (Wiley).
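
To illustrate what "optimal" would mean for the 6x2 example, here is a minimal base R sketch that finds a minimum-size suppression pattern by exhaustive search. It is not a mixed-integer programming implementation and not part of sdcMicro: match_counts() and min_suppression() are hypothetical helper names, and treating NA as matching any category is an assumption meant to mirror alpha = 1. The search is exponential in the number of cells, so it only makes sense for very small problems.

```r
# Hypothetical helper: for each record, count the records with a compatible key.
# Assumption: NA matches any category (intended to mirror alpha = 1).
match_counts <- function(df) {
  n <- nrow(df)
  vapply(seq_len(n), function(i) {
    compatible <- rep(TRUE, n)
    for (v in names(df)) {
      xi <- df[[v]][i]
      compatible <- compatible & (is.na(xi) | is.na(df[[v]]) | df[[v]] == xi)
    }
    sum(compatible)
  }, integer(1))
}

# Hypothetical helper: try all patterns with 0, 1, 2, ... suppressed cells and
# return the first data frame in which every record has a count of at least k.
min_suppression <- function(df, k) {
  if (all(match_counts(df) >= k)) return(df)
  cells <- expand.grid(row = seq_len(nrow(df)), col = seq_len(ncol(df)))
  for (m in seq_len(nrow(cells))) {
    patterns <- combn(nrow(cells), m)  # one column per candidate pattern
    for (p in seq_len(ncol(patterns))) {
      candidate <- df
      for (idx in patterns[, p]) {
        candidate[cells$row[idx], cells$col[idx]] <- NA
      }
      if (all(match_counts(candidate) >= k)) return(candidate)
    }
  }
  NULL  # no pattern works, e.g. fewer than k records
}

data_3 <- data.frame(
  gender = c("male", "male", "male", "male", "male", "male"),
  education = c("no education", "primary", "primary", "primary", "secondary", "secondary"))

optimal <- min_suppression(data_3, k = 3)
print(optimal)                # only education in row 1 is suppressed
print(match_counts(optimal))  # 6 4 4 4 3 3
```

Under these assumptions the search returns the same single suppression as the manual edit above (row 1, education). An integer programming formulation would replace the enumeration with a solver, but the objective, the fewest suppressed cells subject to every record reaching k, would stay the same.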