sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/

localSuppression(): unable to produce k-anonymous dataset #225

Closed ElMuto closed 7 years ago

ElMuto commented 7 years ago

Using this input

Age_Group;Sex;State
[60,80[;M;CAL
[60,80[;M;CAL
[60,80[;F;CAL
[60,80[;F;CAL
[20,50[;M;MS
[20,50[;M;MS
[20,50[;M;CAL
[20,50[;M;CAL
[20,50[;M;CAL

and this R code

require(sdcMicro)

inputdata <- readMicrodata(path="C:/temp/example-in.csv", type="csv", convertCharToFac=TRUE, drop_all_missings=TRUE, header=TRUE, sep=";")

datasetVars = colnames(inputdata)
qiVars      = datasetVars

sdcObj <- createSdcObj(dat=inputdata, keyVars=qiVars,  numVars=NULL,  weightVar=NULL, hhId=NULL, strataVar=NULL, pramVars=NULL, excludeVars=NULL, seed=0, randomizeRecords=FALSE, alpha=c(1))
sdcObj = localSuppression(sdcObj, k=3, importance = NULL, combs = NULL)

writeSafeFile(obj=sdcObj, format="csv", randomizeRecords="no", col.names=TRUE, sep=";", dec=".", fileOut="C:/temp/example-out.csv")

I get this result:

"Age_Group";"Sex";"State"
"1";"[60,80[";NA;"CAL"
"2";"[60,80[";NA;"CAL"
"3";"[60,80[";"F";"CAL"
"4";"[60,80[";"F";"CAL"
"5";"[20,50[";"M";NA
"6";"[20,50[";"M";NA
"7";"[20,50[";"M";"CAL"
"8";"[20,50[";"M";"CAL"
"9";"[20,50[";"M";"CAL"

Although k is set to 3 in the R code above, rows 5 and 6 in the resulting dataset form an equivalence class of size 2. Therefore the resulting dataset is only 2-anonymous (similar behaviour with k=4, k=5, etc).
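The equivalence-class count described here can be reproduced with a few lines of base R. This is only a sketch of the "NA is a category of its own" reading, not sdcMicro's internal frequency calculation; the names `out`, `key` and `class_size` are made up for illustration.

```r
# Output table from above, with suppressed cells written as NA.
out <- data.frame(
  Age_Group = c(rep("[60,80[", 4), rep("[20,50[", 5)),
  Sex       = c(NA, NA, "F", "F", "M", "M", "M", "M", "M"),
  State     = c("CAL", "CAL", "CAL", "CAL", NA, NA, "CAL", "CAL", "CAL"),
  stringsAsFactors = FALSE
)

# One key per row. paste() turns NA into the string "NA",
# so a suppressed cell behaves like a category of its own.
key <- do.call(paste, c(out, sep = "|"))

# Size of the equivalence class that each row belongs to.
class_size <- as.vector(table(key)[key])
print(cbind(out, class_size))
print(min(class_size))
```

Under this reading the smallest class has size 2, so the output would only count as 2-anonymous.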

What am I doing wrong?

matthias-da commented 7 years ago

Hi,

thanks for this message.

Can you please play around with the argument alpha (in the app, this parameter can be set between 0 and 1 when setting up the SDC problem)? On the command line, it then looks like this:

localSuppression(sdcObj, k=3, importance = NULL, combs = NULL, alpha = 0)

It's all about how you count frequencies with missing values. With the default alpha = 1, rows 5 and 6 even fulfil 5-anonymity, while with alpha = 0 this is not the case. You can find more details here: http://www.springer.com/de/book/9783319502700
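To illustrate the counting idea for the default case, here is a minimal base-R sketch in which NA acts as a wildcard that matches any value. It only models the alpha = 1 reading as described in this comment; it is not sdcMicro's actual code, it does not model intermediate alpha values, and the helpers `compatible` and `wildcard_count` are hypothetical.

```r
# Output table from the first post, with suppressed cells as NA.
out <- data.frame(
  Age_Group = c(rep("[60,80[", 4), rep("[20,50[", 5)),
  Sex       = c(NA, NA, "F", "F", "M", "M", "M", "M", "M"),
  State     = c("CAL", "CAL", "CAL", "CAL", NA, NA, "CAL", "CAL", "CAL"),
  stringsAsFactors = FALSE
)

# Two cells are compatible if they are equal or if either one is NA.
compatible <- function(a, b) is.na(a) | is.na(b) | a == b

# For each row, count the rows that are compatible with it in every column.
wildcard_count <- function(df) {
  sapply(seq_len(nrow(df)), function(i) {
    ok <- rep(TRUE, nrow(df))
    for (v in names(df)) {
      ok <- ok & compatible(df[[v]], df[[v]][i])
    }
    sum(ok)
  })
}

fk <- wildcard_count(out)
print(cbind(out, fk))
```

In this sketch rows 5 and 6 each get a count of 5, which agrees with the 5-anonymity remark above, whereas counting NA as a separate category would put them in a class of size 2.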

matthias-da commented 7 years ago

it's fully implemented thus issue closed.

ElMuto commented 7 years ago

Thank you very much for your response. I understand now how k-anonymity is calculated in sdcMicro.

Is there a way to instruct sdcMicro to treat missing values as a category of their own (as described in section 3.2.2 of the reference you sent me)?

My apologies for asking questions like this here - I'll be happy to switch to another channel, if you prefer.

matthias-da commented 7 years ago

if I'm not wrong, this is with alpha=0

bernhard-da commented 7 years ago

also, you can recode the missings (NA) into something different.
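A minimal sketch of such a recoding in base R, applied to a data frame that already contains suppressed cells. The helper `recode_na` and the label "missing" are hypothetical choices for illustration and are not part of sdcMicro.

```r
# Example data with one suppressed cell (NA) in Status.
dat <- data.frame(
  Region    = c("A", "A", "A", "A", "A"),
  Status    = c("Single", "Married", "Married", "Single", NA),
  Age_group = c("30-49", "30-49", "30-49", "30-49", "30-49"),
  stringsAsFactors = TRUE
)

# Replace NA by an explicit label and return a factor again.
recode_na <- function(x, label = "missing") {
  x <- as.character(x)
  x[is.na(x)] <- label
  factor(x)
}

# Apply the recoding to every column while keeping the data frame structure.
dat[] <- lapply(dat, recode_na)
print(dat)
print(levels(dat$Status))
```

After the recoding, any frequency count sees "missing" as an ordinary level, so the former NA cells no longer act as wildcards.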

ElMuto commented 7 years ago

Thanks again for your response.

I am not sure if setting alpha=0 will produce the desired result.

To make my point more clear, I created another example which is based on section 4.2.2.1 of the textbook you have recommended. What I'm trying to achieve is to produce the result that is presented for "Method 5 (own category)" in Table 4.2. For your convenience, I restructured the code so that the example is self-contained:

require(sdcMicro)
Region    <- c("A","A","A","A","A")
Status    <- c("Single","Married","Married","Single","Widow")
Age_group <- c("30-49","30-49","30-49","30-49","30-49")
dataset <- data.frame(Region,Status,Age_group)

# Works
sdc <- createSdcObj(dataset, keyVars=c('Region', 'Status', 'Age_group'), alpha=1)
sdc = localSuppression(sdc, k=3, importance = NULL, combs = NULL)
print(sdc, "kAnon")

# Loops (forever?)
sdc <- createSdcObj(dataset, keyVars=c('Region', 'Status', 'Age_group'), alpha=0)
sdc = localSuppression(sdc, k=3, importance = NULL, combs = NULL)
print(sdc, "kAnon")

Unfortunately, I'm not able to verify my assumption, since localSuppression() seems not to terminate when using alpha<>1.

matthias-da commented 7 years ago

Cannot work when alpha = 0

Even when all values of the other variables are set to NA, you still do not fulfil k-anonymity as soon as you interpret Widow as its own category.

So even here you don't have k-anonymity for alpha = 0. This is easy to see:

  Region Status Age_group
1      A     NA     30-49
2      A     NA     30-49
3      A     NA     30-49
4      A     NA     30-49
5      A  Widow     30-49

bernhard-da commented 7 years ago

but we'll probably provide a "fix" or at least a better solution

matthias-da commented 7 years ago

oh, I see, but this would be a solution

  Region Status Age_group
1      A     NA     30-49
2      A     NA     30-49
3      A     NA     30-49
4      A     NA     30-49
5      A     NA     30-49

This will go on the to-do list. I expect that this rarely occurs with real-world data, but it should be solved in any case.

ElMuto commented 7 years ago

Hello matthias-da and bernhard-da,

I really appreciate your support with this issue! I have just been able to identify a working example that supports my assumption (that alpha=0 would not result in sdcMicro applying Method 5 (own category)). It is taken from Table 4.1 in the same book. AFAICS, it is basically identical to the last example, except for the value of k, which is 2 here (and, of course, alpha, which is set to 0, as suggested).

require(sdcMicro)
Region    <- c("A","A","A","A","A")
Status    <- c("Single","Married","Married","Single","Widow")
Age_group <- c("30-49","30-49","30-49","30-49","30-49")
dataset <- data.frame(Region,Status,Age_group)

sdc <- createSdcObj(dataset, keyVars=c('Region', 'Status', 'Age_group'),
                    alpha=0)
sdc = localSuppression(sdc, k=2, importance = NULL, combs = NULL)
print(sdc, "kAnon")
extractManipData(sdc)

In this example, the anonymized data

  Region  Status Age_group
1      A  Single     30-49
2      A Married     30-49
3      A Married     30-49
4      A  Single     30-49
5      A    <NA>     30-49

is not 2-anonymous, if <NA> is considered as an own category. The result rather suggests that using alpha=0 leads to sdcMicro using Method 2 (conservative) in Table 4.1.

If my assumption is correct, it would be great to know whether there is a way to achieve k-anonymity using localSuppression() while missing values are treated as a category of their own.

Kind regards

matthias-da commented 7 years ago

Hi, I'm sorry, but I can't look at the tables again right now, or at how we implemented this and what our philosophy behind it was. I only know that the table is very safe if the attacker does not know what NA can be (which I think is the case in practice). I'm on holiday now until August. Best,

bernhard-da commented 7 years ago

hi @ElMuto, thx for the example. i just pushed an update to the next-branch. could you please verify that everything works as expected?

ElMuto commented 7 years ago

hi @bernhard-da, I just tested it with a couple of datasets. The execution time issues are definitely resolved. k-anonymity in some small datasets seems to work as expected. In bigger datasets there seems to be a small fraction of datasets violating k-anonymity (in my case: k=5). However, afaics at the moment, that's fine for me. But if you're interested in the test data, please just let me know.

Anyways: thanks a l o t for your help !

bernhard-da commented 7 years ago

hi @ElMuto, thx for confirming.

it would be nice if you could give us a problem instance in which k/5-anonymity was not achieved.
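One possible way to pull such a problem instance out of a larger result is sketched below in base R. It assumes the "NA is its own category" reading used earlier in this thread; the helper `violating_rows` is hypothetical and not an sdcMicro function.

```r
# Return the rows whose equivalence class (NA counted as its own
# category) has fewer than k members.
violating_rows <- function(df, keys, k) {
  # paste() writes NA as "NA"; this assumes no genuine "NA" string in the data.
  key  <- do.call(paste, c(df[keys], sep = "|"))
  size <- ave(seq_along(key), key, FUN = length)
  df[size < k, , drop = FALSE]
}

# Anonymized data from the k = 2 example earlier in the thread.
example <- data.frame(
  Region    = rep("A", 5),
  Status    = c("Single", "Married", "Married", "Single", NA),
  Age_group = rep("30-49", 5),
  stringsAsFactors = FALSE
)

bad <- violating_rows(example, c("Region", "Status", "Age_group"), k = 2)
print(bad)
```

For the small example only row 5 is returned; on a bigger dataset the same call with k = 5 would list the records to include in a bug report.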