sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/

how to change the manipulated data in sdc object? #262

Closed orchardlucky closed 6 years ago

orchardlucky commented 6 years ago

We can extract the manipulated data from an object of class 'sdcMicroObj' after applying SDC methods. I wonder whether we can also change the values of the manipulated data in such an object directly, without applying any SDC method. Thanks.

bernhard-da commented 6 years ago

hi @orchardlucky

manipulating values directly within the sdcMicro-object is possible, though not recommended, because risk and utility measures are not recalculated if you do so.

to modify the data, see the following example:

## create obj
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
  numVars=c('expend','income','savings'), w='sampling_weight')

## apply an anonymization method to categorical and numerical vars
sdc <- kAnon(sdc, k=4)
sdc <- addNoise(sdc,variables=c("expend","income"))

## manipulated data are stored in slots named 'manipXYZ'
slotNames(sdc)
 [1] "origData"          "keyVars"           "pramVars"          "numVars"          
 [5] "ghostVars"         "weightVar"         "hhId"              "strataVar"        
 [9] "sensibleVar"       "manipKeyVars"      "manipPramVars"     "manipNumVars"     
[13] "manipGhostVars"    "manipStrataVar"    "originalRisk"      "risk"             
[17] "utility"           "pram"              "localSuppression"  "options"          
[21] "additionalResults" "set"               "prev"              "deletedVars"   

## modified key variables in slot @manipKeyVars, this is just a data.frame
df <- get.sdcMicroObj(sdc, "manipKeyVars"); head(df, 3)
 urbrur roof walls water electcon relat sex
      2    4     3     3        1     1   1
      2    4     3     3        1     2   2
      2    4     3     3        1     3   1

## modify and update sdcMicro-Obj
df[1,1] <- NA
slot(sdc, "manipKeyVars") <- df
head(get.sdcMicroObj(sdc, "manipKeyVars"), 3)
 urbrur roof walls water electcon relat sex
     NA    4     3     3        1     1   1
      2    4     3     3        1     2   2
      2    4     3     3        1     3   1

## same approach also works for manipulated numerical key variables, available
## in slot @manipNumVars
df <- get.sdcMicroObj(sdc, "manipNumVars"); head(df, 3)
    expend       income   savings
 128066004     71023.44  116258.5
  29366780 -26631903.83  279345.0
 -34111570  90253281.70 5495381.0

## update and modify the sdc-obj
df[1,1] <- NA
slot(sdc, "manipNumVars") <- df
head(get.sdcMicroObj(sdc, "manipNumVars"), 3)
    expend       income   savings
        NA     71023.44  116258.5
  29366780 -26631903.83  279345.0
 -34111570  90253281.70 5495381.0
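
The note above that risk measures are not updated after a manual edit can be illustrated with a minimal sketch. It assumes that `calcRisks()`, which sdcMicro documents for recomputing frequencies and risk after an object has been changed by hand, is available in the installed version; the variable names `risk_stale` and `risk_fresh` are made up for illustration.

```r
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")

# overwrite one value directly, as in the example above
df <- get.sdcMicroObj(sdc, "manipKeyVars")
df[1, 1] <- NA
slot(sdc, "manipKeyVars") <- df

# the risk slot still holds the values computed before the manual edit
risk_stale <- get.sdcMicroObj(sdc, "risk")$global$risk

# assumption: calcRisks() recomputes the categorical risk from the current data
sdc <- calcRisks(sdc)
risk_fresh <- get.sdcMicroObj(sdc, "risk")$global$risk
```

This only covers the risk for the categorical key variables; measures for numerical key variables and the utility slot would have to be refreshed separately.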

If you just want to add some variables after your anonymization process is finished, you can also use extractManipData() (see ?extractManipData) to return the protected (manipulated) data and modify that data.frame.
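
A minimal sketch of that alternative, assuming the same `testdata` setup as in the example above. The column `release_id` is a hypothetical variable added only for illustration.

```r
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"), w = "sampling_weight")
sdc <- kAnon(sdc, k = 4)

# pull the protected data out of the object as a plain data.frame
protected <- extractManipData(sdc)

# add a new variable outside the sdcMicro object (hypothetical example column)
protected$release_id <- seq_len(nrow(protected))
```

Because the change happens on the extracted data.frame, the risk and utility values stored in `sdc` still describe the data as it was when it was extracted.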

orchardlucky commented 6 years ago

Dear @bernhard-da,

Thanks.

I have some further questions.

The first one is: does the parameter "keyVars" in the 'createSdcObj' function refer to the categorical key variables?

In your previous answer you wrote '## modified key variables in slot @manipKeyVars, this is just a data.frame'. Does 'key variables' in this sentence also refer to the categorical key variables? So is "key variables" short for "categorical key variables"?

My other question is:

Assume we start with an original data set. Is it right that, before applying any SDC method, we can only measure the disclosure risk based on the categorical key variables, and that the disclosure risk based on the continuous key variables can only be measured after applying SDC methods?

Assume that the disclosure risk for the categorical key variables is 1. After applying local suppression to the categorical key variables and microaggregation to the continuous key variables, we get the anonymized data. Now we get a new disclosure risk for the categorical key variables; assume it is 0.8.

We also measure the disclosure risk for the continuous key variables; assume it is 10.

Then I undo the two previous SDC methods (local suppression and microaggregation) and instead apply recoding to the categorical key variables and noise addition to the continuous key variables. Now we get a new disclosure risk for the categorical key variables; assume it is 0.4, which is lower than before.

The disclosure risk for the continuous key variables is now 20, which is higher than before.

In this case, how can I compare these two ways of anonymizing the data? The risk measures for the categorical and the continuous key variables contradict each other. Which way is better?

bernhard-da commented 6 years ago

hi @orchardlucky

1) yes, i was referring to categorical key variables when using the term key-variables. Slot keyVars in the object contains the (numerical) indices of the categorical key variables in the original input data set (available in slot origData).

2) i don't think that there is a "better". risk-measures are always based on comparing the current state of your "anonymized" dataset (given all the methods that you've applied) against the original, unmodified data. In your example, you can judge that the application of noise has more impact on the numerical key variables (e.g. lower risk and also lower data utility) than microaggregation. Also, there are different risk measures available for categorical and for numerical key variables. But I do not see why you have a "contradiction" here ....

orchardlucky commented 6 years ago

hi @bernhard-da

Thank you very much.

  1. By contradiction I mean the following. Imagine that one day my boss comes to me and asks me to protect the original data in two ways. The first way is local suppression plus microaggregation. The second way is recoding plus noise addition. He also asks me to tell him which way is better for the data as a whole.

As you can see, for the second way the disclosure risk for the continuous key variables is higher than before. Let us further assume that the data utility for the continuous key variables is also higher than before.

However, for the second way the disclosure risk for the categorical key variables is lower than before. Let us further assume that the data utility for the categorical key variables is also lower than before.

So based on disclosure risk, the first way is better for the continuous key variables and the second way is better for the categorical key variables.

But based on data utility, the second way is better for the continuous key variables and the first way is better for the categorical key variables.

It confuses me. I do know for the continuous key variables the second way is better.

But what if my boss asks me which way is better for the whole dataset? I do not know the answer...

  2. Does lower risk always mean lower data utility?

Best

bernhard-da commented 6 years ago

@orchardlucky

well, i do not think that there is, or can be, a clear answer in this case. as you described: you can have methods that have a different impact on categorical and numerical variables.

You might (in advance) decide on a global risk measure (basically choosing how important numerical key variables are for you and how important categorical key variables are for the intended use case of the data) and compute this for different combinations of anonymization methods and decide on the combination that gives you the best (overall) result.
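
One way to picture such a pre-agreed global measure is a weighted sum of the two risk values. The sketch below is only an illustration in base R, not an sdcMicro feature: the helper `composite_risk()`, the weight `w_cat`, and all risk numbers are made up, and it assumes both risk values have already been put on a common 0-1 scale.

```r
# hypothetical helper: weighted combination of categorical and numerical risk,
# both assumed to be on a common 0-1 scale
composite_risk <- function(cat_risk, num_risk, w_cat = 0.5) {
  stopifnot(w_cat >= 0, w_cat <= 1)
  w_cat * cat_risk + (1 - w_cat) * num_risk
}

# made-up risk values for the two candidate strategies discussed in this thread
strategies <- data.frame(
  strategy = c("local suppression + microaggregation",
               "recoding + noise addition"),
  cat_risk = c(0.8, 0.4),
  num_risk = c(0.1, 0.2),
  stringsAsFactors = FALSE
)

# the weight is chosen in advance: here categorical risk counts for 70%
strategies$score <- composite_risk(strategies$cat_risk,
                                   strategies$num_risk, w_cat = 0.7)

# the strategy with the lowest combined risk under this weighting
best <- strategies$strategy[which.min(strategies$score)]
```

The ranking depends entirely on the chosen weight, which is why it has to be fixed before comparing methods; a utility measure could be folded into the score in the same way.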

Anyway, I am closing this here because it no longer has directly to do with sdcMicro.