Closed orchardlucky closed 6 years ago
hi @orchardlucky
Manipulating values directly within the sdcMicro object is possible, but not recommended, because the risk and utility measures are not recalculated if you do so.
To modify the data, see the following example:
## create obj
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
keyVars=c('urbrur','roof','walls','water','electcon','relat','sex'),
numVars=c('expend','income','savings'), w='sampling_weight')
## apply an anonymization method to the categorical and the numerical key variables
sdc <- kAnon(sdc, k=4)
sdc <- addNoise(sdc,variables=c("expend","income"))
## manipulated data are stored in slots named 'manipXYZ'
slotNames(sdc)
[1] "origData" "keyVars" "pramVars" "numVars"
[5] "ghostVars" "weightVar" "hhId" "strataVar"
[9] "sensibleVar" "manipKeyVars" "manipPramVars" "manipNumVars"
[13] "manipGhostVars" "manipStrataVar" "originalRisk" "risk"
[17] "utility" "pram" "localSuppression" "options"
[21] "additionalResults" "set" "prev" "deletedVars"
## modified key variables in slot @manipKeyVars, this is just a data.frame
df <- get.sdcMicroObj(sdc, "manipKeyVars"); head(df, 3)
urbrur roof walls water electcon relat sex
2 4 3 3 1 1 1
2 4 3 3 1 2 2
2 4 3 3 1 3 1
## modify and update sdcMicro-Obj
df[1,1] <- NA
slot(sdc, "manipKeyVars") <- df
head(get.sdcMicroObj(sdc, "manipKeyVars"), 3)
urbrur roof walls water electcon relat sex
NA 4 3 3 1 1 1
2 4 3 3 1 2 2
2 4 3 3 1 3 1
## the same approach also works for the manipulated numerical key variables,
## which are available in slot @manipNumVars
df <- get.sdcMicroObj(sdc, "manipNumVars"); head(df, 3)
expend income savings
128066004 71023.44 116258.5
29366780 -26631903.83 279345.0
-34111570 90253281.70 5495381.0
## update and modify the sdc-obj
df[1,1] <- NA
slot(sdc, "manipNumVars") <- df
head(get.sdcMicroObj(sdc, "manipNumVars"), 3)
expend income savings
NA 71023.44 116258.5
29366780 -26631903.83 279345.0
-34111570 90253281.70 5495381.0
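The caveat about measures not being recalculated can be made visible with a small sketch. It assumes that `calcRisks()` recomputes the categorical risk from the current content of slot `manipKeyVars`, and that the global risk is stored under `risk$global$risk`; both are assumptions about the installed sdcMicro version, so please check `?calcRisks` before relying on this.

```r
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  weightVar = "sampling_weight")

# global risk as stored in the object before the manual change
risk_before <- get.sdcMicroObj(sdc, "risk")$global$risk

# manual change: overwrite the slot directly, as in the example above
df <- get.sdcMicroObj(sdc, "manipKeyVars")
df[1, 1] <- NA
slot(sdc, "manipKeyVars") <- df

# the stored risk is now stale: assigning to a slot does not touch slot 'risk'
risk_stale <- get.sdcMicroObj(sdc, "risk")$global$risk
identical(risk_before, risk_stale)

# assumption: calcRisks() recomputes frequencies and categorical risk
sdc <- calcRisks(sdc)
risk_after <- get.sdcMicroObj(sdc, "risk")$global$risk
```

Comparing `risk_stale` with `risk_after` shows whether the manual edit changed the estimate. The numerical risk and the utility measures would need a separate update.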
If you just want to add some variables after the anonymization process is finished, you can also use extractManipData() (see ?extractManipData) to return the protected (manipulated) data and then modify that data.frame.
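A minimal sketch of that route, reusing the setup from the example above. The column name `record_id` is hypothetical and only illustrates adding a variable after anonymization.

```r
library(sdcMicro)
data(testdata)
sdc <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  weightVar = "sampling_weight")
sdc <- kAnon(sdc, k = 4)

# pull the protected data out of the object as an ordinary data.frame
final <- extractManipData(sdc)

# from here on this is plain data.frame handling;
# 'record_id' is a hypothetical variable added for illustration
final$record_id <- seq_len(nrow(final))
head(final, 3)
```

Because `final` is detached from the sdcMicro object, nothing done to it affects the risk and utility measures stored in `sdc`.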
Dear @bernhard-da,
Thanks.
I have some further questions.
The first one is: does the parameter "keyVars" in the 'createSdcObj' function refer to the categorical key variables?
In your previous answer you wrote '## modified key variables in slot @manipKeyVars, this is just a data.frame'. Does 'key variables' in this sentence also refer to the categorical key variables? So is "key variables" short for "categorical key variables"?
My other question is:
Assume we start with an original data set. Is it correct that, before applying any SDC methods, we can only measure the disclosure risk based on the categorical key variables, and that the disclosure risk based on the continuous key variables can only be measured after SDC methods have been applied?
Assume that the disclosure risk for the categorical key variables is 1. After applying local suppression to the categorical key variables and microaggregation to the continuous key variables, we get the anonymized data. Now we can compute a new disclosure risk for the categorical key variables; assume it is 0.8.
We also measure the disclosure risk for the continuous key variables; assume it is 10.
Then I undo the previous two SDC methods (local suppression and microaggregation) and instead apply recoding to the categorical key variables and noise addition to the continuous key variables. Now we get a new disclosure risk for the categorical key variables; assume it is 0.4, which is lower than before.
We also get the disclosure risk for the continuous key variables, which is 20 and therefore higher than before.
In this case, how can I compare these two ways of anonymizing the data? The risk measures for the categorical and the continuous key variables contradict each other. So which way is better?
hi @orchardlucky
1) Yes, I was referring to categorical key variables when using the term key variables. Slot keyVars in the object contains the (numerical) indices of the categorical key variables in the original input data set (available in slot origData).
2) I don't think that there is a "better". Risk measures are always based on comparing the current state of your "anonymized" data set (given all the methods that you have applied) against the original, unmodified data. In your example, you can judge that applying noise has more impact on the numerical key variables (e.g. lower risk and also lower data utility) than microaggregation. Also, there are different risk measures available for categorical and for numerical key variables. But I do not see why you have a "contradiction" here ...
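Both points can be checked on the object itself. The sketch below maps the indices in slot keyVars back to column names and then reads the stored risk values for two anonymization variants built from the same starting object. It assumes the categorical risk sits in `risk$global$risk` and the numerical risk in `risk$numeric`, which may differ between sdcMicro versions. The recoding step of the second variant is left out, and `summarise_risk()` is a hypothetical helper.

```r
library(sdcMicro)
data(testdata)
base <- createSdcObj(testdata,
  keyVars = c("urbrur", "roof", "walls", "water", "electcon", "relat", "sex"),
  numVars = c("expend", "income", "savings"),
  weightVar = "sampling_weight")

# slot 'keyVars' holds column positions in 'origData'; map them back to names
idx <- get.sdcMicroObj(base, "keyVars")
key_names <- colnames(get.sdcMicroObj(base, "origData"))[idx]

# variant A: local suppression + microaggregation
a <- localSuppression(base, k = 4)
a <- microaggregation(a)

# variant B: noise addition only (recoding omitted in this sketch)
b <- addNoise(base)

# hypothetical helper: collect the stored risk values of an object
summarise_risk <- function(obj) {
  r <- get.sdcMicroObj(obj, "risk")
  list(categorical = r$global$risk, numeric = r$numeric)
}

summarise_risk(a)
summarise_risk(b)
```

Starting both variants from `base` avoids having to undo methods before trying the alternative.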
hi @bernhard-da
Thank you very much.
So as you can see, for the second way the disclosure risk for the continuous key variables is higher than before. Let us further assume that the data utility for the continuous key variables is also higher than before.
However, for the second way the disclosure risk for the categorical key variables is lower than before. Let us further assume that the data utility for the categorical key variables is also lower than before.
So, based on the disclosure risk, the first way is better for the continuous key variables and the second way is better for the categorical key variables.
But based on the data utility, the second way is better for the continuous key variables and the first way is better for the categorical key variables.
It confuses me. I do know for the continuous key variables the second way is better.
But what if my boss asks me which way is better for the whole data set? I do not know the answer.
Best
@orchardlucky
Well, I do not think that there is, or can be, a clear answer in this case. As you described, methods can have a different impact on categorical and on numerical variables.
You might decide in advance on a global risk measure (basically choosing how important the numerical and the categorical key variables are for the intended use of the data), compute it for different combinations of anonymization methods, and pick the combination that gives you the best overall result.
Anyway, I am closing this issue because it no longer relates directly to sdcMicro.
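One way to make such a global measure concrete is a weighted sum of rescaled risks. The sketch below uses the hypothetical numbers from the question (0.8 / 10 and 0.4 / 20). The weights and the rescaling by the largest value among the candidates are assumptions chosen for illustration and are not part of sdcMicro.

```r
# hypothetical risk values taken from the example in this thread
candidates <- data.frame(
  method   = c("suppression + microaggregation", "recoding + noise"),
  cat_risk = c(0.8, 0.4),
  num_risk = c(10, 20)
)

# assumption: categorical key variables matter more for the intended use
w_cat <- 0.7
w_num <- 0.3

# the two risks live on different scales, so rescale each one
# by the largest value among the candidates before combining them
scaled_cat <- candidates$cat_risk / max(candidates$cat_risk)
scaled_num <- candidates$num_risk / max(candidates$num_risk)

# weighted overall risk; lower is better
candidates$score <- w_cat * scaled_cat + w_num * scaled_num
best <- candidates$method[which.min(candidates$score)]
candidates
best
```

With these weights the second combination scores lower (0.65 against 0.85). With equal weights both score 0.75, so the ranking depends entirely on the weights fixed in advance. Utility measures could be folded into the score in the same way.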
We can extract the manipulated data from an object of class 'sdcMicroObj' after applying SDC methods. I wonder whether we can also change the values of the manipulated data in the 'sdcMicroObj' object directly, without applying any SDC methods. Thanks.