sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/

dRisk.R and issue in interval measure definition #351

Closed leodecarlo closed 3 months ago

leodecarlo commented 3 months ago

Dear Developers,

A colleague and I think there is a problem with the dRisk() function (dRisk.R) in the sdcMicro package:

dRisk_link ,

The guide's section on the interval measure:

interval_measure

says: "intervals are created around each perturbed value and then a determination is made as to whether the original value of that perturbed observation is contained in this interval." We agree that this is what lines 84 to 87 of dRisk_link do: they count 1 when x is inside the interval created around x_m and 0 when x is outside.

However, the next lines in interval_measure seem to say something different from what the method does:

" Values that are within the interval around the initial value after perturbation are considered too close to the initial value and hence unsafe and need more perturbation. Values that are outside of the intervals are considered safe. "

and

"The result 1 indicates that all (100 percent) the observations are outside the interval of 0.1 times the standard deviation around the original values." ,

That is, the guide refers to intervals created around the original values, whereas the code creates the intervals around the perturbed values x_m. The guide also says that a value is counted as 0 when it is inside the interval and 1 when it is outside, whereas, as we understand it, the function does the opposite around x_m.
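To make the two readings concrete, here is a minimal base-R sketch. It is not the sdcMicro implementation: share_inside() is a hypothetical helper, the data are simulated, and taking the interval half-width as k times the standard deviation of the centering vector is our assumption.

```r
# Hypothetical helper (not part of sdcMicro): share of records whose value in
# `inner` lies within center +/- k * sd(center).
share_inside <- function(inner, center, k = 0.1) {
  half_width <- k * sd(center)
  mean(abs(inner - center) <= half_width)
}

set.seed(1)
x   <- rnorm(1000, mean = 100, sd = 10)  # simulated original values
x_m <- x + rnorm(1000, sd = 5)           # simulated perturbed values

# Reading A (how we understand the code): intervals around x_m,
# a record counts when the original x falls inside.
inside_perturbed <- share_inside(inner = x, center = x_m)

# Reading B (the guide's wording): intervals around x,
# a record counts when the perturbed x_m falls outside.
outside_original <- 1 - share_inside(inner = x_m, center = x)

print(c(reading_A = inside_perturbed, reading_B = outside_original))
```

The two quantities differ in two ways: which vector the interval is centered on (and hence which standard deviation sets its width), and whether "inside" or "outside" is what gets counted.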

The following R script demonstrates the unexpected behavior: as the noise in the perturbed values increases, dRisk() returns 1 for very high noise and very small values for very low noise. Here is the script:

library(sdcMicro)

# Example data set shipped with sdcMicro
data(testdata2)

keys <- c("sex", "age")   # categorical key variables
num_var <- c("expend")    # numeric variable to perturb

sdc1 <- createSdcObj(dat = testdata2, keyVars = keys, numVars = num_var)

# Very high noise
set.seed(100)
out <- addNoise(sdc1, noise = 500)
high_noise <- out@risk$numeric

# Negligible noise
set.seed(100)
out <- addNoise(sdc1, noise = 0.001)
low_noise <- out@risk$numeric

sprintf("Level of anonymity with insignificant noise %f. Level of anonymity with high noise %f", low_noise, high_noise)

We therefore think that part of the guide should be changed, and that dRisk() should either be changed or left as it is, depending on the actual intention (i.e. it can stay as it is if the guide's description is changed to match it).

bernhard-da commented 3 months ago

@leodecarlo Hi, thanks for your question. I agree that there is some ambiguity. Note that the sdcguide is not written by the maintainers of sdcMicro, so I would suggest creating an issue for the authors of the guide here