sdcTools / sdcMicro

sdcMicro
http://sdctools.github.io/sdcMicro/
79 stars 23 forks source link

Risk does not update in numeric risk calculation #306

Closed matthias-da closed 4 years ago

matthias-da commented 4 years ago

By preparing a course, I just noted that all methods for numeric key variables gives wrong estimates of the risk (i.e. risk is not changing, but should). This is new, previously this always worked correctly.

library(simPop)
data("eusilcP")
sdc2 <- createSdcObj(eusilcP,  
  keyVars = c("age", "gender", "ecoStat", 
              "citizenship"), 
  numVars = "eqIncome",   
  pramVars = "region", 
  hhId = "hid"                
)

n <- nrow(eusilcP)
head(eusilcP$eqIncome)
head(eusilcP$eqIncome * rnorm(n, mean = 1, sd = 10)) # HUGE Noise

set.seed(123)
eqInc_noise <- eusilcP$eqIncome * rnorm(n, mean = 1, sd = 10) # HUGE Noise
sdc2@manipNumVars$eqIncome <- eqInc_noise
sdc2 <- calcRisks(sdc2)
print(sdc2, "numrisk")

--> Nothing changed in reported risk, also in sdc@risk$numeric nothing changed, even we added a HUGE noise.

head(as.numeric(sdc2@origData[, sdc2@numVars]))

11128.45 11128.45 19694.85 19694.85 5066.24 31480.01

head(sdc2@manipNumVars)

-51243.79 -14486.73 326680.17 33581.38 11616.27 571382.56

Thus numeric risk should be low now, but still it reports 100% (1).

In case anybody would find time for this, this would be great.

matthias-da commented 4 years ago

P.S. this also happens when addNoise() or microaggregation are directly applied on the sdcMicroObj object. Also here the risk does not update.

matthias-da commented 4 years ago

P.S. it works for other data sets. But so far, nothing seems to be wrong with eusilcP. The numeric key variable is of class numeric, thus it should also work for this data set.

matthias-da commented 4 years ago

Another example: (also does not update the risks)

sdc2 <- createSdcObj(eusilc13puf,
  keyVars = c("age","rb090","pl031","pb220a"),
  numVars = "pgrossIncome",
  pramVars = "db040", # this is region
  hhId = "db030",
  weightVar = "weight"
)
sdc2 <- microaggregation(sdc2)
print(sdc2, "numrisk")
sdc2 <- addNoise(sdc2, noise = 50000)
print(sdc2, "numrisk")
# also no update of the risk
# try
sdc2 <- calcRisks(sdc2)
print(sdc2, "numrisk")
# neither
matthias-da commented 4 years ago

P.S.: eusilc13puf was used from package simPop

bernhard-da commented 4 years ago

@matthias-da

in dRisk.R (starting from line 75) we actually want to compute a the risk measure as the ratio of observations for which at least one original value of numerical key variables x lies within intervals mi and ma that are computed using perturbed values xm . if an observation is not perturbed "enough", we consider it risky. the relevant code is

w <- which(rowSums(x < mi | x > ma, na.rm = TRUE) %in% 0:1)
as.numeric(length(w)/nrow(x))

which in my opinion is wrong and would be required to be changed to:

w <- rowSums(x < mi | x > ma, na.rm = TRUE) == 0
as.numeric(sum(w) / nrow(x))

so basically it is not that the risks are not updating but that they are (wrongly) computed. what do you think?

matthias-da commented 4 years ago

Definitely. Thank you. I wonder why we got reasonable results in the past with it.

It should be just

w <- rowSums(x < mi | x > ma, na.rm = TRUE) == 0
mean(w)

Already uploaded it. Thanks again.

matthias-da commented 4 years ago

Fixed thanks to Bernhard.