privacytoolsproject / PSI-Library

R library of differentially private algorithms for exploratory data analysis

Fix bug in dpMean$release to fix "variable 'fun' not found" error #49

Open MeganFantes opened 5 years ago

MeganFantes commented 5 years ago

Bug found 6/4

error:

When running the dp-mean.Rmd vignette, the line boot_mean$release(PUMS5extract10000) fails with the error:

Error in formals(targetFunc) : object 'fun' not found

source:

mechanism-bootstrap.R, line 41:

mechanismBootstrap$methods(
    bootStatEval = function(xi) {
        fun.args <- getFuncArgs(fun, inputList=list(...), inputObject=.self)
        input.vals = c(list(x=x), fun.args)
        stat <- do.call(boot.fun, input.vals)
        return(stat)
})

getFuncArgs uses a variable called fun, but there is no fun parameter in the method signature.
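The same failure can be reproduced in isolation. This is only a sketch: get_args and boot_stat_eval are hypothetical stand-ins, not the library's functions. The body refers to fun, which is neither a parameter nor defined in any enclosing environment. Because R evaluates arguments lazily, the error only surfaces when formals() finally forces the argument, which would explain why the message names formals rather than bootStatEval.

```r
# Hypothetical stand-in for getFuncArgs: it forces its argument via formals().
get_args <- function(targetFunc) {
  names(formals(targetFunc))
}

# Mirrors the buggy method: `fun` is not in the signature and is not defined anywhere.
boot_stat_eval <- function(xi) {
  get_args(fun)
}

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  boot_stat_eval(c(1, 2, 3)),
  error = function(e) conditionMessage(e)
)
print(msg)
```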

MeganFantes commented 5 years ago

tracing the bug:

the boot_mean object is created in dp-mean.Rmd:

boot_mean <- dpMean$new(mechanism='mechanismBootstrap', var.type='numeric', 
                        variable='income', n=10000, epsilon=0.1, rng=c(0, 750000), 
                        n.boot=n.boot)

Then we have the call:

boot_mean$release(PUMS5extract10000)

which refers to the release method of a dpMean object called boot_mean. This throws the error above.

release method of dpMean object in statistic_mean.R:

dpMean$methods(
    release = function(data, ...) {
        x <- data[, variable]
        sens <- diff(rng) / n
        .self$result <- export(mechanism)$evaluate(mean, x, sens, .self$postProcess, ...)
})
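For the values used when boot_mean is constructed above (rng=c(0, 750000), n=10000), the sensitivity computed in release works out as follows. This is just the arithmetic from the sens line, pulled out of the class.

```r
rng <- c(0, 750000)  # range passed to dpMean$new in the vignette
n <- 10000           # number of observations passed to dpMean$new

# Same expression as in release(): width of the range divided by n.
sens <- diff(rng) / n
print(sens)  # 75
```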

export(mechanism) exports all fields and methods of the mechanism passed into the object. In this case, the mechanism is mechanismBootstrap, which has the method evaluate.

evaluate method of the mechanismBootstrap class in mechanism-bootstrap.R:

mechanismBootstrap$methods(
    evaluate = function(fun, x, sens, postFun) {
        x <- censordata(x, .self$var.type, .self$rng)
        x <- fillMissing(x, .self$var.type, .self$impute.rng[0], .self$impute.rng[1])
        epsilon.part <- epsilon / .self$n.boot
        release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=.self$bootStatEval))
        std.error <- .self$bootSE(release, .self$n.boot, sens)
        out <- list('release' = release, 'std.error' = std.error)
        out <- postFun(out)
        return(out)
})

Interesting to note that ... is passed into evaluate in the release call above, but ... is not in the evaluate method signature.

According to the stack trace from the error, the problem is inside the replicate() call. replicate() evaluates the bootstrap.replication call n.boot times, so the problem is in bootstrap.replication.

bootstrap.replication function in mechanism-bootstrap.R:
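As a general illustration of why the missing ... matters (plain R behaviour, with hypothetical function names): a function without ... in its signature rejects any extra argument, while one with ... can forward it. If the caller's ... happens to be empty, nothing extra is passed and no error is raised, so the mismatch can go unnoticed until someone supplies an extra argument.

```r
# Hypothetical helpers: one without `...`, one that forwards `...` to `fun`.
no_dots <- function(fun, x) fun(x)
with_dots <- function(fun, x, ...) fun(x, ...)

x <- c(1, NA, 3)

# Extra argument is rejected because no_dots has no `...` to absorb it.
msg <- tryCatch(
  no_dots(mean, x, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)
print(msg)

# Extra argument is forwarded to mean(), so the NA is dropped.
res <- with_dots(mean, x, na.rm = TRUE)
print(res)  # 2
```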

bootstrap.replication <- function(x, n, sensitivity, epsilon, fun) {
    partition <- rmultinom(n=1, size=n, prob=rep(1 / n, n))
    max.appearances <- max(partition)
    probs <- sapply(1:max.appearances, dbinom, size=n, prob=(1 / n))
    stat.partitions <- vector('list', max.appearances)
    for (i in 1:max.appearances) {
        variance.i <- (i * probs[i] * (sensitivity^2)) / (2 * epsilon)
        stat.i <- fun(x[partition == i])
        noise.i <- dpNoise(n=length(stat.i), scale=sqrt(variance.i), dist='gaussian')
        stat.partitions[[i]] <- i * stat.i + noise.i
    }
    stat.out <- do.call(rbind, stat.partitions)
    return(apply(stat.out, 2, sum))
}

fun(x[partition == i]) calls the function that was passed in, which was bootStatEval

Here, I wanted to figure out what x[partition == i] actually means. x is a vector of values indicating the income of each person in the original dataset. partition holds, for each person, how many times they were drawn in the bootstrap sample, so partition == i is a vector of booleans that is TRUE for the people drawn exactly i times. x[partition == i] is the subset of x at the TRUE positions. The values at FALSE are dropped, not set to 0, so the result is usually shorter than x; this is ordinary R logical indexing and not part of the differentially private protocol. (I figured this out using many print statements throughout the code.)
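A small check of what logical indexing does with a vector of counts. The first part uses hand-written counts so the result is easy to follow; the second part draws counts the same way bootstrap.replication does. The data values are made up for illustration.

```r
x <- c(10, 20, 30, 40, 50)

# Hand-written bootstrap counts: how many times each element was drawn.
partition <- c(0, 2, 1, 0, 2)

once <- x[partition == 1]   # elements drawn exactly once
twice <- x[partition == 2]  # elements drawn exactly twice
print(once)   # 30
print(twice)  # 20 50

# Counts drawn as in bootstrap.replication: n draws spread uniformly over n elements.
set.seed(42)
n <- length(x)
drawn <- rmultinom(n = 1, size = n, prob = rep(1 / n, n))

# The counts always sum to n, and indexing keeps one value per TRUE position.
print(sum(drawn))
print(x[drawn == 1])
```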

bootStatEval method of the mechanismBootstrap class in mechanism-bootstrap.R:

mechanismBootstrap$methods(
    bootStatEval = function(xi) {
        fun.args <- getFuncArgs(fun, inputList=list(...), inputObject=.self)
        input.vals = c(list(x=x), fun.args)
        stat <- do.call(boot.fun, input.vals)
        return(stat)
})

I think I found the problem:

The mean function is passed into evaluate() and then nothing is done with it. Instead, the function passed into replicate() is set to fun=.self$bootStatEval.

Then, in bootstrap.replication, the function applied to x[partition == i] is bootStatEval.

In the replicate() call, we do not want to repeat bootStatEval n.boot times; we want to calculate the mean n.boot times. That is what bootstrapping is.

I think the call to bootStatEval should be hard-coded somewhere in bootstrap.replication, because bootStatEval is a sanity check (I think?), not a parameter that needs to be passed around. The function we are interested in bootstrapping (here, mean) is a parameter we do want to pass around, because we will eventually want to bootstrap other statistics.

The error happens because the body of bootStatEval refers to a variable called fun, but no such parameter is passed in. I think this fun is meant to be the mean function passed into evaluate().

(Similarly, ... is passed into evaluate() and then never used, since evaluate does not have ... in its method signature. bootStatEval also refers to ... without having it in its own signature, and I think the ... from the evaluate() call is what it is meant to receive.)

MeganFantes commented 5 years ago

fixed the bug:

in mechanism-bootstrap.R:

changed evaluate = function(fun, x, sens, postFun) { to: evaluate = function(fun, x, sens, postFun, ...) {

changed release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=.self$bootStatEval)) to: release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=fun, inputObject=.self, ...))

changed bootstrap.replication <- function(x, n, sensitivity, epsilon, fun) { to: bootstrap.replication <- function(x, n, sensitivity, epsilon, fun, inputObject, ...) {

added to the documentation of bootstrap.replication: @param inputObject the Bootstrap mechanism object on which the input function will be evaluated

changed stat.i <- fun(x[partition == i]) to: stat.i <- inputObject$bootStatEval(x[partition == i], fun, ...)

changed bootStatEval = function(xi) { to: bootStatEval = function(xi, fun, ...) {

changed input.vals = c(list(x=x), fun.args) to: input.vals = c(list(x=xi), fun.args)

changed stat <- do.call(boot.fun, input.vals) to: stat <- do.call(fun, input.vals)
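The shape of the fix, threading fun and ... from evaluate down to the function that computes the statistic, can be sketched with plain functions. Everything here is a hypothetical stand-in (stat_eval, one_replication, evaluate_sketch), with no noise, partitioning, or reference classes. The sketch uses vapply with an anonymous function instead of replicate, because the documentation for replicate cautions about what ... refers to inside the replicated expression; whether that affects the library code was not checked here.

```r
# Hypothetical stand-in for bootStatEval: apply `fun` to one resample,
# forwarding any extra arguments.
stat_eval <- function(xi, fun, ...) {
  do.call(fun, c(list(x = xi), list(...)))
}

# Hypothetical stand-in for bootstrap.replication: one plain resample, no noise.
one_replication <- function(x, fun, ...) {
  resampled <- x[sample.int(length(x), replace = TRUE)]
  stat_eval(resampled, fun, ...)
}

# Hypothetical stand-in for evaluate: `fun` and `...` are passed all the way down.
evaluate_sketch <- function(fun, x, n_boot, ...) {
  vapply(seq_len(n_boot), function(i) one_replication(x, fun, ...), numeric(1))
}

set.seed(1)
out <- evaluate_sketch(mean, c(1, 2, NA, 4), n_boot = 5, na.rm = TRUE)
print(out)
```

Swapping mean for another statistic such as median only changes the first argument of evaluate_sketch, which is the point of passing fun around instead of hard-coding it.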

MeganFantes commented 5 years ago

Now the dp-mean vignette runs, but the bootstrapped mean will occasionally return NaN as the result

MeganFantes commented 5 years ago

The NaNs come from the partition vector created in bootstrap.replication. The loop runs over every i from 1 to max(partition), and sometimes no observation was drawn exactly i times. In that case x[partition == i] is empty and the mean of an empty vector is NaN.
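A minimal reproduction of the empty-partition case, using hand-written counts and made-up data. The counts 1 and 3 occur but 2 does not, so the loop index i = 2 selects nothing.

```r
x <- c(10, 20, 30, 40, 50)

# Hand-written bootstrap counts summing to 5: no element was drawn exactly twice.
partition <- c(0, 3, 1, 0, 1)

empty <- x[partition == 2]  # zero-length numeric vector
stat <- mean(empty)         # mean of nothing is NaN

print(length(empty))
print(stat)
```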

MeganFantes commented 5 years ago

Fixed all problems. Added validation in bootstrap.replication to ensure it is only calculating a statistic for a partition that contains values.
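The exact validation is not shown in this thread. One way such a guard could look, as a sketch only: partition_stats is a hypothetical helper that keeps the i * stat.i weighting from bootstrap.replication but leaves out the noise, and it skips any count that no observation received.

```r
# Hypothetical helper: weighted sum of per-partition statistics, noise omitted.
partition_stats <- function(x, partition, fun) {
  total <- 0
  for (i in seq_len(max(partition))) {
    part <- x[partition == i]
    # Guard: skip counts that no observation received, so fun() never sees
    # an empty vector.
    if (length(part) == 0) next
    total <- total + i * fun(part)
  }
  total
}

x <- c(10, 20, 30, 40, 50)
partition <- c(0, 3, 1, 0, 1)  # count 2 is missing

# i = 1 gives mean(c(30, 50)) = 40; i = 2 is skipped; i = 3 gives 3 * 20 = 60.
res <- partition_stats(x, partition, mean)
print(res)  # 100
```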