Fix bug in dpMean$release to fix "variable 'fun' not found" error

MeganFantes commented 5 years ago

Bug found 6/4

error:

When running the dp-mean.Rmd vignette, the line boot_mean$release(PUMS5extract10000) throws the error:

Error in formals(targetFunc) : object 'fun' not found

source:

mechanism-bootstrap.R, line 41:

mechanismBootstrap$methods(
    bootStatEval = function(xi) {
        fun.args <- getFuncArgs(fun, inputList=list(...), inputObject=.self)
        input.vals = c(list(x=x), fun.args)
        stat <- do.call(boot.fun, input.vals)
        return(stat)
})

getFuncArgs uses a variable called fun, but there is no fun parameter in the method signature.
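A minimal reproduction of this kind of failure outside the library may help. The names get.func.args and boot.stat.eval are hypothetical stand-ins, not the PSI-Library functions. Because R evaluates arguments lazily, the missing fun is only looked up once formals() touches it, which would explain why the error message points at formals(targetFunc).

```r
# Hypothetical stand-in for getFuncArgs: inspects the target function's formals.
get.func.args <- function(targetFunc) names(formals(targetFunc))

# Mirrors the buggy method: `fun` is used in the body but is not a parameter.
boot.stat.eval <- function(xi) get.func.args(fun)

# The lookup of `fun` fails when formals() forces the argument.
msg <- tryCatch(boot.stat.eval(1:3), error = function(e) conditionMessage(e))
print(msg)

# With `fun` in the signature, the same call succeeds.
boot.stat.eval.fixed <- function(xi, fun) get.func.args(fun)
print(boot.stat.eval.fixed(1:3, mean))
```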

MeganFantes commented 5 years ago

tracing the bug:

the boot_mean object is created in dp-mean.Rmd:

boot_mean <- dpMean$new(mechanism='mechanismBootstrap', var.type='numeric', 
                        variable='income', n=10000, epsilon=0.1, rng=c(0, 750000), 
                        n.boot=n.boot)

Then we have the call:

boot_mean$release(PUMS5extract10000)

which refers to the release method of a dpMean object called boot_mean. This throws the error above.

release method of dpMean object in statistic_mean.R:

dpMean$methods(
    release = function(data, ...) {
        x <- data[, variable]
        sens <- diff(rng) / n
        .self$result <- export(mechanism)$evaluate(mean, x, sens, .self$postProcess, ...)
})

export(mechanism) exports all fields and methods of the mechanism passed into the object. In this case, the mechanism is the mechanismBootstrap, which has the method evaluate

evaluate method of the mechanismBootstrap class in mechanism-bootstrap.R:

mechanismBootstrap$methods(
    evaluate = function(fun, x, sens, postFun) {
        x <- censordata(x, .self$var.type, .self$rng)
        x <- fillMissing(x, .self$var.type, .self$impute.rng[0], .self$impute.rng[1])
        epsilon.part <- epsilon / .self$n.boot
        release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=.self$bootStatEval))
        std.error <- .self$bootSE(release, .self$n.boot, sens)
        out <- list('release' = release, 'std.error' = std.error)
        out <- postFun(out)
        return(out)
})

Interesting to note that release passes ... into evaluate, but ... is not in evaluate's method signature.
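To illustrate why the missing ... matters, here is a small sketch with two hypothetical functions (no.dots and with.dots are not part of the library). When no extra arguments are supplied, an empty ... forwards harmlessly, which may be why this went unnoticed. As soon as an extra argument is supplied, a signature without ... rejects it.

```r
# Hypothetical functions: one signature omits `...`, the other accepts it.
no.dots   <- function(fun, x) fun(x)
with.dots <- function(fun, x, ...) fun(x, ...)

# Extra arguments reach `fun` only when the signature includes `...`.
ok <- with.dots(mean, c(1, NA, 3), na.rm = TRUE)
print(ok)

# Without `...` in the signature, the extra argument is an error.
msg <- tryCatch(no.dots(mean, c(1, NA, 3), na.rm = TRUE),
                error = function(e) conditionMessage(e))
print(msg)
```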

According to the stack trace, the error surfaces inside the replicate() call, which runs bootstrap.replication n.boot times. The underlying problem is in bootstrap.replication.

bootstrap.replication function in mechanism-bootstrap.R:

bootstrap.replication <- function(x, n, sensitivity, epsilon, fun) {
    partition <- rmultinom(n=1, size=n, prob=rep(1 / n, n))
    max.appearances <- max(partition)
    probs <- sapply(1:max.appearances, dbinom, size=n, prob=(1 / n))
    stat.partitions <- vector('list', max.appearances)
    for (i in 1:max.appearances) {
        variance.i <- (i * probs[i] * (sensitivity^2)) / (2 * epsilon)
        stat.i <- fun(x[partition == i])
        noise.i <- dpNoise(n=length(stat.i), scale=sqrt(variance.i), dist='gaussian')
        stat.partitions[[i]] <- i * stat.i + noise.i
    }
    stat.out <- do.call(rbind, stat.partitions)
    return(apply(stat.out, 2, sum))
}

fun(x[partition == i]) calls the function that was passed in, which was bootStatEval

Here, I wanted to figure out what x[partition == i] actually means. x is a vector of values indicating the income of each person in the original dataset. partition == i is a vector of booleans. x[partition == i] should be a subset of x, with the values at the indices with TRUE at the original value, and the rest at 0. This is mostly true, except the values at FALSE seem to be random values. I think this is because of a differentially private protocol? (I figured this out using many print statements throughout the code)
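For reference, base R logical indexing keeps only the elements at TRUE positions. The result is shorter than x, not zero-filled, so x[partition == i] is the set of values that were drawn exactly i times. A small sketch with made-up incomes (the data and seed are assumptions for illustration):

```r
set.seed(42)
x <- c(100, 200, 300, 400, 500)   # made-up incomes
n <- length(x)

# One multinomial draw: an n x 1 matrix of how many times each row was sampled.
partition <- rmultinom(n = 1, size = n, prob = rep(1 / n, n))

# Logical mask: TRUE where a row was drawn exactly once.
keep <- partition == 1

# Only the TRUE positions survive; FALSE positions are dropped, not set to 0.
print(x[keep])
print(length(x[keep]) == sum(keep))
```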

bootStatEval method of the mechanismBootstrap class in mechanism-bootstrap.R:

mechanismBootstrap$methods(
    bootStatEval = function(xi) {
        fun.args <- getFuncArgs(fun, inputList=list(...), inputObject=.self)
        input.vals = c(list(x=x), fun.args)
        stat <- do.call(boot.fun, input.vals)
        return(stat)
})

I think I found the problem:

The mean function is passed into evaluate() and then nothing is done with it. Instead, the function passed into replicate() is set to fun=.self$bootStatEval.

Then, in bootstrap.replication, the function applied to x[partition == i] is bootStatEval.

In the replicate() call, we do not want to repeat bootStatEval n.boot times; we want to calculate the mean n.boot times, which is what bootstrapping is.

I think the call to bootStatEval should be hard-coded somewhere in bootstrap.replication: bootStatEval is a sanity check (I think?), not a parameter that needs to be passed around. mean, the function we are interested in bootstrapping, is a parameter we do want to pass around, because we will eventually want to bootstrap other statistics.

The error happens because bootStatEval expects a parameter called fun, but no such parameter is passed in. I think this fun is the mean function passed into evaluate().

(Similarly, ... is passed into evaluate() and then never used, because evaluate does not have ... in its method signature. Eventually bootStatEval will look for ... and will not find it, and I think the ... from the evaluate() call is the one it needs.)

MeganFantes commented 5 years ago

fixed the bug:

in mechanism-bootstrap.R:

changed evaluate = function(fun, x, sens, postFun) { to: evaluate = function(fun, x, sens, postFun, ...) {

Add the ... operator to the method signature, so we can pass it to getFuncArgs() later

changed release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=.self$bootStatEval)) to: release <- replicate(.self$n.boot, bootstrap.replication(x, n, sens, epsilon.part, fun=fun, inputObject = .self, ...))

Change the fun to the input function (in this case, mean)
Add a parameter inputObject to pass the bootstrap mechanism object to bootstrap.replication so we can call bootStatEval later
Add the ... operator to pass to bootStatEval later

changed bootstrap.replication <- function(x, n, sensitivity, epsilon, fun) { to: bootstrap.replication <- function(x, n, sensitivity, epsilon, fun, inputObject, ...) { and added: @param inputObject the Bootstrap mechanism object on which the input function will be evaluated

Add a parameter inputObject so we can call bootStatEval
Add the ... operator
Add a line to the documentation noting the new input parameter

changed stat.i <- fun(x[partition == i]) to: stat.i <- inputObject$bootStatEval(x[partition == i], fun, ...)

Add input parameters that will be passed to bootStatEval

changed bootStatEval = function(xi) { to: bootStatEval = function(xi, fun, ...) {

Add input parameters to the method signature that will be passed to bootStatEval

changed input.vals = c(list(x=x), fun.args) to: input.vals = c(list(x=xi), fun.args)

Update the variable name to xi instead of x

changed stat <- do.call(boot.fun, input.vals) to: stat <- do.call(fun, input.vals)

Update the variable name to fun instead of boot.fun
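The overall shape of these changes can be sketched with a toy Reference class. This is a simplified illustration, not the library code: ToyBootstrap and toy.replication are hypothetical names, and censoring, noise, and getFuncArgs are left out. The sketch loops with vapply rather than replicate, because replicate wraps its expression in a function(...) of its own, so a ... written inside that expression does not refer to the caller's arguments.

```r
library(methods)

# Hypothetical stand-in for mechanismBootstrap (no noise, no censoring).
ToyBootstrap <- setRefClass(
    "ToyBootstrap",
    fields = list(n.boot = "numeric"),
    methods = list(
        # `fun` and `...` now arrive as explicit arguments.
        bootStatEval = function(xi, fun, ...) {
            do.call(fun, c(list(x = xi), list(...)))
        },
        # `...` is in the signature so it can be forwarded.
        evaluate = function(fun, x, ...) {
            vapply(seq_len(n.boot), function(i) {
                toy.replication(x, fun = fun, inputObject = .self, ...)
            }, numeric(1))
        }
    )
)

# Hypothetical stand-in for bootstrap.replication: one resample, one statistic.
toy.replication <- function(x, fun, inputObject, ...) {
    xi <- sample(x, length(x), replace = TRUE)
    inputObject$bootStatEval(xi, fun, ...)
}

set.seed(1)
mech <- ToyBootstrap$new(n.boot = 5)
boot.means <- mech$evaluate(mean, c(1, 2, 3, 4, NA), na.rm = TRUE)
print(boot.means)
```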

MeganFantes commented 5 years ago

Now the dp-mean vignette runs, but the bootstrapped mean will occasionally return NaN as the result

MeganFantes commented 5 years ago

The NaNs come from how the partition vector is created in bootstrap.replication. Sometimes one of the partitions is empty, and the statistic is then computed on an empty vector.
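A small illustration of how this can arise, using made-up draw counts (the values are assumptions, not output from the vignette): the loop runs over 1:max(partition), but no row needs to have been drawn exactly i times for every i in that range.

```r
x <- c(10, 20, 30, 40)
partition <- c(1, 0, 3, 0)   # made-up draw counts; they sum to length(x)

# No row was drawn exactly twice, so the i = 2 partition is empty
# and mean() of an empty numeric vector is NaN.
stats <- sapply(seq_len(max(partition)), function(i) mean(x[partition == i]))
print(stats)
```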

MeganFantes commented 5 years ago

Fixed all problems. Added validation in bootstrap.replication to ensure it is only calculating a statistic for a partition that contains values.
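One possible shape for such a check, as a sketch only: the actual validation added to bootstrap.replication is not shown in this thread, and safe.stat is a hypothetical helper. It skips empty partitions instead of evaluating the statistic on them.

```r
# Hypothetical guard: return NULL for an empty partition instead of calling fun.
safe.stat <- function(xi, fun, ...) {
    if (length(xi) == 0) return(NULL)
    fun(xi, ...)
}

x <- c(10, 20, 30, 40)
partition <- c(1, 0, 3, 0)   # made-up draw counts; the i = 2 partition is empty

stats <- lapply(seq_len(max(partition)),
                function(i) safe.stat(x[partition == i], mean))

# Drop the skipped partitions before combining.
stats <- Filter(Negate(is.null), stats)
print(unlist(stats))
```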

privacytoolsproject / PSI-Library