Bounded Functions fail when large dataset given

chinmayshah99 commented 4 years ago

When I use BoundedMean<int> and pass a large data to the function, due to large data, the int overflows and gives the output as the lowerbound.

How to reproduce:

#include <vector>
#include <iostream>
#include "absl/flags/flag.h"
#include "absl/strings/str_format.h"
#include "algorithms/util.h"
#include "algorithms/bounded-mean.h"
#include "proto/util.h"
#include "base/statusor.h"
#include "proto/confidence-interval.pb.h"
#include "proto/data.pb.h"
using absl::PrintF;
using differential_privacy::GetValue;
namespace dp=differential_privacy;
int main(int argc, char **argv) {
  auto mean_algorithm = dp::BoundedMean<int>::Builder().SetEpsilon(1.0).SetLower(0).SetUpper(10).Build().ValueOrDie();
  for (const int v : std::vector<int>(600000000, 5)) {
    mean_algorithm->AddEntry(v);
  }
  std::cout<< dp::GetValue<double>(mean_algorithm->PartialResult().ValueOrDie())<<std::endl;
}

Possible fix: Use SafeAdd() when adding each element to the vector and whenever it throws an error, generate an error.

dasmdasm commented 4 years ago

Fortunately, this isn't a DP violation because the noise additional step can also overflow. That said, it's not great from an accuracy perspective.

We can't return an error on overflow because that would be a deterministic data-dependent behavior. We can, however, cap everything at the maximum and minimum values and prevent them from wrapping around. That could potentially lead to data loss (once we hit the maximum value, adding more input does nothing), but it's more accurate than returning very large-magnitude negative values.

chinmayshah99 commented 4 years ago

Keeping the overflow caused by noise addition aside, though it is a deterministic behavior, it would be a better idea to raise some kind of warning as the user would keep on adding data without realizing its already overflowed. This is the decision we need to make.

dasmdasm commented 4 years ago

I don't think we're willing to raise an error or return a warning in violation of the DP guarantees. Doing that would compromise differential privacy for every client of the library, whether they want that behavior or not. Instead, I'd suggest defaulting to 64-bit integers - they're substantially harder to overflow than 32-bit ones.

dibakch commented 3 years ago

Closing this as the workaround here is to use a larger data type. Additionally we also added a wrap-around for the current implementation, so that this should work by now.

google / differential-privacy

Bounded Functions fail when large dataset given #40