opendp / smartnoise-core

Differential privacy validator and runtime
MIT License
290 stars 33 forks source link

fix gaussian scaling/accuracy, resolve #248, integer queries #315

Closed Shoeboxam closed 3 years ago

Shoeboxam commented 3 years ago

This PR fixes two privacy issues-

  1. A sign in the analytic gaussian was not being set correctly, resulting in reduced sigma
  2. The noise scale in the gaussian mechanism should not be squared

This PR fixes a number of issues with accuracy-

  1. Accuracies for disjoint queries were incorrectly multiplied by the category length.
  2. Accuracies for the gaussian mechanism were incorrectly using values for the analytic gaussian
  3. Accuracies would not coerce integers to floats when necessary.

This PR also adds shapeness checks into the mechanism runtimes. Typically we don't write checks in the runtime- the validator handles this logic, and the runtime simply executes the computation. Checks that fail at runtime also have a tendency to leak information about the data. However, in this case, these checks provide a second layer of protection against mis-shapen data that was manually inserted into the graph. This comes up often when inserting pre-aggregated data from other systems. It is better to fail than to only noise the intersection of the actual and expected shapes.

This PR also computes integral sensitivities when possible, and improves support for integer-valued queries.

mikephelan commented 3 years ago

validator-rust/src/components/simple_geometric_mechanism.rs was the decision made to no longer provide support for multi-dimensional arrays? for dictionaries? what will happen if someone calls the geometric mechanism with a multi-dimensional array as an argument? I do not see any error handling for that situation.

mikephelan commented 3 years ago

validator-rust/src/utilities/mod.rs are you no longer measuring slope and ensuring Lipshitz continuity in sensitivity scaling? What is the impact of leaving that out?

mikephelan commented 3 years ago

validator-rust/src/components/gaussian_mechanism.rs what is the difference between the meaning of delta versus delta_thr ? throughout gaussian_mechanism.rs, you fluctuate between using >= or > as your pivot point when using a formula containing delta and delta_thr as your pivot point

mikephelan commented 3 years ago

runtime-rust/src/components/mechanisms.rs the way in which you have returned the error when the sensitivity list does not have the same shape as the original list does not appear to attempt any recovery. Should you be using panic! if you are not attempting to recover? in Evaluable for Exponential Mechanism, your sequence of conditional issuances of errors is not DRY and attempts no recovery

mikephelan commented 3 years ago

runtime-rust/src/components/cast.rs runtime-rust/src/components/mechanisms.rs You appear to have created some conversion utility code around Float numbers, in utility areas. Is it possible that over time the decision to make these helper functions in utility areas might create a need to create additional conversion utilities for other types, ultimately making for utility code that is verbose and requires maintenance over time? What is driving the decision to cease using library-based functionality to treat float numbers?

Shoeboxam commented 3 years ago

https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721940752 Our mechanisms only work on multidimensional arrays (and have always only worked on multidimensional arrays). The additional checking here is to ensure that the expected shape of the data matches the actual shape of the data- if you are working completely within the library, this is always true. However, if you are substituting pre-aggregated data in from outside of the library, it is easy to make a mistake and transpose the data, or leave off an axis. If you make a mistake like this, it is much better to error than to just not noise data outside of the expected shape.

The system may appear to work with dictionaries (in the context of partitioning), but internally the mechanisms are mapped over the values. If the data input is not an array, it fails when .array()? is called.

I just added a check to ensure the dimensionality of the input data is no greater than 2.

Shoeboxam commented 3 years ago

Comment 2 on: https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721940752 At data loading time, axis zero represents users. Upon aggregation, axis zero represents disjoint aggregations. You can construct disjoint aggregations through two methods:

  1. histograms
  2. via partitioning, applying aggregators on partitions (there is a mapping shorthand here), and unioning aggregations

In both cases, the sensitivity is scaled to compensate. This is why all of the mechanisms (except for the exponential) work with multidimensional arrays, not just scalar values.

Shoeboxam commented 3 years ago

https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721941377 Yes, there is a system for tracking lipschitz constants, but it is currently not used- when a user attempts to apply transformations to aggregated data, the system throws an error instead of modifying the lipschitz constants. Since the lipschitz constants are always one, that code was always a no-op.

Commented because sensitivities may now be (lossless!) integers and the snip only works with floats. I fixed it and did some testing. For now, lipschitz constants remain one, but if they were to change, the sensitivity would be corrected.

Shoeboxam commented 3 years ago

https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721941846 The strictness of the equality should not matter, because strict equality is checked first:

let alpha = if delta == delta_thr {
    1.
} else {
    ...

Regardless, I changed the code to be consistent, and added comments and links to the paper throughout.

Shoeboxam commented 3 years ago

https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721942356 I never intentionally panic under any condition, because I always prefer the error to bubble up to the calling process where the user can handle it. Should I ever panic (even unintentionally, like an out-of-bounds access or invalid cast), the python interpreter itself crashes. So I make every effort to create results- even in many situations where I don't anticipate panics being possible due to program state. The stack traces for these results are passed all the way back through FFI and embedded in an exception created in the bindings language.

Shoeboxam commented 3 years ago

https://github.com/opendifferentialprivacy/smartnoise-core/pull/315#issuecomment-721942930 In cases where I use the utility function, I want either a numeric or an error if the data is not already numeric. I also use the utility function in the validator, and the validator does not have a dependency on the runtime. So I'd need to pull that function up into the validator, and expose it in the crate API. If we weren't investing so much in the opendp library, I would make a case to merge the validator and runtime into one crate, and gate the entire runtime behind a conditional compilation flag. This would also let me make a significant amount of the crate private. This would not have a particularly large effect on reducing the base utility impls though.

ecowan commented 3 years ago

#315 (comment) I never intentionally panic under any condition, because I always prefer the error to bubble up to the calling process where the user can handle it. Should I ever panic (even unintentionally, like an out-of-bounds access or invalid cast), the python interpreter itself crashes. So I make every effort to create results- even in many situations where I don't anticipate panics being possible due to program state. The stack traces for these results are passed all the way back through FFI and embedded in an exception created in the bindings language.

This is a very important factor for us to keep in mind - having unintentionally caused the python interpreter to crash and not understanding why, we should never panic in Rust, since it makes for a confusing user experience.

ecowan commented 3 years ago

@Shoeboxam This looks great, I've also run all tests as a sanity check and can confirm everything is running and passing.