Integrate: answer to question "can observations be made public" into preprocessors and release text

raprasad commented 3 years ago

[ ] 1. create a checklist table of current stats and if/how the computation changes when the # of observations cannot be made public

See google doc: https://docs.google.com/document/d/1xUihcjh4zmfnhG0-2EC-uG-qzpde8WXphRksB0NvHe8/edit#

(Redo steps below after doc discussion)

[ ] ~~2. update the StatSpec class (stat_spec.py) to include a variable indicating is_dataset_size_public~~
[ ] ~~3. ^ update the computation chains for existing stats appropriately.~~
- [ ] ~~e.g. if the is_dataset_size_public == True, update the chain, use a different chain, etc.~~
- [ ] ~~include tests for each stat. (Check that if the dataset size is private then more epsilon is used, etc.)~~
[ ] ~~4. Integrate into larger workflow. e.g. ValidateReleaseUtil.build_stat_specs()~~
- ~~ValidateReleaseUtil.__init__ : add self.is_dataset_size_public = None~~
- ~~ValidateReleaseUtil.run_preliminary_steps: set self.is_dataset_size_public to True or False~~
- ~~Add function DatasetInfo.is_dataset_size_public()~~
  - ~~similar to get_dataset_size()~~
  - ~~except finds answer to the dataset question within DepositorSetupInfo~~
- ~~ValidateReleaseUtil.build_stat_specs(), use self.is_dataset_size_public when building the StatSpec objects~~

ecowan commented 2 years ago

There are two avenues here, each with its own set of logical steps:

Using DP Count:

- When the user selects private count = True, the "create statistic" view should be pre-populated with a row for a DP count, the result of which is passed into any other statistics that the user selects.
- If the user selects private count = True and also selects a count in "create statistic", that selection should override the pre-populated row; the count only needs to be calculated once.
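A minimal sketch of the pre-populate and override rule described above. `StatRow`, `initial_rows`, and `add_row` are hypothetical names made up for illustration and are not part of the dpcreator code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StatRow:
    """Hypothetical stand-in for one row in the "create statistic" view."""
    statistic: str
    variable: str
    prepopulated: bool = False


def initial_rows(count_is_private: bool, variable: str) -> List[StatRow]:
    # When the count is private, start with a pre-populated DP count row.
    if count_is_private:
        return [StatRow("count", variable, prepopulated=True)]
    return []


def add_row(rows: List[StatRow], new_row: StatRow) -> List[StatRow]:
    # A user-selected count replaces the pre-populated one,
    # so the DP count is only calculated once.
    if new_row.statistic == "count":
        rows = [r for r in rows
                if not (r.statistic == "count" and r.prepopulated)]
    return rows + [new_row]


rows = initial_rows(count_is_private=True, variable="age")
rows = add_row(rows, StatRow("mean", "income"))
rows = add_row(rows, StatRow("count", "income"))
```

After these calls, `rows` holds the user's mean and the user's count; the pre-populated count row has been dropped.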

Using User Estimation:

- One of the views (likely "create statistic") needs a way for the user to specify their best estimate of the count, which is then passed to the backend and used in the computation chains.
- If a DP count is also requested, we would need to decide which takes precedence.

@raprasad @ekraffmiller

Thanks to @Shoeboxam for the discussion

ecowan commented 2 years ago

Needed for computing DP counts:

- Select any one of the columns in the data set
- Set a parameter (epsilon/10, etc.) that determines how much of the budget should be used to calculate the count estimate
- Construct a new class with similar functionality to ValidateReleaseUtil that returns a DP count only
- The result of this class needs to be passed into ValidateReleaseUtil to be used in the resize step of each statistic
- ValidateReleaseUtil also needs to lower the maximum_epsilon based on how much was used by the DP count
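The budget arithmetic in the list above could look roughly like this. It is a sketch only: `dp_count` and `split_budget` are hypothetical helpers, the 0.1 fraction mirrors the "epsilon/10" example, and the noise is a plain floating-point Laplace draw standing in for the library's actual count measurement, so it is not suitable for a real release.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(column, epsilon, rng=None):
    """Noisy row count of one column (illustrative, not production-safe)."""
    rng = rng or random.Random()
    # A count has sensitivity 1 when one record is added or removed,
    # so the Laplace scale is 1 / epsilon.
    noisy = len(column) + laplace_noise(1.0 / epsilon, rng)
    return max(0, round(noisy))


def split_budget(maximum_epsilon, count_fraction=0.1):
    """Return (epsilon for the DP count, epsilon left for other statistics)."""
    count_epsilon = maximum_epsilon * count_fraction
    return count_epsilon, maximum_epsilon - count_epsilon


count_epsilon, remaining_epsilon = split_budget(maximum_epsilon=1.0)
size_estimate = dp_count(list(range(1000)), count_epsilon)
```

Here `size_estimate` is what would feed the resize step, and `remaining_epsilon` is the lowered maximum available to the other statistics.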

raprasad commented 2 years ago

An old slide. We're not getting user input--yet.

This ticket is for implementing the green box labeled: "Use privacy budget to capture size"

ecowan commented 2 years ago

@raprasad Why don't we approach this incrementally and first build a feature where the user has to answer yes? That way, we can first develop the part of the code that takes the estimate from the front end and passes it into the process. Once this is merged, we can add functionality for the case where they say "no".

ecowan commented 2 years ago

Another option is to create 2 analysis objects, one for the dp count and one for the rest, and split the budget between them. This way we could reuse the existing ValidateReleaseUtil class to compute what we need, rather than creating new classes to compute the dp count separately.

The workflow could look like this:

1. User selects "count is private"
2. Make two API calls to create new analyses, and link them to each other
3. When the dp count analysis completes, save the dp count to the analysis object
4. When the second analysis runs, look to the linked analysis object and take the dp count from it
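A rough sketch of that workflow, with the two analyses modeled as plain in-memory objects. The `Analysis` class, its fields, and the 0.1 budget fraction are assumptions made for illustration; the real analysis objects and API calls would differ.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Analysis:
    """Hypothetical stand-in for an analysis object."""
    name: str
    epsilon: float
    dp_count: Optional[int] = None
    # Excluded from repr/eq because the two analyses point at each other.
    linked: Optional["Analysis"] = field(default=None, repr=False, compare=False)


def create_linked_analyses(total_epsilon: float,
                           count_fraction: float = 0.1) -> Tuple[Analysis, Analysis]:
    # Split the budget between the two analyses and link them to each other.
    count_epsilon = total_epsilon * count_fraction
    count_analysis = Analysis("dp_count", count_epsilon)
    stats_analysis = Analysis("statistics", total_epsilon - count_epsilon)
    count_analysis.linked = stats_analysis
    stats_analysis.linked = count_analysis
    return count_analysis, stats_analysis


def dataset_size_for(analysis: Analysis) -> int:
    # The second analysis reads the saved DP count from its linked analysis.
    linked = analysis.linked
    if linked is None or linked.dp_count is None:
        raise ValueError("linked DP count analysis has not completed")
    return linked.dp_count


count_analysis, stats_analysis = create_linked_analyses(total_epsilon=1.0)
count_analysis.dp_count = 980  # saved when the DP count analysis completes
size = dataset_size_for(stats_analysis)
```

Raising an error when the count is not yet saved is one possible choice; the second analysis could instead wait or be queued until the first completes.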

opendp / dpcreator

Integrate: answer to question "can observations be made public" into preprocessors and release text #295