opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License

Code snippet in documentation does not work #573

Closed TedTed closed 1 year ago

TedTed commented 1 year ago

I tried to run the code in the "Preprocessor hints" section of this documentation page, and it fails for multiple reasons:

Please install mbi with:
   pip install git+https://github.com/ryan112358/private-pgm.git
Spent 0.5 epsilon on preprocessor, leaving 0.5 for training
Fitting with 367200 dimensions
Traceback (most recent call last):
  File "approx_bounds_synth.py", line 15, in <module>
    nullable=True
  File "/home/damien/.pyenv/versions/3.7.16/lib/python3.7/site-packages/snsynth/base.py", line 105, in fit_sample
    **kwargs
  File "/home/damien/.pyenv/versions/3.7.16/lib/python3.7/site-packages/snsynth/mst/mst.py", line 97, in fit
    domain = Domain(colnames, cards)
NameError: name 'Domain' is not defined

joshua-oss commented 1 year ago

Thanks for reporting this. We will update the documentation and consider how to best alert people when they're trying to run a synthesizer that doesn't have the correct dependencies installed.
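One possible way to alert people, shown here only as a minimal sketch and not as what smartnoise-synth actually does: resolve the optional dependency at the start of fit and raise immediately with the install command, rather than printing a hint and continuing until a NameError appears. The names MissingDependencyError, require_module, and fit_mst_like are hypothetical and used for illustration only.

```python
import importlib


class MissingDependencyError(ImportError):
    """Raised when a synthesizer's optional dependency is not installed."""


def require_module(module_name, install_hint):
    """Import and return module_name, or fail right away with install instructions."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"This synthesizer requires '{module_name}'. Install it with:\n"
            f"    {install_hint}"
        ) from exc


def fit_mst_like():
    # Hypothetical stand-in for a synthesizer's fit(): the dependency is
    # resolved before any preprocessing, so nothing runs (and no epsilon is
    # spent) when the dependency is missing.
    mbi = require_module(
        "mbi", "pip install git+https://github.com/ryan112358/private-pgm.git"
    )
    return mbi
```

With a guard like this, the run in the report would stop with a single error naming mbi before the "Spent 0.5 epsilon on preprocessor" step, instead of failing later on Domain.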

Regarding "why isn't it part of the automatically installed dependencies?": there is a tension between having all synthesizers work out of the box and installing only what's needed, which reduces the attack surface. This isn't to say that smartnoise-synth's current defaults are particularly coherent or intentional, but it is an important decision we are trying to think through.

Considering that these synthesizers are often run in "eyes-off" environments with strict security controls and especially sensitive data, we're leaning towards making the defaults as minimal as possible. For example, maybe pip install smartnoise-synth installs only mwem, or even no synthesizers at all, and people need to say pip install smartnoise-synth[mst, aim, mwem] to get specific synthesizers. There would also be an option like pip install smartnoise-synth[all], which would install all synthesizers, supporting data scientists who need to compare them to decide which is the best option to run in the eyes-off environment. The security review and threat model could then focus on the specific synthesizer selected in the evaluation phase, and in the production environment the pip install would need to specify that synthesizer.

Basically, this makes the onboarding experience slightly more cumbersome for people kicking the tires (they have to know the pip install smartnoise-synth[all] syntax), in exchange for making it marginally more difficult for devops people deploying code in eyes-off environments to shoot themselves in the foot.
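To make the per-synthesizer extras idea concrete, here is a rough sketch of how such a mapping could be built in the shape that setuptools accepts for extras_require. Nothing here reflects smartnoise-synth's actual packaging: the build_extras helper is hypothetical, and the requirement lists are assumptions for illustration (only the mst/mbi pairing comes from the traceback above).

```python
def build_extras(per_synth):
    """Return an extras_require-style dict, adding an 'all' extra.

    per_synth maps a synthesizer name to the requirement strings it needs.
    """
    extras = {name: sorted(set(reqs)) for name, reqs in per_synth.items()}
    # 'all' is the union of every synthesizer's requirements.
    extras["all"] = sorted({req for reqs in per_synth.values() for req in reqs})
    return extras


# Illustrative requirement lists only (assumed, not the real dependency sets).
EXTRAS = build_extras({
    "mwem": [],
    "mst": ["mbi"],
    "aim": ["mbi"],
})

# A setup.py could then pass this as: setuptools.setup(..., extras_require=EXTRAS)
```

If extras like these existed, the brackets would need quoting in most shells, for example pip install "smartnoise-synth[mst,aim]".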

Since this is something we are still thinking through, and your team has a lot of real-world experience, your feedback is welcome.

joshua-oss commented 1 year ago

Updated and pushed docs to docs.smartnoise.org