Code snippet in documentation does not work

opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data

MIT License

I tried to run the code in the "Preprocessor hints" section of this documentation page, and it fails for multiple reasons:

The categorical_columns.remove(['income', 'age') line is a syntax error
Even when adding a ] or removing the [, the code is still invalid — instead, one must use two different calls to remove, one with 'income' as the sole argument and one with 'age' as the sole argument.
The disjoint_set dependency, used in mst.py, is not installed automatically when running pip install smartnoise-synth, so the code snippet fails with ModuleNotFoundError: No module named 'disjoint_set'.
After running pip install disjoint-set, it fails with a different import error: ModuleNotFoundError: No module named 'networkx'
After running pip install networkx, it fails for a different reason:

Please install mbi with:
   pip install git+https://github.com/ryan112358/private-pgm.git
Spent 0.5 epsilon on preprocessor, leaving 0.5 for training
Fitting with 367200 dimensions
Traceback (most recent call last):
  File "approx_bounds_synth.py", line 15, in <module>
    nullable=True
  File "/home/damien/.pyenv/versions/3.7.16/lib/python3.7/site-packages/snsynth/base.py", line 105, in fit_sample
    **kwargs
  File "/home/damien/.pyenv/versions/3.7.16/lib/python3.7/site-packages/snsynth/mst/mst.py", line 97, in fit
    domain = Domain(colnames, cards)
NameError: name 'Domain' is not defined

At that point, running the suggested command (pip install git+https://github.com/ryan112358/private-pgm.git) fixes the error, and I can run the code snippet. But then, why is the "Please install mbi with:" message simply printed to stdout, rather than raised as an exception so that the end user sees only that message? Furthermore, why isn't mbi part of the automatically installed dependencies?
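For the first two failures listed above, here is a minimal sketch of a removal that works, assuming categorical_columns is a plain Python list. Only 'income' and 'age' come from the documentation snippet; the other column names are made up for illustration.

```python
# Hypothetical column list; only 'income' and 'age' appear in the docs snippet.
categorical_columns = ["age", "workclass", "education", "income"]

# list.remove() takes one element, so passing a list looks for an element
# equal to that whole list and raises ValueError when none is found.
try:
    categorical_columns.remove(["income", "age"])
except ValueError as exc:
    print(f"remove() with a list argument fails: {exc}")

# One call per column name works.
for col in ("income", "age"):
    categorical_columns.remove(col)

print(categorical_columns)
```

A list comprehension that filters out both names would work as well; the loop is just the smallest change to the original line.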

Thanks for reporting this. We will update the documentation and consider how to best alert people when they're trying to run a synthesizer that doesn't have the correct dependencies installed.
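One common way to alert people is to raise an ImportError that carries the install hint, instead of printing the hint and continuing. The sketch below shows that general pattern only; it is not the current smartnoise-synth code, and require_module and the package name in the demo are hypothetical.

```python
import importlib


def require_module(module_name, install_hint):
    """Import an optional dependency, or raise ImportError with an install hint.

    Hypothetical helper for illustration; not part of smartnoise-synth.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so it is caught too.
        raise ImportError(
            f"The '{module_name}' package is required for this synthesizer. "
            f"Install it with:\n    {install_hint}"
        ) from exc


# Demo with a made-up package name, so the failure path is always taken.
try:
    require_module(
        "hypothetical_missing_pkg_xyz",
        "pip install hypothetical-missing-pkg-xyz",
    )
except ImportError as exc:
    message = str(exc)
    print(message)
```

With this approach the user sees a single error that names the missing package and the command to run, rather than a later NameError from deeper in the call stack.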

Regarding "why isn't it part of the automatically installed dependencies?": there is a tension between having all synthesizers work out of the box and installing only what's needed, to reduce attack surface area. This isn't to say that smartnoise-synth's current defaults are particularly coherent or intentional, but it is an important decision we are trying to think through. Considering that these synthesizers are often run in "eyes-off" environments with strict security controls and especially sensitive data, we're leaning towards making the defaults as minimal as possible.

For example, maybe pip install smartnoise-synth installs only mwem, or even installs no synthesizers at all, and people need to say pip install smartnoise-synth[mst, aim, mwem] to get specific synthesizers. There would be an option like pip install smartnoise-synth[all] that installs all synthesizers, supporting data scientists who need to compare them all to decide which is best to run in the eyes-off environment. The security review and threat model could then focus on the specific synthesizer selected in the evaluation phase, and in the production environment the pip install would need to specify that synthesizer.

Basically, this makes the onboarding experience slightly more cumbersome for people kicking the tires (they have to know the pip install smartnoise-synth[all] syntax), in exchange for making it marginally more difficult for devops people deploying code in eyes-off environments to shoot themselves in the foot.
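To make the per-synthesizer install idea concrete, here is a rough sketch of how such extras could be declared through the setuptools extras_require argument. The dependency lists are assumptions for illustration: only disjoint-set and networkx for mst come from this thread, the aim entry is a placeholder, and the git-hosted mbi package is left out because its distribution name is not stated here.

```python
# Hypothetical extras mapping; the real dependency lists may differ.
EXTRAS = {
    "mwem": [],                            # assumed to need nothing extra
    "mst": ["disjoint-set", "networkx"],   # the two imports that failed above
    "aim": ["networkx"],                   # placeholder, purely an assumption
}


def with_all(extras):
    """Return a copy of extras with an 'all' key holding every dependency once."""
    merged = sorted({dep for deps in extras.values() for dep in deps})
    return {**extras, "all": merged}


extras_require = with_all(EXTRAS)
print(extras_require["all"])

# In setup.py this dict would be passed as:
#     setuptools.setup(..., extras_require=extras_require)
```

When installing extras from a shell, quoting the requirement and dropping the spaces, as in pip install "smartnoise-synth[mst,aim]", avoids glob expansion of the brackets and word splitting.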

Since this is something we are still thinking through, and your team has a lot of real-world experience, your feedback is welcome.

Code snippet in documentation does not work #573