opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License
254 stars 69 forks source link

PostgresReader does not work properly? #298

Closed giulianopietoso-apheris closed 4 years ago

giulianopietoso-apheris commented 4 years ago

After setting up the PUMS dataset in my Postgres database I tried running the example:

from opendp.smartnoise.sql import PostgresReader, PrivateReader
from opendp.smartnoise.metadata import CollectionMetadata

meta = CollectionMetadata.from_file('PUMS.yaml')

query = 'SELECT married, AVG(income) AS income, COUNT(*) AS n FROM PUMS GROUP BY married'
query = 'SELECT AVG(age) FROM PUMS'

reader = PostgresReader('127.0.0.1', 'PUMS', '---')
private_reader = PrivateReader(reader, meta, 1.0)

exact = reader.execute_typed(query)
print(exact)

private = private_reader.execute_typed(query)
print(private)

First issue is related to the PostgresSerializer:

Traceback (most recent call last):
  File "call_posrgres.py", line 9, in <module>
    reader = PostgresReader('127.0.0.1', 'PUMS', '---')
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/sql/reader/postgres.py", line 14, in __init__
    super().__init__(PostgresNameCompare(), PostgresSerializer())
NameError: name 'PostgresSerializer' is not defined

This was a simple indentation problem in the class, which makes it break instantly before anything else happens. I searched for any open issues or PRs related to this in the repository but found nothing, which makes me wonder if anyone is actually using the PostgresReader at all.

https://github.com/opendifferentialprivacy/smartnoise-sdk/blob/master/sdk/opendp/smartnoise/sql/reader/postgres.py#L56

Any how, I fixed this locally so I could continue testing the example.

Output traceback error:

Traceback (most recent call last):
  File "call_posrgres.py", line 15, in <module>
    private = private_reader.execute_typed(query)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/sql/private_reader.py", line 134, in execute_typed
    query = self.parse_query_string(query_string)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/sql/private_reader.py", line 77, in parse_query_string
    queries = QueryParser(self.metadata).queries(query_string)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/sql/parse.py", line 34, in queries
    q.load_symbols(metadata)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/ast/ast.py", line 47, in load_symbols
    r.load_symbols(metadata)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/ast/ast.py", line 207, in load_symbols
    r.load_symbols(metadata)
  File "/home/workspace/anaconda3/envs/dev_smartnoise/lib/python3.7/site-packages/opendp/smartnoise/ast/ast.py", line 286, in load_symbols
    table = metadata[str(self.name)]
TypeError: 'PostgresReader' object is not subscriptable

PUML.yaml

Database:
  public:
    PUMS:
      rows: 1000
      age:
        type: int
        lower: 0
        upper: 100
      sex:
        type: string
      educ:
        type: int
      race:
        type: string
      income:
        type: int
        lower: 0
        upper: 500000
      married:
        type: boolean
      pid:
        type: int
        private_id: True

Database table

CREATE TABLE PUMS (age int, sex char(2), educ int, race char(2), income float, married boolean, pid int )

Has anyone faced anything similar or could shed some light on how to fix this?

eedeleon commented 4 years ago

Hi the second issue is due to parameter ordering. We realized the order should be flipped, and here we are with it causing a confusing error in your code.

Here is an example call from our test code

return PrivateReader(metadata, reader, budget).execute(query)

We plan on making it PrivateReader(reader, schema) very soon to avoid this confusion with warning based back compat for a couple releases to avoid breaking users

eedeleon commented 4 years ago

Our more common use case utilizes the SparkReader, this does not mean that these bugs should be let through so I created an issue to resolve the bugs

https://github.com/opendifferentialprivacy/smartnoise-sdk/issues/300

joshua-oss commented 4 years ago

Hi there,

It does, indeed look like lines 56-58 in Postgres.py were indented incorrectly. After fixing that indentation error, I was able to make your example run.

Note that I needed to insert some records into the database table so that it would have some results:

CREATE TABLE PUMS.PUMS (age int, sex char(2), educ int, race char(2), income float, married boolean, pid int );
INSERT INTO pums.pums VALUES (35, 'm', 9, 'l', 90000, true, 1) ;
INSERT INTO pums.pums VALUES (30, 'f', 1, 'w', 30000, false, 2) ;
INSERT INTO pums.pums VALUES (55, 'm', 13, 'b', 60000, true, 3) ;
INSERT INTO pums.pums VALUES (18, 'f', 8, 'l', 50000, false, 4) ;
INSERT INTO pums.pums VALUES (7, 'm', 12, 'w', 30000, true, 5) ;
INSERT INTO pums.pums VALUES (75, 'f', 13, 'a', 190000, false, 6) ;
INSERT INTO pums.pums VALUES (105, 'm', 3, 'b', 900000, true, 7) ;

Then the following code works.

from opendp.smartnoise.sql import PrivateReader, PostgresReader
from opendp.smartnoise.metadata import CollectionMetadata

meta = CollectionMetadata.from_file("service/datasets/PUMS.yaml")
reader = PostgresReader("localhost", "pums")
#query = "SELECT AVG(age) AS age FROM pums.pums"
query = "SELECT COUNT(*) AS n FROM pums.pums"
res = reader.execute(query)
print(str(res))
priv = PrivateReader(meta, reader)
res2 = priv.execute(query)
print(str(res2))

I used schema name pums, rather than public, but your example should work fine with the public schema.

In the example above, I am calling execute on both the PostgresReader and the PrivateReader, to make it easy to compare exact results with differentially private results.

Note that differentially private AVG() will be very noisy for small denominators; you will want to be in the several hundreds to thousands of records range before the AVG() is accurate. Counts will be more accurate for smaller n, but you will still notice the counts jump around for such a small table.

We will do a scrub to try to understand how that spurious indent crept into the code, and update tomorrow. Thanks for reporting this!

eedeleon commented 4 years ago

The patch mentioned is on pypi now version 0.1.1

giulianopietoso-apheris commented 4 years ago

The issue was indeed the order of the parameters. Thank you very much for the help and now its working. Glad to report the issue on the identification.