opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data

Add Input Checks for pre_aggregated #421

Closed: joshua-oss closed this issue 2 years ago

joshua-oss commented 2 years ago

Spark DataFrames and RDDs are evaluated lazily, so the underlying query may not actually run until the caller requests rows from the result of execute(). Pulling the first row of data to check types, and then re-running the query in the map() that produces output, would cause the query behind pre_aggregated to execute twice, which could be very expensive. One way to avoid this double execution is to do the type and column checking inside the row map itself, as in the sketch below.
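A minimal sketch of what pushing the checks into the single Spark pass might look like. This is not the smartnoise-sdk implementation; the names `make_checked_map`, `expected_types`, and `add_noise` are hypothetical, and the checks are positional because that is how the private reader consumes the subquery result:

```python
# Hypothetical sketch: validate each row inside the map so a lazy Spark
# RDD/DataFrame is only evaluated once, instead of pulling a first row
# up front and then re-running the query for the real map().

def make_checked_map(expected_types):
    def check_row(row):
        # Runs as part of the single lazy pass that produces output,
        # so no extra execution of the pre_aggregated query is needed.
        if len(row) != len(expected_types):
            raise ValueError(
                f"Expected {len(expected_types)} columns, got {len(row)}"
            )
        for idx, (val, t) in enumerate(zip(row, expected_types)):
            if val is not None and not isinstance(val, t):
                raise TypeError(
                    f"Column {idx}: expected {t.__name__}, "
                    f"got {type(val).__name__}"
                )
        return row
    return check_row

# Usage: compose the check with the existing row map, so the RDD is
# traversed only once when the caller finally collects results.
# checked = pre_aggregated_rdd.map(make_checked_map([int, int]))
# result = checked.map(add_noise)  # hypothetical downstream map
```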

The implementation currently ignores column names on pre_aggregated, because in the typical case those names are generated by the private reader (including names containing random strings), and all values are extracted from the subquery result set positionally. This can cause errors if the caller passes pre-computed aggregates in a different order than the private reader expects (e.g. the correct number of columns, both integers, but swapped), which would be hard for the caller to debug. And since the values being passed in are already aggregated, we have no way to inspect the expression that was used to compute each column. However, we can use a heuristic to compare the passed-in column names with the names that would have been generated in the typical case, and raise an error or warning when they don't match, as sketched below.
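A heuristic name comparison could look roughly like the following sketch. It is not the actual check in smartnoise-sdk; `check_pre_aggregated_columns` is a hypothetical helper, and the assumption that generated names end in a random hex suffix is illustrative only:

```python
import re
import warnings

def check_pre_aggregated_columns(passed_names, expected_names):
    """Hypothetical heuristic: compare caller-supplied column names
    against the names the private reader would have generated, and warn
    on a mismatch. Assumes generated names like 'count_age_7f3a' carry a
    random suffix, so a trailing '_<hex>' token is stripped first.
    """
    def normalize(name):
        # Drop a random-suffix token if present; compare case-insensitively.
        return re.sub(r"_[0-9a-f]{4,}$", "", name.strip().lower())

    if len(passed_names) != len(expected_names):
        raise ValueError(
            f"pre_aggregated has {len(passed_names)} columns; "
            f"expected {len(expected_names)}"
        )
    for pos, (got, want) in enumerate(zip(passed_names, expected_names)):
        if normalize(got) != normalize(want):
            warnings.warn(
                f"pre_aggregated column {pos} is named '{got}' but the "
                f"private reader expected '{want}'; values are consumed "
                "positionally, so swapped columns will silently corrupt "
                "results."
            )
```

A warning (rather than a hard error) may be the safer default here, since the heuristic can misfire on legitimately renamed columns.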

joshua-oss commented 2 years ago

Done