joshua-oss commented 2 years ago

This issue is for discussion of the API for data transforms.

Overview

PROBLEM STATEMENT: Various differentially private data processing routines have different expectations about the format of input data, so it's often necessary to preprocess and postprocess. For example, a GAN-based synthesizer expects categorical values to be one-hot encoded, while MWEW-style synthesizers need all values to be discretized into integer values. Pushing these requirements to the analyst gives the analyst greatest control, but can lead to privacy leaks, since many common preprocessing approaches use summary information about the data.

GOAL: Provide a set of safe reversible data transforms that make it easy for analysts to adapt data for the processing algorithms in SmartNoise, particularly for the synthesizers. Data transforms should be able to run independently; with minimal or no coupling to synthesizers. Transforms should keep track of all privacy spend, to allow efficient accounting.

PROPOSAL: Adopt the basic interface of RDT [1] for column-value transforms. Specifically, all column-transforms have fit(), transform(), and reverse_transform() methods. Allow both chaining of column transformers on a single column, as well as aggregation of multiple column transformers to transform a table (e.g. a TableTransformer patterned after SDT's HyperTransformer)

Base Column Transformer

Will use the name ColumnTransformer here for discussion purposes; other names may be better.

class ColumnTransformer:
   def fit(self, data)
   def transform(self, data)
   def reverse_transform(self, data)

Constructor

Instances of concrete ColumnTransformer implementations may have instance-level state that is used to keep track of transformation parameters or privacy budget. Instance-level state can be provided in the constructor, or may be changed in the fit() method. For example, a transformer that uses public bounds to scale the value could use syntax like:

transform = MinMaxScaler(min=0, max=100)

while the same transformer could spend some privacy budget to infer the bounds, like:

transform = MinMaxScaler()
transform.fit(pums, 'age', epsilon=0.1)

The transform and reverse_transform functions do not change any instance state. The output schema of transform will be the input schema of reverse_transform and vice-versa, and these two schemas will not usually be the same. For example, a BinTransform may take floats and return sequential integers. A OneHot transformer will take sequential integers and return multiple bit columns.

RDT provides a helper fit_transform which performs a fit followed by a transform. With privacy-safe transforms, the two most common scenarios will be, A) the fit method spends privacy budget, so it should be called sparingly, or B) the transform doesn't need to fit, so the fit method won't do anything. Because of this, it's not clear how useful fit_transform would be.

Common Instance Variables

Transforms in RDT typically allow per-column null processing [2] to be specified via two parameters: missing_value_replacement (which specifies how to replace or impute) and model_missing_values (which specifies that a new indicator column should be created). We can adopt this pattern, with two minor changes: missing_value_replacement can only support direct substitution, rather than "mean", "median", "mode". We could theoretically allow some budget to spend inferring these, but it's hard to imagine a scenario where that would make sense. Likewise, the model_missing_values flag would need to always create the indicator column, even if no null values exist in the data, because failure to create the column could leak privacy.

All RDT transformers expose get_input_sdtype() which specifies the expected value type for the input column. This returns only a single string value, which would presumably map to the string representation of a Python dtype. This would presumably present some problems for a transformer that supports multiple input types, such as a binning transform that can process int or float. The RDT numerical.py transformer specifies "numerical" as an input sdtype. This doesn't correspond to any Python datatype (that I'm aware of), so it's presumably a convention that is specific to RDT. If we want to expose expected input types, we can consider exposing as a list, or create/borrow some explicitly defined hierarchy of types.

For outputs, the RDT transformers do support multiple types, as get_output_sdtypes(). However, this appears to be designed only to support one dtype per column, and the plural nature of the call is designed for cases where the transform creates a new column or columns (such as when adding a null indicator column). This might be OK, since a transformer that can convert from multiple input types will probably choose a single canonical output type.

Input Dataset Types

The RDT design allows different data to be passed in to all three of the fit, transform, reverse_transform methods. This seems like a good choice.

We deviate from RDT, in that we want to support more input types for data. RDT currently requires data to be a pandas dataframe with a column name. Suggest we support:

pandas df+column name (as in RDT)
numpy array
iterable of scalar
iterable of tuples+column index? (for example, could point to a SQL cursor or Spark RDD)
tensor? PyTorch DataSet?

The default behavior should be for the transform method to return the same type supplied, or as close as possible. This will require some thought. For example, if a list of strings (rperesenting categories) is provided to a label encoder, the label encoder will return a list of integers. However, if that list of integers is passed to a onehot encoder, the result should be understood to include multiple columns. In this case, we could make a choice to return 'k' lists, with each list representing a column, or we could return a list of 'n' arrays of size 'k', where each item in the list represents a row, and each array is a one-hot encoded entry.

For input types that are only iterable, the straightforward choice is to return 'n' tuples of size 'k', since the transform will be processed and returned one row at a time. It probably makes sense to use the same layout for input lists or arrays, though the output in this case should be rturned as a list or array rather than an iterable. In this case, the only scenario where it might make sense to return an array of 'k' columns would be in the pandas case.

As much of the conversion as possible should be handled in the base class, to simplify implementation requirements for the concrete implementations. The items in the list above are descending priority order, with lower priority items being optional. The goal is to make sure the design doesn't tie our hands and preclude eventual support. Two things that seem important to get right from the start are, 1) support inputs that don't have column names and 2) support inputs that are forward-only, streaming.

Concrete Implementations

StandardScaler: Transforms numeric values into a range between 0.0 and 1.0, or -1.0 and +1.0. Can take a public min/max or estimate using privacy budget
BinTransformer: Transforms numeric values into a fixed number of bins. Can take a public min/max or estimate using privacy budget. Reverse transform will need to decide how to "un-bin" values (e.g. midpoint, random, etc.). Fixed-width bins may be suboptimal for many datasets, which increases incentive for the analyst to choose better binning and perhaps make privacy mistakes. We should consider future updates like exponential (e.g. zipf) binning, or binning where values below or above a certain threshold are binned with rest uniform, or z-scale style binning in cases where distribution information is public.
NullTransformer: Copy the RDT one, with changes as described above
LabelEncoder: Same output as BinTransformer, but operates over strings. Could support combining rare labels into "other" category by using some privacy spend.
CategoryToOneHot, OneHotToCategory: (int->bits) and (bits->int)
Boolean: Same as in RDT
DateTime: Same as in RDT; converts datetime to a number
Passthrough: No transform, just passes values through
Identifier: Strips identifiers on transform, generates new identifier on reverse transform.
Faker: strips PII on transform, generates fake PII on reverse
others ???

These are roughly in descending priority order.

Chaining Transformers

Chaining will be needed in some scenarios. For example, to do appropriate comparison between MWEM and GAN, the analyst might wish to use a binning transform for MWEM, and the same bins followed by one-hot for the GAN. One easy way to support this would be to make something like a ChainTransformer that inherits from ColumnTransformer and takes a list of transformers in the constructor. This is a lower priority right now, since it will take some time working with real world examples to get it right. For example, matching the right input and output dtypes becomes trickier. Analysts can work around the lack of chaining by running multiple transforms in sequence.

We should probably implement chaining before implementing too many implicit conversions, though. For example, rather than allowing OneHot to optionally take as input string-coded category names, OneHot should support this by automatically chaining a label encoder and a one-hot encoder.

Table Transformer

The TableTransformer is just a list of ColumnTransformers. As with individual column transforms, we need to handle both named and positional columns.

The table transformer has fit, transform, and reverse_transform, though the input data represents a tabular data source with multiple columns, rather than a single column. We deviate from RDT in that we do not support multiple relational tables. Only one table can be transformed at a time, independently.

One requirement for table transformation is support for row-based. Streaming data implies that the input data are provided as a row-oriented source, such as an iterable over tuples or something like an RDD that supports map() over rows. The ColumnTransformer design described above already supports this iterable pattern where values are transformed and returned on-demand, one at a time. In the case of a TableTransformer, however, the next() method would be called on the source recordset and not on the individual data passed to each ColumnTransformer. This is in fact an issue with any row-based representation of the dataset, even when not using iterators or map(). For example, if the dataset is represented as an in-memory list of tuples, with each tuple representing a row, the TableTransformer would need a way to convert the dataset to individual columns to feed the transformers. A straightforward way to support this would be to ensure that all ColumnTransformer have an atomic transform and reverse_transform that can be called in functional style, with the list-level implementations simply being a map() of the atomic operation. I can't think of any scenarios where a transform or inverse_transform would depend on values that come before or after in the stream (independent of any parameters learned in fit).

The fit functions do require iterating over the full source data, so calling fit() on a TableTransformer with row-based input would potentially result in 'k' ColumnTransformers being fit in parallel, as the table transformer feeds each column transformer an atomic value at a time. Many transformers will have a do-nothing fit function, and those that need to fit will usually be amenable to streamed processing. If we design fit for ColumnTransformers as described above, with the input dataset being simply an iterable over scalar in streaming scenarios, the TableTransformer could build generators for each column to accomplish the streaming. Alternately, the fit design could be modified to support incremental fit. The best approach here is not obvious.

Interface With Synthesizers

Feeding a synthesizer is basically a process of constructing the appropriate transform chains from the source data to match what the synthesizer expects. This depends on the source scheme and the synthesizer metadata. Synthesizers will report errors if inputs don't match the expected format, but it would be nice to have some sort of factory that can create the appropriate transforms, given a source schema and desired synthesizer or synthesizers.

dataset = conn.execute("SELECT * FROM Orders")
mwem = Synthesizer.create("mwem", epsilon=1.0)
transform = TableTransformer.create(dataset, mwem, epsilon=1.0)
transform.fit(dataset)
mwem.fit(transform.transform(dataset))
synthetic = transform.reverse_transform(mwem.sample(500))

In this example, the create method is a factory method that inspects the schema of the source dataset, and queries the synthesizer for metadata about what is expected, then creates the expected TableTransformer.

This code could probably be simplified. One thing to note here is that mwem and the table transformer both set an epsilon of 1.0, so the total privacy spend could be as high as 2.0, though it could be lower, depending on how much each step uses. This means that we need some internal glue to account for total spend.

It's also possible that analysts will want to extend budget by (for example) using disjoint partitions for different operations, such as:

dataset_2_pct = conn.execute("SELECT * FROM Orders WHERE OrderID % 50 == 1")
dataset_98_pct = conn.execute("SELECT * FROM Orders WHERE OrderID % 50 != 1")

mwem = Synthesizer.create("mwem", epsilon=1.0)
transform = TableTransformer.create(dataset, mwem, epsilon=1.0)
transform.fit(dataset_2_pct)
mwem.fit(transform.transform(dataset_98_pct))
synthetic = transform.reverse_transform(mwem.sample(500))

In this case, the maximum spend would be 1.0. There isn't any obvious way to handle this for the analyst, so it will be up to the analyst to properly track spend.

[1] https://github.com/sdv-dev/RDT [2] https://github.com/sdv-dev/RDT/blob/master/rdt/transformers/null.py

lurosenb commented 2 years ago

Thanks so much for this in depth design review Joshua!! Awesome place to start with the Transformers - I'm adding some thoughts below. Assume +1s for everything not explicitly mentioned!:

Let’s have a simple fit_transform wrapper method, and be clear about why we expect that scenario of spending budget to perform a transform to be uncommon! See below for why.

As for supported datatypes, let's try the following to start:

pandas df+column name (as in RDT)
numpy array
iterable of scalar
iterable of tuples+column index (for example, could point to a SQL cursor or Spark RDD)

Let’s not do tensors, as they play pretty well with numpy, and have some idiosyncrasies that could slow us down.

+1 to implementing chaining transformers from the onset - we don’t have to work out all the edge cases, but this will make our lives easier in the future.

+1 to the row based Table Transformer scheme described. We have to be careful, if we decide that the best path forward is atomic ColumnTransformers, that the ColumnTransformer remains lightweight (so that fit isn’t needlessly slow or memory intensive when ‘k’ is large) - this seems doable.

Also, re: fit_transform, the code example with synthesizer seems to have a common scenario where it would be useful and help simplify:

dataset = conn.execute("SELECT * FROM Orders")
mwem = Synthesizer.create("mwem", epsilon=1.0)
transform = TableTransformer.create(dataset, mwem, epsilon=1.0)
mwem.fit(transform.fit_transform(dataset))
synthetic = transform.reverse_transform(mwem.sample(500))

Finally, the “internal glue” here for privacy budget spend is going to be challenging. I could imagine some sort of accountant class that’s not explicitly called/created by a user, but is instantiated and passed around by internal transforms. It could perform some simple disjoint set checks to give better than naive bounds, and other low cost accounting. Perhaps its time to look at ways in which https://github.com/microsoft/prv_accountant makes sense (if at all)?