privacytrustlab / ml_privacy_meter

Privacy Meter: An open-source library to audit data privacy in statistical and machine learning algorithms.
MIT License

Allow users to specify datasets as numpy arrays #60

Closed: amad-person closed this 2 years ago

amad-person commented 2 years ago

Updated the AlexNet tutorial with the new code for dataset handling.

Thoughts on data handling for the upgrade:

  1. Different sources of data that need to be handled: (a) target train and test data, (b) population/aux data for the attack.
  2. The target model's data can be a subset of the population data, or can partially overlap with it.
  3. So we need some way of removing the overlapping data from the population data. The current approach hashes the string version of each train data point and drops any population data point with a matching hash, but this is buggy.
  4. A better way would be to let the user specify the overlapping data point indices, if there are any.
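
To make point 4 concrete, here is a minimal sketch of index-based overlap removal with plain numpy. The helper name `remove_overlap` and the toy arrays are hypothetical and only for illustration; this is not the library's existing code.

```python
import numpy as np


def remove_overlap(x_population, y_population, overlap_indices):
    """Return the population arrays without the rows listed in overlap_indices.

    Hypothetical helper: overlap_indices are row positions in the population
    arrays that also appear in the target model's training data.
    """
    x_population = np.asarray(x_population)
    y_population = np.asarray(y_population)
    # Start by keeping every row, then switch off the overlapping ones.
    keep = np.ones(len(x_population), dtype=bool)
    keep[np.asarray(overlap_indices, dtype=int)] = False
    return x_population[keep], y_population[keep]


# Toy data: 5 population points with 2 features each.
x_population = np.arange(10, dtype=np.float32).reshape(5, 2)
y_population = np.array([0, 1, 0, 1, 0])
target_train_indices = [1, 3]  # rows that the target model was trained on

x_pop, y_pop = remove_overlap(x_population, y_population, target_train_indices)
```

Because the user supplies the indices directly, no hashing or string conversion is involved, and an empty index list simply leaves the population data unchanged.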

Example workflow:

targetDataset = TargetDataset(x_train, y_train, x_test, y_test)
populationDataset = PopulationDataset(x_population, y_population, target_train_indices)

Both TargetDataset and PopulationDataset can inherit from a parent Dataset class, which will have the actual tf-datasets code for creating and using a dataset.
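
A rough sketch of how that hierarchy could look, assuming the parent class wraps `tf.data.Dataset.from_tensor_slices`. The method names (`_pair`, `to_tf_dataset`), the default batch size, and the toy data are assumptions for illustration, not the final API. TensorFlow is imported lazily so the rest of the sketch runs with numpy alone.

```python
import numpy as np


class Dataset:
    """Hypothetical parent class that owns the shared dataset code."""

    @staticmethod
    def _pair(x, y):
        # Convert to numpy arrays and check that features and labels line up.
        x, y = np.asarray(x), np.asarray(y)
        if len(x) != len(y):
            raise ValueError("x and y must have the same number of rows")
        return x, y

    @staticmethod
    def to_tf_dataset(x, y, batch_size=32):
        # Imported here so that TensorFlow is only needed when this is called.
        import tensorflow as tf
        return tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)


class TargetDataset(Dataset):
    """Train and test data of the target model."""

    def __init__(self, x_train, y_train, x_test, y_test):
        self.x_train, self.y_train = self._pair(x_train, y_train)
        self.x_test, self.y_test = self._pair(x_test, y_test)


class PopulationDataset(Dataset):
    """Population/aux data for the attack, minus the target's train points."""

    def __init__(self, x_population, y_population, target_train_indices=None):
        x, y = self._pair(x_population, y_population)
        keep = np.ones(len(x), dtype=bool)
        if target_train_indices is not None:
            # Drop the rows the user marked as overlapping with the train data.
            keep[np.asarray(target_train_indices, dtype=int)] = False
        self.x, self.y = x[keep], y[keep]


# Toy usage: the first 4 population rows are the target's training data.
x_population = np.random.rand(10, 3).astype(np.float32)
y_population = np.arange(10) % 2
x_train, y_train = x_population[:4], y_population[:4]
x_test, y_test = np.random.rand(2, 3).astype(np.float32), np.array([0, 1])

target_dataset = TargetDataset(x_train, y_train, x_test, y_test)
population_dataset = PopulationDataset(
    x_population, y_population, target_train_indices=[0, 1, 2, 3]
)
```

Keeping the overlap removal inside `PopulationDataset` means the attack code only ever sees population data that is disjoint from the target's training set.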