Code to accompany: "Data Isotopes for Data Provenance in DNNs".
This repository uses a conda environment. To set it up (assuming conda
is installed):
$ conda env create --name=venv --file=environment.yml
$ conda activate venv
Our code uses the ffcv library to train CIFAR100 models, so you will have to ensure your system is compatible with that library.
This codebase provides code to reproduce the results for the GTSRB, CIFAR100, and PubFig datasets shown in the original paper. It also supports running experiments on CIFAR10 (not evaluated in the original paper). We provide config files to recreate specific experiments (see the reproducing experiments section below). Below, we provide general information on setting up the codebase.
Set the tr_folder and ts_folder parameters in the configs/scrub/*.yml files so that they point to your local copy of the Scrub dataset.
Each dataset has an associated folder in ./configs. Each config folder contains a default.yaml file with the default model and training parameters for that dataset (see Table 6 in the paper's Appendix) in the single- and multi-tag settings.
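Before launching Scrub experiments, it can help to confirm that the two folder parameters really point at directories on your machine. The following is a minimal sketch, not part of the repository: the helper names read_folder_params and missing_folders are hypothetical, and it assumes each parameter appears on its own line as a plain `key: value` pair without an inline comment. A full YAML parser such as PyYAML would be more robust if the config files are structured differently.

```python
import os
import re


def read_folder_params(config_text, keys=("tr_folder", "ts_folder")):
    """Return {key: value} for simple `key: value` lines in a YAML-style config.

    Assumption: each key sits on its own line with a plain (optionally quoted) value.
    """
    found = {}
    for key in keys:
        pattern = rf"^[ \t]*{re.escape(key)}[ \t]*:[ \t]*(.+?)[ \t]*$"
        match = re.search(pattern, config_text, re.MULTILINE)
        if match:
            # Drop surrounding quotes if the path was quoted in the file.
            found[key] = match.group(1).strip("'\"")
    return found


def missing_folders(config_text, keys=("tr_folder", "ts_folder")):
    """List the keys that are absent or do not point to an existing directory."""
    params = read_folder_params(config_text, keys)
    problems = []
    for key in keys:
        path = params.get(key)
        if path is None or not os.path.isdir(os.path.expanduser(path)):
            problems.append(key)
    return problems
```

To use it, read one of the Scrub config files into a string and pass that string to missing_folders; an empty list means both folders were found.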
If you want to train the PubFig model, download the SphereFace model checkpoint from this link and place it in src/models/.
Set the imagenet_path variable in configs/default.yaml to point to the folder containing the ImageNet validation data (or pass the additional argument --imagenet_path <Imagenet validation data path> when running experiments).
To run one-off experiments with our code, use the main.py script, which accepts command-line arguments specifying the dataset, mark type, model, training settings, etc.
However, it is usually easier to write your own .yml files describing the experiments you want to run. You can then use the run_on_gpus.py script to load a .yml file and its associated experiment settings and run those experiments.
Single experiment example: To recreate the multi-mark CIFAR100 experiments in Figure 10 of our paper, run the following command:
$ python3 run_on_gpus.py configs/cifar100/multi/default.yaml --gpu_ls 0 --max_gpu_num 1
This runs the experiment on GPU 0 of your local machine.
Multi-processing example: To spread the work across several GPUs, run:
$ python3 run_on_gpus.py configs/cifar100/multi/default.yaml --gpu_ls 0123 --max_gpu_num 4
Assuming you have 4 GPUs on your system, this launches 1 run on each GPU, then waits until a GPU is free before starting the next run in the list.
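The scheduling behaviour described here (one run per GPU, with the next run waiting for a free GPU) can be illustrated with a short sketch. This is not the code in run_on_gpus.py; it is a minimal illustration under stated assumptions: the names parse_gpu_ls, launch_on_gpu, and run_jobs are hypothetical, the --gpu_ls value is assumed to be a string of single-digit GPU ids, and the sketch pins each job to a GPU through the CUDA_VISIBLE_DEVICES environment variable, which may differ from what the repository does.

```python
import os
import queue
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_gpu_ls(gpu_ls, max_gpu_num):
    """Split a digit string such as "0123" into GPU ids, capped at max_gpu_num.

    Assumption: every GPU id is a single digit.
    """
    return [int(ch) for ch in gpu_ls][:max_gpu_num]


def launch_on_gpu(command, gpu_id):
    """Run one command with only the given GPU visible; return its exit code."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    return subprocess.run(command, env=env).returncode


def run_jobs(commands, gpu_ids, launch=launch_on_gpu):
    """Run every command, keeping at most one job per GPU at any time."""
    free_gpus = queue.Queue()
    for gpu_id in gpu_ids:
        free_gpus.put(gpu_id)

    def worker(command):
        gpu_id = free_gpus.get()  # blocks until some GPU is idle
        try:
            return gpu_id, launch(command, gpu_id)
        finally:
            free_gpus.put(gpu_id)  # hand the GPU to the next queued run

    # One worker thread per GPU, so queued runs wait for a free slot.
    with ThreadPoolExecutor(max_workers=len(gpu_ids)) as pool:
        return list(pool.map(worker, commands))
```

run_jobs returns one (gpu_id, exit_code) pair per command, in the order the commands were given. Passing a different launch callable makes it possible to try the scheduling logic without any GPU present.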
If you want to reproduce the key evaluation experiments from Sections 5.2 and 5.3 of our paper, use the table below to map specific experiments to specific .yaml files in the configs folder. The .yaml files have the same name for each dataset (if the dataset is used for that experiment), so to run experiment Z for dataset X in setting Y (single or multi tag), you would run something like:
$ python3 run_on_gpus.py ./configs/X/Y/Z.yaml --gpu_ls 01 --max_gpu_num 2
Table/Figure | Datasets | Config file names |
---|---|---|
Figure 8 | GTSRB, CIFAR100, PubFig, Scrub | ./configs/<DATASET>/single/default.yaml |
Table 4 | CIFAR100 | ./configs/cifar100/multi/ablation_same_cla.yaml |
Figure 9 | PubFig | ./configs/pubfig/multi/ablation_perc_alpha.yaml |
Table 3 | GTSRB, CIFAR100, PubFig, Scrub | ./configs/<DATASET>/multi/default.yaml |
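The X/Y/Z naming pattern in the table above can be turned into a command line programmatically. The sketch below is illustrative only: config_path and build_command are hypothetical helpers, not part of the repository, and the sketch assumes that --gpu_ls takes the GPU ids concatenated into one string and that --max_gpu_num equals the number of ids, as in the example commands.

```python
import shlex


def config_path(dataset, setting, experiment="default"):
    """Build ./configs/<dataset>/<setting>/<experiment>.yaml from the table's pattern."""
    if setting not in ("single", "multi"):
        raise ValueError("setting must be 'single' or 'multi'")
    return f"./configs/{dataset}/{setting}/{experiment}.yaml"


def build_command(dataset, setting, experiment="default", gpu_ids=(0,)):
    """Return the run_on_gpus.py invocation as an argument list."""
    gpu_ls = "".join(str(gpu_id) for gpu_id in gpu_ids)  # e.g. (0, 1) -> "01"
    return [
        "python3", "run_on_gpus.py",
        config_path(dataset, setting, experiment),
        "--gpu_ls", gpu_ls,
        "--max_gpu_num", str(len(gpu_ids)),
    ]


# Table 4 (CIFAR100, multi-tag setting) on GPUs 0 and 1.
print(shlex.join(build_command("cifar100", "multi", "ablation_same_cla", gpu_ids=(0, 1))))
```

The argument list can be passed directly to subprocess.run, or printed as above and pasted into a shell.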
All code in this repository is licensed under the MIT license.