stanleybak / vnncomp2021

VNN Neural Network Verification Competition 2021

Add eran benchmark #5

Closed: mnmueller closed this pull request 3 years ago

mnmueller commented 3 years ago

New (random) instances of the "eran" benchmark can be created with "specs_from_seed.sh". Additional CIFAR10 specs can be generated with "specs_from_seed_cifar.sh".
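For readers of the thread, here is a minimal sketch of the kind of file such a generation script emits, assuming the standard VNN-LIB encoding of an L-infinity ball around a correctly classified image. The function name, normalisation, and number formatting are illustrative only and are not the actual contents of "specs_from_seed.sh".

```python
import numpy as np

def write_vnnlib(path, image, eps, true_label, num_classes=10):
    """Emit a VNN-LIB spec for an L-infinity ball of radius eps around image.

    Illustrative sketch: the real scripts may normalise inputs and format
    the bounds differently.
    """
    x = np.asarray(image, dtype=np.float64).flatten()
    lo = np.clip(x - eps, 0.0, 1.0)
    hi = np.clip(x + eps, 0.0, 1.0)
    with open(path, "w") as f:
        # declare one input variable per pixel and one output per class
        for i in range(len(x)):
            f.write(f"(declare-const X_{i} Real)\n")
        for j in range(num_classes):
            f.write(f"(declare-const Y_{j} Real)\n")
        # input constraints: the perturbation box around the image
        for i in range(len(x)):
            f.write(f"(assert (>= X_{i} {lo[i]:.8f}))\n")
            f.write(f"(assert (<= X_{i} {hi[i]:.8f}))\n")
        # unsafe output set: some other class scores at least as high
        # as the true class
        f.write("(assert (or\n")
        for j in range(num_classes):
            if j != true_label:
                f.write(f"  (and (>= Y_{j} Y_{true_label}))\n")
        f.write("))\n")
```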

huanzhang12 commented 3 years ago

@mnmueller Thank you for adding the CIFAR10 specs!

For this pull request, perhaps it's better to include only the "eran" benchmark. How about keeping the scripts for the large CIFAR10 networks in a separate folder that is not part of this pull request (since they are not part of the "eran" benchmark), and giving pointers in your comments in #2? Thanks again!

mnmueller commented 3 years ago

I split the benchmarks into two folders, adding a "cifar" benchmark, and am now using @pat676's specs for last year's CIFAR networks. I could also create two separate pull requests if that is required.

stanleybak commented 3 years ago

For the cifar benchmark, can we use a different short_name, as "cifar" is a bit too generic? Maybe reuse the name from last year's report, which I think was GGN-CNN or GGN-CNN-VNNCOMP2020.

stanleybak commented 3 years ago

Also, how do the cifar specs correspond to the ones from last year? In the report (VNN_COMP2020_Report.pdf), GGN-CNN's CIFAR_2_255 goes up to image index 100, but in the instances.csv files the highest I see is cifar10_spec_idx_88_eps_0.00784_n1.vnnlib, which I guess is index 88? Also, what is convBigRELU__PGD.onnx? Is this the same as the MNIST network from GGN-CNN last year? In instances.csv I see 216 instances defined, but Figure 3 in the report has over 300 instances.

Finally, to generate new images, is there any way to get the specs from last year, or will these be a completely different set of images?
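A quick way to check these counts against the files in this pull request is a short script like the one below. It assumes each instances.csv row follows the usual onnx-path, vnnlib-path, timeout layout and that the spec filenames contain an "idx_<image index>" fragment as in the example quoted above; both assumptions would need checking against the actual files.

```python
import csv
import re
from collections import defaultdict

# Assumes rows of the form "<network>.onnx,<spec>.vnnlib,<timeout>" and
# spec names containing "idx_<image index>", as in the example above.
idx_pattern = re.compile(r"idx_(\d+)")

counts = defaultdict(int)
max_idx = defaultdict(int)
with open("instances.csv") as f:
    for onnx_path, vnnlib_path, _timeout in csv.reader(f):
        counts[onnx_path] += 1
        m = idx_pattern.search(vnnlib_path)
        if m:
            max_idx[onnx_path] = max(max_idx[onnx_path], int(m.group(1)))

for net in sorted(counts):
    print(f"{net}: {counts[net]} instances, highest image index {max_idx[net]}")
```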

mnmueller commented 3 years ago

I mentioned in the main discussion that we would not be able to compare directly to last year's results if we draw random samples to generate our specs, but I think/thought we should still do so if someone champions them, and include them in the evaluation. The example specs were obtained by drawing samples starting at idx 0 until 72 correctly classified samples were reached for each of the 3 networks. I have now updated the script to include an option that looks at the first 100 samples and only includes the correctly classified ones in the spec, which should reproduce last year's benchmark (note that Figure 3 combines MNIST and CIFAR specs, leading to more samples). What gets generated now when no seed is provided are the CIFAR specs from last year plus an additional network.
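A sketch of that selection rule, with illustrative names rather than the actual code in the generation script; `model` is assumed to be a callable returning class scores for a single image.

```python
import numpy as np

def select_indices(model, images, labels, first_n=100, target_correct=None):
    """Pick which test images get a spec for one network.

    - first_n=100, target_correct=None: scan indices 0..99 and keep only the
      correctly classified images (should reproduce last year's benchmark).
    - first_n=None, target_correct=72: scan from index 0 upward until 72
      correctly classified images have been collected.
    """
    selected = []
    for idx in range(len(images)):
        if first_n is not None and idx >= first_n:
            break
        if int(np.argmax(model(images[idx]))) == int(labels[idx]):
            selected.append(idx)
        if target_correct is not None and len(selected) >= target_correct:
            break
    return selected
```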

The 3 networks are the two GGN-CNNs from last year and a slightly larger PGD-trained conv network with 4 convolutional layers. As suggested in the main discussion, if we decide to only use the former two, simply deleting the corresponding line in the bash script will produce that benchmark.