Benchmarks Discussion - Githubissues

stanleybak commented 3 years ago

Use this issue to describe your benchmark and solicit feedback. A benchmark consists of a one or more ONNX network files, one or more VNNLIB property files. For each instance (combination of .onnx and .vnnlib. file), a timeout should also be suggested, where the sum of all the timeouts is less than six hours.

For image classification benchmarks, rather than picking a specific set of images, it's suggested to describe how to generate the .vnnlib file from a specific image (for example, using a fixed epsilon disturbance value and check for a different maximum output). For the competition we would then sample images from MNISt/CIFAR at random, to prevent tailoring of tools to specific images.

Also, indicate if there should be input scaling, channel orders, or anything else. It's probably a good idea to provide an example input or image and expected output, to make sure the execution semantics for the network are clear.

For reference, here are the topics from last year's benchmark discussion: https://github.com/verivital/vnn-comp/issues/1 https://github.com/verivital/vnn-comp/issues/2 https://github.com/verivital/vnn-comp/issues/3

stanleybak commented 3 years ago

To start the discussion I'll propose a few examples from last year. Here's a link to the preferred format of benchmarks, on a simple example with VNNLIB: link

Here's the ACAS Xu system properties 1-4 that we analyzed last year in this format: link

The benchmark directories includes three parts: a set of .onnx network files, a set of .vnnlib spec files, and a .csv listing each combination of .onnx and .vnnlib file that needs to be analyzed, in addition to the per-instance timeout. Total timeout for all instances should be equal to six hours.

I'll also provide one example soon showing how to do image classifiers using DNNV on standard image classification systems.

naizhengtan commented 3 years ago

Hi all, I'd like to propose our benchmark: verifying learned indexes of databases. The full proposal is here. I will briefly introduce this benchmark below.

Background: learned index is a neural network (NN) based database index proposed by this paper, 2018. It has great potential but also has one drawback due to NNs---for non-existing keys (in the database), the outputs of a learned index can be arbitrary.

What we do: to tame NN's uncertainty, we design a specification to dictate how "far" one predicted position can be, compared to its actual position (or the positions that non-existing keys should be).

What to verify: our benchmark provides multiple pairs of (i) learned indexes (trained NNs) and (ii) corresponding specifications. We design these pairs with different parameters such that they cover different user needs and have varied difficulties for verification tools. Please see an example here.

A note on broader impact: using NNs for systems is a broad topic, but many existing works lack strict safety guarantees. We believe that NN Verification can help system developers gain confidence in NNs to apply them to critical systems. We hope our benchmark can be an early step towards this vision.

pat676 commented 3 years ago

Hi all, we want to propose the same three benchmarks as in VNN-COMP 20.

In short, the three benchmarks consist of MNIST classifiers with 2, 4, and 6 hidden layers, respectively, and 256 ReLU nodes in each hidden layer. We suggest 25 images with perturbation radii 0.03 and 0.05 and a timeout of 7 minutes. Note that we changed the radius ( VNN_COMP 20 used 0.02 and 0.05) since perturbations with radius 0.02 often were very easy to verify.

The networks and vnnlib properties for the first 25 images from the MNIST test-set can be found in https://github.com/pat676/vnn-21-fc-mnist

We also added a script "generate_properties.py", which can be used to create properties for random images by setting the random parameter to "True" in the if name == 'main' block at the end of the script.

changliuliu commented 3 years ago

Hi all, I'd like to propose our benchmark: verifying learned indexes of databases. The full proposal is here. I will briefly introduce this benchmark below.

Background: learned index is a neural network (NN) based database index proposed by this paper, 2018. It has great potential but also has one drawback due to NNs---for non-existing keys (in the database), the outputs of a learned index can be arbitrary.

What we do: to tame NN's uncertainty, we design a specification to dictate how "far" one predicted position can be, compared to its actual position (or the positions that non-existing keys should be).

What to verify: our benchmark provides multiple pairs of (i) learned indexes (trained NNs) and (ii) corresponding specifications. We design these pairs with different parameters such that they cover different user needs and have varied difficulties for verification tools. Please see an example here.

A note on broader impact: using NNs for systems is a broad topic, but many existing works lack strict safety guarantees. We believe that NN Verification can help system developers gain confidence in NNs to apply them to critical systems. We hope our benchmark can be an early step towards this vision.

Following up on this benchmark, there are two important features of this benchmark that we would like to seek participants' inputs on:

In this benchmark, the specifications (.vnnlib) contain IF-ELSE conditions (as a bunch of conditions that if the input belongs to a range, the output should satisfies certain constraint). There are many IF-ELSE conditions that if we break them into separated .vnnlib files, there will be too many files and too much overhead to operate these files. Here we would like to see if participants are ok with these IF-ELSE conditions in .vnnlib?
The networks used in this benchmark are big sparse networks. The sparsity does not come from CNN or other type of predefined structures in onnx, and onnx currently does not support arbitrary sparse matrix. So these networks are stored as dense matrices, which can go as big as 500MB (~15MB after zipping). Will any participant has any concern on operating these big networks?

pat676 commented 3 years ago

Hi all, we also want to suggest some larger Cifar10 convolutional networks. We've created an updated repository with the two Cifar10 nets from last year's competition as well as vnnlib properties for the first 100 images and scripts to create random properties. The repo also contains the mnist-fc networks suggested earlier:

https://github.com/pat676/vnn-comp21-benchmarks

The vnnlib files for the same properties as in last year's competition can be generated by running generate_all_properties.sh and vnnlib files based on random images can be created by running generate_all_properties_random.sh.

We also feel that it is important to have some classifiers with Sigmoid and Tanh activation functions. We'll update our repo with some suggestions tomorrow.

EDIT: We have updated our repo with one MNIST Sigmoid and one Tanh classifier. The networks were used for benchmarking in https://vas.doc.ic.ac.uk/papers/20/ecai20-HL.pdf and are originally from https://github.com/eth-sri/eran.

Based on the results from https://vas.doc.ic.ac.uk/papers/20/ecai20-HL.pdf, we suggest two epsilons 0.025 and 0.035, and 25 images for each epsilon.

wu-haoze commented 3 years ago

Hi, we'd like to suggest three cifar10 convolutional networks. The benchmarks are in the following repo: https://github.com/anwu1219/VNN-COMP21-benchmarks We also included a script to generate targeted l-inf robustness query taking the test image index and the epsilon as inputs. The architectures of the networks are specified in https://github.com/anwu1219/VNN-COMP21-benchmarks/blob/master/info.txt The network expects the input image to be (32x32x3) and normalized between[0,1].

Based on our evaluation, we suggest two epsilons 0.012 and 0.024, and 6 minutes timeout for each epsilon.

pat676 commented 3 years ago

Thanks, @anwu1219, we have a few questions regarding the benchmarks.

We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network?

Regarding the Dropout layers, how are these implement in practice? We are concerned that few toolkits may natively support dropout and this may have a negative effect on widening participation.

net

pat676 commented 3 years ago

@changliuliu, @naizhengtan

Thanks, @naizhengtan for the suggestion. While we find the benchmarks interesting, we also agree with @changliuliu that the format indeed seems unfamiliar and may in practice require quite an effort from every participant. To encourage as many toolkits as possible to participate, we suggest that we keep the benchmarks in this competition similar to benchmarks that have been used in previous papers and are widely supported already.

wu-haoze commented 3 years ago

Thanks, @anwu1219, we have a few questions regarding the benchmarks.

We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network?

Regarding the Dropout layers, how are these implement in practice? We are concerned that few toolkits may natively support dropout and this may have a negative effect on widening participation.

Hi @pat676,

Thanks for trying to load the network in your toolkit. The reason for the transpose node in the network is that the networks were originally stored in tensorflow graphdef format (which uses the channel order NHWC) and were transformed to onnx format, whose operators expects NCHW, to conform to the competition format.

I'm not sure that I understand the point about the sequential layer. Attached is a visualization of the architecture of the cifar10_large.onnx network using Netron. I believe the node types in the onnx file are transpose, conv, relu, reshape, matmul and add, which are fairly common operators. Please do let me know if I am missing anything. Screenshot from 2021-05-02 15-16-06

Regarding the dropout layer, it was added for certain layers at training time, but it does not play a role at inference time. The onnx files should be agnostic of whether the networks were trained using dropouts or not.

huanzhang12 commented 3 years ago

Apologize for submitting our benchmarks a little bit late - I hope our benchmarks are still useful for VNN COMP 2021. Together with @tcwangshiqi-columbia @kaidixu we propose a new set of benchmarks of residual networks (ResNet) on CIFAR-10, which is available here: https://github.com/tcwangshiqi-columbia/cifar10_resnet_benchmark.

Currently, most networks in VNN COMP are feedforward NNs, and many tools are hardcoded to handle feedforward networks only. To make neural network verification more useful in practical scenarios, we advocate that tools should handle more general architectures, and ResNet is the first step towards this goal. We hope this can provide some incentives for the community to develop better tools.

Model details: We provided two small ResNet models on CIFAR-10 with the following structures:

ResNet-2B with 2 residual blocks: 5 convolutional layers + 2 linear layers
ResNet-4B with 4 residual blocks: 9 convolutional layers + 2 linear layers

Since this is one of the first benchmarks using ResNet, we keep the networks relatively small (compared to ResNet 50 used in many vision tasks) so hopefully most tools can handle them: the two networks have only 2 and 4 residual blocks, respectively and they are relatively narrow -- although they are still large compared to most models used in the complete NN verification literature. The networks are trained using adversarial training with L_\infty perturbation epsilon (2/255). We report basic model performance numbers below:

Model	Clean acc.	PGD acc.	CROWN/DeepPoly verified acc.
ResNet-2B	69.25	56.67	26.88
ResNet-4B	67.21	55.25	8.81

Since the models are trained using adversarial training, it also poses a challenge for many verifiers - the CROWN/DeepPoly verified accuracy (as a simple baseline) is much lower than PGD accuracy, and we hope this benchmark can motivate researchers in the community to develop stronger tools that can make this gap smaller.

Data format: The input images should be normalized using mean and std computed from CIFAR-10 training set. The perturbation budget is element-wise, eps=2/255 on unnormalized images and clipped to the [0, 1] range. We provide eval.py as a simple PyTorch example of loading data (e.g., data preprocessing, channel order etc).

Data Selection: We propose to randomly select 100 images from the test set which are classified correctly and cannot be attacked by a 50-step targeted PGD attack with 5 random restarts. For each image, we specify the runner up label (the class with second largest logit) as the target for verification, and verify the property that the logit of runner up label is not larger than that of the groundtruth label within L_\infty perturbation eps=2/255 on input. See instructions below for generating test images with a script, and some example properties are in the vnnlib_properties_pgd_filtered folder.

To keep the sum of all the timeouts less than 6 hours (as suggested by @stanleybak), the suggested per-example timeout is 6 minutes per example (assuming only a fraction of all examples time out).

It is also possible to change the target label for evaluation (e.g., using the target label following the ground truth label, a random target label, or all 9 target labels) - I think it is probably better to align the target label selection mechanism with other benchmarks after they are finalized. We are also willing to split the dataset into easy, medium, hard parts based on results by PGD/CROWN/DeepPoly, if that makes the evaluation easier.

pat676 commented 3 years ago

Thanks, @anwu1219, we have a few questions regarding the benchmarks. We have had some trouble loading the networks in our toolkit. All layers seem to be wrapped in a sequential layer (see onnx graph for the large net attached below); would it be possible to get the file in a format without the sequential layer? Also, what is the purpose of the first transpose node in the network? Regarding the Dropout layers, how are these implement in practice? We are concerned that few toolkits may natively support dropout and this may have a negative effect on widening participation.

Hi @pat676,

Thanks for trying to load the network in your toolkit. The reason for the transpose node in the network is that the networks were originally stored in tensorflow graphdef format (which uses the channel order NHWC) and were transformed to onnx format, whose operators expects NCHW, to conform to the competition format.

I'm not sure that I understand the point about the sequential layer. Attached is a visualization of the architecture of the cifar10_large.onnx network using Netron. I believe the node types in the onnx file are transpose, conv, relu, reshape, matmul and add, which are fairly common operators. Please do let me know if I am missing anything.

Regarding the dropout layer, it was added for certain layers at training time, but it does not play a role at inference time. The onnx files should be agnostic of whether the networks were trained using dropouts or not.

Thank you, your visualization of the net helped a lot, we’ll check out Netron.

Indeed, the problem wasn’t the sequential layer. We don’t currently support transpose ops, so I tried skipping the initial transpose but hadn’t noticed there is one more before the FC layers. However, adding support for transpose is easy enough, so we’re happy with the architecture of the networks.

FlorianJaeckle commented 3 years ago

Sorry for the late response. If it is not too late, we would like to provide an adversarial robustness benchmark with image-dependent perturbation radii (epsilons).

All the adversarial robustness benchmarks posted so far consider only image-independent epsilon values, possibly resulting in some (image, epsilon, network) pairings that are either easily verified by all verification methods, or too hard to be verified (in the usually small timeout) by any of them.

In line with the OVAL verification dataset employed in last year's VNN-COMP, whose versions have been already been used in various published works [1, 2, 3, 4, 5], we propose a benchmark consisting of (image, epsilon, network) pairings that have been found via binary search to ensure that all properties are challenging to solve.

We hope this will address some of the points addressed by @alessiolomuscio and others in the Rules Discussion Issue. A brief description of the dataset follows. We will post a link to the properties in VNNLib format in case there is appetite for this benchmark's adoption.

Models: The 3 CIFAR-10 convolutional networks from last year's OVAL verification dataset. These consist of: a "base" model with 2 convolutional layers followed by 2 fully connected layers, a "wide" model with more hidden units in each layer, and a "deep" model that features more layers (but a similar number of hidden units per layer as the "base" model). All three models are trained robustly using the method provided by Kolter and Wong [6].

Verification properties: We propose a dataset of 300 properties for each model: 100 taken from the dataset that was used last year (consisting only on (image, epsilon, network) pairings verified to be robust), 100 more challenging properties (likely to be robust pairings), and 100 pairings that provably contain adversarial vulnerabilities. In order to avoid overfitting, and to comply with the 6hrs cumulative timeout, we suggest sampling 10 properties for each network at competition time, using a timeout of 720 seconds each. (We note that the timeout is picked arbitrarily and offers a trade-off between difficulty and how many properties you can sample to fit the 6hrs timeout) The pairing generation procedure is as follows: given a correctly classified image and a network, we randomly sample a misclassified label, then find the epsilon value (in infinity norm) via binary searches with predefined timeouts.

pkouvaros commented 3 years ago

Thank you @anwu1219 for the proposed benchmark. In our experience radii of the magnitude proposed are easy to verify for convolutional networks since under such radii there is a significant percentage of ReLUs that are stable following a bound propagation pass. We are wandering whether this is also the case here? If so, in the interest of having a more meaningful evaluation of complete verification (where more branching on the states of the ReLUs needs to be done to solve a verification problem), we are suggesting to increase the radii and timeouts and accordingly reduce the number of images.

Thank you @huanzhang12 for the proposed benchmark. Given the size of the networks and their low verified accuracy with DeepPoly, we are suggesting to increase the timeout and reduce the number of images so as to increase the chances of the tools to resolve a verification query in what appears to be a challenging benchmark. Also we feel that the residual blocks may be problematic for several tools that do not currently support them thereby hindering wide participation.

Overall we now have 10 CIFAR10 CNNs proposed. As we believe using all of these would be an overkill we suggest selecting only three networks: one of small size, one of medium size and one of large size. Also, to promote participation, we suggest giving emphasis to CNNs with simple architectures. Regarding the properties, to avoid overfitting of the methods, we believe it is important that the random inputs are drawn from the whole test-set and that the selection of the verification properties is not based on the efficiency of a single tool to verify them. Finally (if we limit the benchmarks to three CIFAR10 networks) we feel that the properties should be to prove robustness against all other classes, instead of a single class.

Neelanjana314 commented 3 years ago

Sorry to join the party late. If permissible, I (on behalf of my lab: VeriVITAL) would like to propose the following benchmark: https://github.com/Neelanjana314/vnn-comp21-benchmarks

Two MNIST classifiers, one with a maxpooling layer and the other with an average pooling layer.

For both the networks we suggest 25 images (provided in the data folder: https://github.com/Neelanjana314/vnn-comp21-benchmarks/tree/main/Data);

for the average pooling network consider a perturbation radii of 0.02 and 0.04 and a timeout of 5 minutes.
for the max pooling network consider a perturbation radii of 0.004 and a timeout of 7 minutes.

The network expects the input image to be (28x28x1) and normalized between[0,1].

For any queries please feel free to contact.

huanzhang12 commented 3 years ago

@pkouvaros Thank you for your comments on our benchmark. Yes, our benchmark is indeed more challenging than most other proposals, however better tools are also being developed throughout the year - in fact, state-of-the-art verifiers such as Fast-and-Complete, Active Set, Lagrangian decomposition and Beta-CROWN can finish the VNN COMP 2020 benchmark on CIFAR-10 pretty quickly, sometimes just a few seconds per instance. If we just have small and simple networks in the benchmark suite, then it is hard to distinguish the performance differences among of these tools as they all finish in seconds. Thus, we believe introducing a more challenging and future-proof benchmark is a good idea. For this year, if our benchmark is selected, we can certainly reduce the number of test images (e.g., from 100 down to 50 or 20 images) and increase timeout threshold to make it easier.

Also, we agree with you that ResNet can be a challenge for many tools. One purpose of our benchmark is to give tool developers an incentive to handle general network structures instead of restricting to mostly toy problems. We hope the incentive can help the VNN community to grow and make broader impacts outside of the community. I do agree that, to promote participation, simple networks should be included in the benchmark suite, but ideally we should also have a track for challenging networks which pushes the boundary and inspires researchers to invent more efficient and powerful verifiers.

Besides, we agree with you that it is important to use random inputs from the whole test-set rather than selecting properties based on the efficiency of a single tool, which may create bias. So far it seems the benchmark proposals have different ways to create test examples, but I think it is probably better to unify the data selection process across benchmarks.

stanleybak commented 3 years ago

My opinion is that good benchmarks should have a mix of easy and difficult instances so that everyone can get some result, while still having hard instances so that new improvements can be evaluated.

In terms of benchmark generation, the goal is to make the specific images be chosen randomly whenever possible, to ensure the methods are generalizable and not only applicable to hand-selected set of benchmark images or categories. This can be done in a number of ways. Some people have python scripts to generate specific vnnlib instances with a command line argument random seed.

Here's another way to do it using DNNV (from David Shriver): localrobustness.dnnp

For this file you can use dnnv to create a series of vnnlib files to check for local robustness:

pip install dnnv
dnnv localrobustness.dnnp --network N cifar10_2_255.onnx --prop.x_i=3 --convert --convert.to=vnnlib --convert.dest=/tmp/outputdir/

This uses i=3 (image #3 from the CIFAR dataset) to create 9 local robustness vnnlin specification files and onnx networks (one for each potential misclassification category), based on these networks from last year.

Also: we've extended the benchmark submission phase a few weeks. The plan is to let everyone do some initial evaluation on the benchmarks and then we'll do a large group meeting with everyone to discuss the benchmarks near the end of the month or early next month.

pat676 commented 3 years ago

I agree with all of @pkouvaros comments, as well as @stanleybak on generating properties from random images. However, it is my opinion that we should prioritise a single vnnlib-property for checking robustness against all other classes, not 9 properties where each property is for a single adversarial class. This seems to be more in line with most practical use-cases where we want to determine robustness against all other classes and toolkits optimised for checking against multiple classes at once should be rewarded in this competition.

@huanzhang12 we have not seen any indication that SoTA toolkits can solve all cases for the CIFAR10 GGN networks from VNN-COMP’ 20 within the timeout; on the contrary it seems like most public SoTA methods have a good balance of difficulty for these benchmarks. I’m not against more challenging benchmarks in general but I agree with @pkouvaros that the complexity of the architectures should be kept to a minimum. I don’t think it is realistic to expect developers to significantly extend the supported architectures in the relatively limited timeframe we have now; if incentivising developers to extend the supported architectures is a goal of VNN-COMP, then this and the architectures under consideration should be made clear at an earlier stage.

@Neelanjana314, related to the comment above, several toolkits do not seem to support pooling layers; If we want to have pooling I would at least suggest limiting it to average-pooling which is theoretically simpler than max-pooling.

wu-haoze commented 3 years ago

@pkouvaros Thanks for the comments on our proposed benchmarks. Experimentally, we found the proposed radii yield benchmarks with a mixed levels of difficulties. Indeed some can be solved fairly quickly without the need of branching, but some does require that.

stanleybak commented 3 years ago

However, it is my opinion that we should prioritise a single vnnlib-property for checking robustness against all other classes, not 9 properties where each property is for a single adversarial class.

This is fine for me, but this will mean the VNN-LIB properties will have disjunctions. Do people want to add support for that?

mnmueller commented 3 years ago

@stanleybak I agree with @pat676 comments that proving robustness against all other classes seems to be a practically more relevant use-case and allows tools that can reuse intermediate results to show this strength. Further I think this should only lead to conjunctions of the form \wedge_i (y_j > y_i ) \forall i \neq j which could be translated to a set of n-1 properties without either conjunctions or disjunctions for tools that do not support conjunctions.

Would the other participants be open for benchmarks using different, non-piecewise linear activation functions such as tanh or sigmoid?

I agree with @huanzhang12 that it will be valuable to also have slightly more complex network architectures such as ResNets in the benchmark mix as they have great practical importance and will provide a great opportunity for the comparison of faster (incomplete) methods.

stanleybak commented 3 years ago

I think this should only lead to conjunctions of the form \wedge_i (y_j > y_i ) \forall i \neq j

This property would be SAT if we can find any inputs where the classification comes out as j, which is not quite what we want. The violations occurs if any of the other classes is greatest, other than j, which is why (I think) we need to either support disjucntion directly, or break it up into 9 specifications

mnmueller commented 3 years ago

@stanleybak I see this does indeed depend on whether we view the specification as the to be verified property (prove it holds for all permissible inputs) or as a description of a counter-example (prove there is no satisfying example). My intent was in line with the former. After checking the vnnlib standard again, I realized that it is in line with your interpretation.

However, looking at the .vnnlib specifications generated by some of the other participants (e.g., @pat676 (https://github.com/pat676/vnn-comp21-benchmarks/blob/main/cifar-conv_2_255/vnnlib_properties/prop_0_eps_0.008.vnnlib) and @huanzhang12 (https://github.com/tcwangshiqi-columbia/cifar10_resnet_benchmark/blob/main/vnnlib_properties/prop_0_eps_0.008.vnnlib)) I think I was not alone in this (mis-)understanding. Therefore, I think it might be important that we clarify this.

Regardless of which notation we chose, I think we should allow for specifications of this type.

pat676 commented 3 years ago

@stanleybak @mnmueller thanks, I had indeed misinterpreted the vnnlib format. I agree that since the vnnlib format encodes the requirements of a counterexample (as opposed to the requirements of safety/robustness) we have to support disjunctions in the vnnlib format to encode this correctly. However, I don't think this should be problematic, especially if we stick to the special cases of maximal (and possibly minimal?) outputs.

From a toolkit perspective, it is my experience that most (all?) toolkits already support such disjunctions for classification problems, especially since most benchmarks in last years vnn-comp used exactly this specification. If any toolkits do not support it, they can always solve each class individually as @mnmueller commented.

huanzhang12 commented 3 years ago

@stanleybak @mnmueller Apologize for misinterpreting the vnnlib format. I agree with @mnmueller @pat676 that proving robustness against all other classes are more related to practical problems. To avoid the potential issues with disjunctions and make properties easier/faster to handle, an approximation is to just use the runner-up class (the class with second-largest logit and smallest margin to the top-1 class) as the target label, which is often the worst case. It probably makes more sense than selecting a random target class. Actually, I am okay with any target label selection scheme (random, runner-up or all targets), but I think it would be better to make it consistent across benchmarks if possible.

stanleybak commented 3 years ago

We're setting up a meeting the week of May 31st to discuss the benchmarks and other aspects of the competition this year. if you haven't gotten our e-mail on the mailing list, please let me or one of the organizers know.

stanleybak commented 3 years ago

@huanzhang12 Although just using the runner-up class or another single class would be easier to encode, there could be strategies that work better if they can directly deal with a disjunction. We could directly present the disjunction, and then have the option of converting it to 9 independent queries if the tool doesn't support disjunction. This might be slightly more work for the organizers though.

The best option is if everyone adds support for disjunctions, although I don't know if this is realistic. We can discuss it during the meeting. I think @changliuliu's benchmark could be converted from a bunch of if-then-else statements to disjunctions, if we allow for constraints over both the input and output variables.

mnmueller commented 3 years ago

I would like to propose the following benchmark:

1) A ReLU FFN with 8 hidden layers of 200 neurons each for MNIST, that was normally trained with eps=0.015. 36 samples and a 5min timeout per sample, with the property being robustness against all other classes. 2) A Sigmoid FFN with 6 hidden layers of 200 neurons each for MNIST, that was normally trained with eps=0.012. 36 samples and a 5min timeout per sample with the property being robustness against all other classes.

Both networks and corresponding specs are provided here: https://github.com/mnmueller/vnncomp2021_eran_benchmarks Also included is a script to generate new (potentially random) specs and one to load and evaluate the networks.

@stanleybak it looks like the meeting time around Tuesday evening (CET) has the highest availability. Do we want to fix a time?

stanleybak commented 3 years ago

@mnmueller great thanks. The meeting time is Tuesday at 12pm (noon) to 1pm Eastern US time. The zoom link should have been sent to the mailing list.

REST-XD commented 3 years ago

If permission, we would like to propose a benchmark of properties related to data augmentation.

Counterexamples of adversarial robustness are like inputs augmented using random noise on the original data. We could design the properties whose counterexamples are:

inputs augmented using random noise on the flipped images (Horizontal flip),
inputs augmented using random noise on the padding area of a zoomed image (Fixed zoom-in), mainly MNIST images of constant value background,
inputs augmented using random noise on the original data, where the element-wise noises are in [delta_min, delta_max] instead of [-delta, delta] (Brightness).

Those augmentations could be easily expressed. Experiments of horizontal flipping robustness took similar time comparing with that of adversarial robustness. We feel that those properties are worthy of verification, especially the brightness.

@stanleybak may I join the discussion? Very lucky to see this channel before the discussion started.

stanleybak commented 3 years ago

@REST-XD looks interesting. Do you have specific networks for this in .onnx format and can you provide example .vnnlib files? You can see some of the earlier discussion for examples. We're planning to have a discussion about this on Tuesday. Send me a message if you're not on the mailing list and I can get you the meeting info.

REST-XD commented 3 years ago

@stanleybak I think these properties could be verified on networks where the adversarial robustness is verified. I'll try to provide some examples on VNN-COMP 2020 networks and images. Seems I'm not on the mailing list, please let me join.

dlshriver commented 3 years ago

In the discussion today, the topic of simplifying networks came up. DNNV has support for some network simplifications, such as transforming a Shape, Gather, Unsqueeze, Concat, Reshape pattern of operations to a single Flatten operator. Some of the other simplifications that are performed are listed here. To simplify a network with DNNV, use the --convert option on a network and property. The property can be any property for the network, but is currently a required argument. For example:

$ pip install dnnv
$ dnnv localrobustness.dnnp --network N cifar10_2_255.onnx --prop.x_i=3 --convert --convert.to=vnnlib --convert.dest=/path/to/output/

If the property is non-trivial, then the simplified network will be saved to /path/to/output/property0.onnx. DNNV produces one network for each disjunct, they should all be equivalent for most properties unless the property contains disjuncts that operate on different networks or on subnetworks. For the benchmarks proposed here, I don't believe this is ever the case.

This simplification can be done using vnnlib formatted specifications as well (instead of DNNP specifications), as follows:

$ pip install dnnv
$ dnnv localrobustness.vnnlib --network N cifar10_2_255.onnx --vnnlib --prop.x_i=3 --convert --convert.to=vnnlib --convert.dest=/path/to/output/

These conversion will also create 1 property for each disjunct so it could be used to split up vnnlib files that use or and and into a file for each of the corresponding disjuncts for tools that do not yet support and and or in vnnlib. This requires the latest version of DNNV (v0.4.6) which can be installed via pip.

pat676 commented 3 years ago

Hi all,

Following yesterday's discussion, we will remove the CIFAR10 Conv networks (used in last year's competition) as well as the Sigmoid and Tanh networks from our proposal to limit ourselves to one benchmark. We keep the MNIST-FC networks as few FC networks have been proposed and we feel it is important to have some FC networks with varying depths.

yodarocks1 commented 3 years ago

Three questions:

Based upon the simple_adversarial_generator's vnncomp_scripts, I see that we can pass a non-zero exit code to skip a benchmark category. My group's tool specializes in certain types of adversarial examples (those that are "safe" when output n is [1] minimal or [2] maximal), and cannot handle others. So, based upon the VNNLIB file, should we pass a non-zero exit code in these cases, or write an output of Unknown or Error?
Should parsing of the VNNLIB file occur in the prepare_instance or run_instance scripts? (I.e. does it count towards the timeout?)
Based on the ACASXU benchmark, it seems as if the VNNLIB file is defining what is a valid adversarial example. However, in many of the other proposed benchmarks, the VNNLIBs define the valid region. Which is the proper interpretation?

I also noticed that the parser used by simple_adversarial_generator can only handle single-line, non-nested assertions (e.g. (assert (>= Y_0 Y_1)) but not (assert (or (...) (and (...) (...)) (...)))), which (though I may be mistaken) could be confusing to those creating benchmarks and testing if their VNNLIB files are "up to spec."

stanleybak commented 3 years ago

@yodarocks1 responses:

when you submit your tool you can provide the list of benchmarks you want to run on in the tool's README. If there are specific instances you want to skip, it's okay to either return a non-zero exit code or output "unknown"
you can do it in either one. for example, if you want to convert vnnlib to something your tool supports, that can be done in prepare_instance. Hopefully parsing is quick and runtimes are more than a few seconds. Only run_instance will be measured and reported (although don't make prepare_instance take too long... maybe we should check for this if it takes more than a few seconds)
the proper interpretation is that the vnnlib file defines an adversarial example. Some of the benchmarks mistakenly flipped this. This is consistent with way SMTLIB is defined.
I'm actively working on updating the simple_adversarial_example based on the discussion to support a restricted set of disjunctions. I'll send an update in a few days once I think it's working so people can test their files (as well as use the parser if they want).

pat676 commented 3 years ago

We had a look at all benchmarks now and have two questions.

1) The largest network atm has only 9696 ReLU nodes, this is in contrast to last year's competition where the two Cifar Networks had ~50k and ~16k ReLU nodes. We consider this to be unfortuante as one of the main goals of the competition is to promote scalability. If everyone agrees and no one else has any larger Cifar networks to propose, we would be happy to create the necessary vnnlib files for last year's cifar networks. However, this would mean that we propose two benchmarks this year.

2) For all benchmarks, it may be a good idea for the authors to provide some statistics which can be used to confirm that the network was loaded correctly. (e.g. test-set accuracy for mnist/cifar nets or the expected outptut for a few inputs).

huanzhang12 commented 3 years ago

I agree with @pat676 that we should have networks at least as large as last year's benchmark. I do believe a main goal of the competition is to promote scalability - I am okay with including the last year's large CIFAR networks. In fact, I feel it's actually better to further scale them up a little bit in this year (e.g., increase width by 2x or adding 1 or 2 more layers - to demonstrate the improvements in tool scalability over years).

Our CIFAR-10 ResNet networks are relatively large and deep, but they won't be supported by all tools so a few large feedforward networks are indeed needed.

And yes, I also agree with @pat676 that providing clean accuracy numbers and expected outputs for each benchmark would be very helpful.

alessiolomuscio commented 3 years ago

Hi All

Sorry to jump in, but I have to say I really sympathise with this point.

I think that as a community we have made enormous scalability gains in the past 3 years or so. We know there's still some way to go. Scalability is what stops us being able to apply our methods to (some) industry-level usecases to get broader visibility of our work. A key objective for the VNN series is, at least from my standpoint, to assess the presnt tech readiness of our methods against these big problems.

I am sure there are many ways of doing this in terms of submissions, benchmarks, etc. But ultimately I think we really ought to have some large models. Some that some tools may be able to verify and some that perhaps no tool can verify, in the hope that some tool some day will be able to tackle them.

So I'd really have at least some models as large as last year, but actually we could consider having some even larger (100k-300k ReLUs?) even if all tools may timeout, at least this year.

Best

-Alessio

stanleybak commented 3 years ago

I agree, this is probably a good idea, at least for a few of the instances. The benchmark submission is a bit decentralized this year, so it's up to the individual people preparing the benchmarks to do this, however.

huanzhang12 commented 3 years ago

@pat676 Do you think it's okay to replace 1 or 2 MNIST models you proposed with 1 or 2 larger models as you mentioned? Thankfully, @mnmueller also has MNIST MLP benchmarks so I feel it's probably okay to replace 1 or 2 MNIST models here, for the sake of demonstrating scalability gains of tools over each year.

@stanleybak @changliuliu @ttj Do you think it is okay for @pat676 to propose 2 "benchmarks" (larger CIFAR + MNIST MLP) here as long as the total number of models from his team is still restricted? It's actually not clear what "benchmark" means here - next year we probably want to refine the rule to restrict the maximum number of models each team can propose (e.g., 3 models) but not the number of "benchmarks" (e.g., the 3 models can contain both MNIST and CIFAR models, and can be in different benchmark tracks).

stanleybak commented 3 years ago

@huanzhang12 Inside a benchmark you can propose any number of onnx files, so different types is okay if that's what people want to do.

@yodarocks1 and others: I've updated simple adversarial generator to include parsing code for simple disjunctions in the vnnlib files as discussed. Things are assumed to be in disjunctive-normal form (the and is not optional) to simplify parsing. If you want to steal the parser for your code you can do that by copying vnnlib.py and using the read_vnnlib_simple function.

In terms of a spec of what type of statements are allowed, I'm using the following regular expressions for parsing (after removing unnecessary whitespace):

    # example: "(declare-const X_0 Real)"
    regex_declare = re.compile(r"^\(declare-const (X|Y)_(\S+) Real\)$")

    # comparison sub-expression
    # example: "(<= Y_0 Y_1)" or "(<= Y_0 10.5)"
    comparison_str = r"\((<=|>=) (\S+) (\S+)\)"

    # example: "(and (<= Y_0 Y_2)(<= Y_1 Y_2))"
    dnf_clause_str = r"\(and (" + comparison_str + r")+\)"

    # example: "(assert (<= Y_0 Y_1))"
    regex_simple_assert = re.compile(r"^\(assert " + comparison_str + r"\)$")

    # disjunctive-normal-form
    # (assert (or (and (<= Y_3 Y_0)(<= Y_3 Y_1)(<= Y_3 Y_2))(and (<= Y_4 Y_0)(<= Y_4 Y_1)(<= Y_4 Y_2))))
    regex_dnf = re.compile(r"^\(assert \(or (" + dnf_clause_str + r")+\)\)$")

Benchmark proposers: you should now be able to use simple_adversarial_generator to test your vnnlib files for compliance. Also: if someone could test a .onnx and .vnnlib file with multiple channels I would appreciate it. I want to check that I'm doing to channel orders correctly.

huanzhang12 commented 3 years ago

@stanleybak Thanks for confirming this - so I think it is clear that @pat676 can include a mix of MNIST and CIFAR models in his benchmark proposal, and this will not violate the rule, correct?

@stanleybak To be more clear, let me rephrase my question on the number of onnx models. For ACASXu I know there are many onnx models and each one is associated with only a few properties. But for MNIST/CIFAR benchmarks it's different - typically there are only very few models (e.g., we have at most 3 onnx models per team so far) but each model has a large number of properties (images) to test. On these classification models, to avoid large variance in random test set selection, we do need at least 20 - 50 test images per model. So in the case of classification models we don't want one team to propose an unlimited number of onnx models in one benchmark and dominate the scoring (this is not happening this year, but might be good for the next year's rule). Do you think that makes sense?

mnmueller commented 3 years ago

@stanleybak Thank you for providing the simple_adversarial_generator. Upon testing it, I saw that you used .reshape(<shape>, order="F") which leads to samples that were flattened using torch.flatten() or equivalently np.reshape(-1) not being reconstructed correctly.

I suggest that we use the index ordering from the 1d vector obtained by torch.flatten() or np.reshape(-1) when the input is in standard torch (nchw) format. This corresponds to the indices in the original shape as obtained by np.arange(<input_dim>).reshape(<input_shape>), or equivalently idx = c*n_x*n_y + y*n_x + x (where c,x,y is the channel and x-, y-coordinate of the current input and n_x and n_y are the width and height of the image, respectively.

This should agree with the code you provided, if one changes the order in the reshape command to .reshape(<shape>, order="C") (or equivalently not provide an order, as "C" is the default).

pat676 commented 3 years ago

Thank you for the comments. We’ve had a look at the benchmarks, and given the limited timeout we don’t see how we can meaningfully merge the MNIST FC and large CIFAR benchmarks (even if we only propose the CIFAR networks we would have to reduce the number of instances to ~1/3 compared to last year to satisfy the maximum timeout requirement). We also don’t consider removing any of the MNIST FC networks a good solution as we have very few proposed FC networks.

We thus won’t be able to propose the larger CIFAR networks; however, we would encourage other groups to consider this. We do currently have a large amount of small CIFAR Conv networks (10 with < 10k nodes), maybe some of them could be switched for larger networks?

stanleybak commented 3 years ago

@huanzhang12 yes it's okay to include a mix. I added a note of your concern for people proposing lots of networks for the discussion for next year's rules

@pat676 I also wrote this down for next year, where we might consider increasing the timeouts in certain cases to better evaluate scalability improvements

@mnmueller thanks for the feedback. I updated the default order in simple_adversarial_generator to be order="C" flattening. I didn't see anything in the VNNLIB standard about channel ordering (unless I missed it), so we have to make sure everyone has made the same assumptions. A simple sanity check people could include is one benchmark with a tiny epsilon and the counterexample spec is that it classifies as expected. Then the simple_adversarial_generator should be able to find counterexamples if the channel ordering is encoded as expected.

huanzhang12 commented 3 years ago

@pat676 Given there are only a few days left before all benchmarks should be finalized, I am not sure if any other team wants to re-propose a larger network, and in fact, there are so far only 2 CIFAR benchmarks (from @anwu1219 and @FlorianJaeckle) in the feedforward track. Considering the time frame, how about just directly reusing last year's large CIFAR-10 benchmark? Since it is inherited from last year, I think it does not need to count towards any team's proposals, and this also gives us the opportunity to directly compare against last year's results to get a clear picture on how tools are improving over the last year. @stanleybak What do you think?

stanleybak commented 3 years ago

@huanzhang12 I like the idea of seeing year-to-year progress, although comparison from last year would be imperfect since different hardware was used. Someone would still need to take ownership of this benchmark and create an <shortname>_instances.csv file and .vnnlib specifications for each benchmark (and maybe an optional random image picker script). If someone does this, perhaps we could include the results in the report as an unscored benchmark (not affecting ranking), just for fairness since it wasn't in the rules. Would that be reasonable?

stanleybak commented 3 years ago

Some people have started to issue pull requests for their benchmarks, which you can see in the repo. If you think your benchmark is ready, you can test it with simple_adversarial_generator by cloning the rep, adding your benchmark to the benchmarks folder and running ./run_in_docker <short_name> and seeing if there are any errors or if an out.csv is produced.

mnmueller commented 3 years ago

@stanleybak @huanzhang12 I would be willing to provide a script generating samples and instance.csv files for either the cifar benchmarks from last year or a slightly larger pgd trained network (4 conv layers ~62k neurons). I do want to note however that neither would give a direct comparison to last year, as the exact samples can have a notable impact on the benchmark difficulty with so few being evaluated.

huanzhang12 commented 3 years ago

@stanleybak Yes I think that totally makes sense. Ideally, we want to find a team to claim and propose this benchmark, so it can still be included for scoring according to the current rule. I am not sure which team has not proposed a benchmark yet, maybe @dlshriver or @yodarocks1?

For someone who is interested in claiming this benchmark, it is the same GGN-CNN benchmark used in last year (page 17 on the report). This is the only benchmark with a large number of neurons to evaluate scalability of tools so ideally we want to include it (all benchmark proposals this year are relatively small so far). I believe @pat676 has already prepared the models and properties in his repository, so there is not too much work left be done here (a script to generate samples and csv is needed). We just need a team to claim the benchmark to follow this year's rules.

@mnmueller Thank you for the help! If no one wants to claim it, then @mnmueller or @pat676 can help us prepare the script generating samples and csv files, but as @stanleybak suggested, for fairness it probably won't count in the score.

stanleybak / vnncomp2021

Benchmarks Discussion #2