verivital / vnn-comp


Category Proposal: Convolutional Neural Networks #3

Open ttj opened 4 years ago

ttj commented 4 years ago

Convolutional layers, presumably in networks for classification

Representative benchmark(s): MNIST, CIFAR classifiers

Specifications: robustness

Questions: any further restrictions or should this be broken out further, e.g., based on other layers (batch normalization, max pool, etc.)?

Allow batch normalization?

Allow max pool, avg pool, etc.?

stanleybak commented 4 years ago

Probably some of these should include max pooling, and some could just have convolution layers.

alessiolomuscio commented 4 years ago

I think we should specify the activation function here. I presume we intend these to be plain ReLUs.

Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out.

Max pooling: should this be a subcategory?

Benchmarks: maybe also something in between MNIST and CIFAR?

ttj commented 4 years ago

Probably some of these should include max pooling, and some could just have convolution layers.

Agreed; based on the feedback so far, I think it will make sense to break this category down further into subcategories.

ttj commented 4 years ago

I think we should specify the activation function here. I presume we intend these to be plain ReLUs.

Batch normalisation: I do not see this making a difference in terms of verification, so I would leave this out.

Max pooling: should this be a subcategory?

Benchmarks: maybe also something in between MNIST and CIFAR?

Thanks for this feedback. Yes, I'm imagining some ReLU layers, given they show up frequently in CNNs, but also some explicit convolutional layers.

While batch normalization is easy to handle for verification (as it's linear/affine after training), many tools do not support it, and it can (at least in our experience) drastically influence training, hence showing up in many real networks. As it's probably not widely supported though, I think this would be a subcategory, unless networks can be transformed equivalently/soundly to e.g. fully connected to handle these.
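For reference, a minimal sketch (standard inference-time batch-norm semantics, not NNV's actual code) of why a trained batch-normalization layer is just an affine map that can be merged into an adjacent linear layer:

```python
import numpy as np

def batchnorm_to_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    # At inference time BN(x) = gamma * (x - mean) / sqrt(var + eps) + beta,
    # i.e. the affine map a * x + b with the coefficients below.
    a = gamma / np.sqrt(running_var + eps)
    b = beta - a * running_mean
    return a, b

def fold_bn_into_fc(W, c, a, b):
    # Fold BN (applied per output unit) into a preceding fully connected layer y = W x + c.
    return a[:, None] * W, a * c + b
```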

To make this clearer, refer e.g. to this network for MNIST, all of whose layers we support natively in NNV (convolutional layers, batch normalization layers, max pool layers, ReLU layers, fully connected layers, and softmax layers).

(attached image: mnist_examples, a diagram of the MNIST network architecture)

I imagine, though, that various other methods/tools have restrictions that may prevent analysis of this, at least without transformation. If transformation is needed, that is potentially okay, as I am aware that some methods convert some of these layers to fully connected ones to handle them. We would want to be sure of the soundness of that transformation, but I presume that would be a meta-theorem in the papers on the tool methods if such a transformation is needed.
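For illustration only (not how any participating tool necessarily does it), a brute-force sketch of extracting the equivalent affine map of a convolutional layer for a fixed input shape, which is one way to sanity-check such a transformation on small examples:

```python
import numpy as np
import torch

def conv_to_affine(conv, in_shape):
    # Build (M, bias) such that conv(x).flatten() == M @ x.flatten() + bias for
    # a fixed input shape, by probing with unit basis inputs. Brute force and
    # only feasible for small layers, but it makes the equivalence explicit.
    n_in = int(np.prod(in_shape))
    with torch.no_grad():
        bias = conv(torch.zeros((1, *in_shape))).flatten().numpy()
        cols = []
        for i in range(n_in):
            e = torch.zeros(n_in)
            e[i] = 1.0
            cols.append(conv(e.view(1, *in_shape)).flatten().numpy() - bias)
    return np.stack(cols, axis=1), bias
```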

Questions:

1) based on your piecewise/ReLU category feedback, I am inferring you have some classification networks that only have ReLUs, is that accurate?

2) We're certainly open to other networks/data sets for benchmarks, do you have any data set in mind? Maybe GTSRB?

souradeep-111 commented 4 years ago

This is to follow up on Taylor's request. I would like to sign up for this category with Sherlock. https://github.com/souradeep-111/sherlock

stanleybak commented 4 years ago

I'd like the nnenum tool to participate in this category, although I probably won't add support for pooling layers.

vtjeng commented 4 years ago

I'd like to enter the MIPVerify tool for this category as it is currently described.

More details on the tool as described in my other comment:

GgnDpSngh commented 4 years ago

I would like to enter our tool ERAN in this category: https://github.com/eth-sri/eran

For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness?

Cheers, Gagandeep Singh

ttj commented 4 years ago

For convolutional networks, besides intensity-based robustness, it is highly desirable to also have robustness against geometric transformations. Are we going to include this or only norm-based robustness?

Thanks for the feedback. While I think not all methods may support this (see e.g. @vtjeng comment above), I think it would be great to include such perturbations and see which ones can handle them. Could you please share some details on how the transformations are specified so we can start to think about how to make this interoperable between methods? I think probably the most extensible way to ease comparison will be in terms of the input sets, but realize this may not be ideal vs. specifying say a rotation/translation/etc by some amount.

FlorianJaeckle commented 4 years ago

The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper 'Branch and Bound for Piecewise Linear Neural Network Verification' and developments thereof.

Cheers, Florian Jaeckle

ttj commented 4 years ago

CNN Category Participants:

Oval, ERAN, MIPVerify, nnenum, NNV, Sherlock

If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks soon (by about May 15).

joshua-smith4 commented 4 years ago

We would like to enter the ARFramework tool (https://github.com/formal-verification-research/ARFramework) into this category. I apologize for the late response. Finishing out this semester has been a little hectic. I hope this does not cause too much inconvenience.

On the subject of formatting, the ARFramework tool operates on Tensorflow graphs exported as google protocol buffers. If we can work out a way to translate the benchmark networks to protocol buffers, this would greatly facilitate our participation. Also, because the ARFramework tool makes use of a modified Fast Gradient Sign Method to generate adversarial examples, access to the gradient of the network with respect to the input is necessary. I would be happy to participate in the network formatting development in order to ensure that this tool and all others have their input requirements met.
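For context, a textbook FGSM step looks roughly like the sketch below (ARFramework uses a modified variant, so this is only meant to show why gradients with respect to the input are needed; PyTorch is used purely for illustration):

```python
import torch
import torch.nn.functional as F

def fgsm_step(model, x, label, eps):
    # x: input batch, label: LongTensor of true classes. The perturbation
    # follows the sign of the loss gradient with respect to the input.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()  # assumes inputs in [0, 1]
```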

Also, I was wondering how we are planning to specify the robustness property. Would someone kindly post a formal specification of the property we will be verifying so that we can check compatibility with our tool? That would be greatly appreciated. Thanks, Josh Smith

pat676 commented 4 years ago

Hi,

We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/.

The toolkit supports:

Regards, Patrick Henriksen

ttj commented 4 years ago

Finalized CNN Category Participants:

Oval, ERAN, MIPVerify, nnenum, NNV, Sherlock, VeriNet, ARFramework, Venus

I will add some discussion of the benchmark plans soon. As indicated earlier, we will decide on the benchmarks as a group, and anyone may propose some. However, by default, we will provide some CNNs that we've analyzed with NNV, and likewise anyone else may propose to provide ones they've analyzed. As there is a lot of parameterization here (the network architecture, the data set, the specification, etc.), those are the details we will want to discuss and finalize next, as well as how all of these aspects will be specified to ease interoperability between the tools.

GgnDpSngh commented 4 years ago

Hi there,

It has been the case that different tools come up with their own neural network formats, which makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format, as it provides a common format for networks trained in different frameworks, e.g., PyTorch and TensorFlow. We would propose the following networks and specifications:

Networks:

  1. Networks trained with provable defenses: We could consider state-of-the-art provably trained networks here that are also challenging to prove, i.e., cannot be proved by simple interval analysis, e.g. from https://openreview.net/pdf?id=SJxSDxrKDr. This way the state of the art also benefits from potentially better verification results.
  2. Networks trained with empirical defenses: We can use state-of-the-art networks trained with PGD etc. We can provide the PGD-trained ResNets used in https://files.sri.inf.ethz.ch/website/papers/neurips19-deepg.pdf from our side.
  3. Networks trained normally that achieve state-of-the-art accuracy for MNIST, CIFAR10, ImageNet, etc. Besides the MNIST networks that you proposed, we can also provide our normally trained networks from https://github.com/eth-sri/eran in ONNX format.

Specifications (based on robustness, i.e., the correct class has a higher score than all other labels):

  1. L_oo-norm: For normally and PGD-trained CIFAR10 networks, meaningful epsilon values have not been proven yet. We can consider the provably trained networks from (1).
  2. Brightness: We can consider this type of specification for PGD-trained and normally trained networks, as these are relatively easier to prove than L_oo on the networks from (2) and (3).
  3. Geometric: We can use the Polyhedra regions for geometric perturbations, e.g., rotation and translation, available via https://github.com/eth-sri/deepg for all types of networks.
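To make the robustness specification above concrete, a minimal pointwise sketch (a verifier must establish this for every input in the perturbation region, not by sampling; the function names are placeholders):

```python
import numpy as np

def spec_holds_at(network, x_perturbed, true_label):
    # Property that must hold for EVERY x_perturbed in the region (L_oo ball,
    # brightness interval, or geometric polyhedron): the correct class has a
    # strictly higher score than all other labels.
    scores = np.asarray(network(x_perturbed))
    return all(scores[true_label] > scores[j]
               for j in range(scores.size) if j != true_label)
```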

We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably and non-provably trained networks. We can use the ground-truth from complete verifiers to test the soundness of participating tools so that one does not win the competition by returning “Verified” always :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit.

Cheers, Gagandeep Singh

ttj commented 4 years ago

Thanks for the feedback! Responses inline.

It has been the case that different tools come up with their own neural network formats, which makes it very difficult to reuse those networks in other tools. I think all networks should be provided in ONNX format, as it provides a common format for networks trained in different frameworks, e.g., PyTorch and TensorFlow.

Yes, the plan is to use ONNX for the networks. For the other parts (specs in particular, to some degree data sets), this will be harder. Part of the goal of VNN-COMP is to generate feedback as VNN-LIB gets created:

http://www.vnnlib.org/

Some participants (@dlshriver) have developed a translation tool from/to ONNX for various tools:

https://github.com/dlshriver/DNNV

This has some other features, such as a DSL for specifications, that we can explore to see whether it's possible to use them.

Thanks for the model pointers, I will look at them and encourage the other participants to also do so. I think given time constraints it won't be realistic to consider everything, so we will need to identify some reasonable subset, and all participants in the category should agree on them.

We should have two types of verification tasks: complete and incomplete verification. For complete verifiers, we can compare speeds on small provably and non-provably trained networks. We can use the ground-truth from complete verifiers to test the soundness of participating tools so that one does not win the competition by returning “Verified” always :). We can disqualify a tool if it has too many unsound results. For incomplete verifiers, we can compare precision and speed by comparing the number of instances proved within a given time limit.

Earlier we discussed a different form of categorization based on completeness, but that did not lead to a lot of discussion. I agree the distinction is important, but would suggest that how it shows up will come later, as results are aggregated into a report. With respect to the benchmarks, what is important at this stage is just to ensure there is sufficient diversity from smaller through larger instances, so that there are benchmarks on which the complete methods can finish in reasonable time. We could consider a subdivision of the benchmarks based on this, but I think it is premature, and the benchmarks could be attempted by both complete and incomplete methods, where the complete ones will of course hit scalability boundaries (which is one of the things we aim to identify by doing this initiative).

Regarding the competition aspect: as this is a first instance to get things started and we're not planning to create a formalized ranking (so really more of a challenge than a competition), we can further discuss how these things would show up. I do not think we should have any plans to disqualify any participants, as a goal of this is to foster the community, etc. Of course ensuring that participants' results on the benchmarks are valid is important, but for some of this, I think basic testing approaches for checking ground truth may be sufficient, particularly for those instances where complete methods will not scale. For reference, this is typically handled in related initiatives (e.g., SV-COMP/SMT-COMP) by a large penalty for soundness bugs (e.g., a score of +1 for each instance correctly proven within the time limit, a score of 0 or -1 for not returning a result in time, and a score of -100 for a soundness bug, such as the tool returning that something is verified when it is false).

I don't think we should have such a detailed scoring system this year (for a plethora of reasons, such as the lack of common infrastructure on which to run so comparisons are completely fair, benchmark standardization, etc.), but we should discuss how to report timeouts, correct vs. incorrect results, etc. from the tools, and use this discussion to guide how future iterations may incorporate the tool results into a scoring system.

pkouvaros commented 4 years ago

Hi

We would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category.

Best

Panagiotis

alessiolomuscio commented 4 years ago

I agree that we want to run this in a friendly manner to help the community rather than proclaiming a winner. If, for example, we come out of this with a preferred input format, and have raised the profile of the topic, I think these alone will have made the effort worthwhile.

Having said this, I think we need to clearly differentiate complete vs. incomplete methods. The issue is orthogonal to the categories (challenge problems) themselves, but each approach comes with advantages and disadvantages, and these ought to be reasonably clear to a potentially interested party trying to understand the results. Maybe we ought to consider grouping tools together as complete or incomplete in the reporting phase.

vtjeng commented 4 years ago

On the topic of making comparisons within a time limit, I think it would be valuable to have all verifiers (complete and incomplete) report the amount of time required to verify each property of interest (with a reasonable overall maximum time to spend on each property).

This would allow us to:

  1. Compare the number of properties verified at a range of time limits (perhaps there are tools that are faster for simpler properties, and other tools that are faster for more complex properties)
  2. Identify heterogeneity in verifier performance (perhaps a particular verifier is better at verifying certain classes of properties).

I agree with @alessiolomuscio's point about the importance of clearly differentiating between complete and incomplete methods. One way we might do so is to differentiate in our final report between cases where we timed out (which can happen to both complete and incomplete methods) and cases where the algorithm runs to completion and reports failure (which only happens to incomplete methods).

ttj commented 4 years ago

Thanks for the feedback! Yes, I agree we will include the timing information, and can group in the resulting report the complete/incomplete methods results.

ttj commented 4 years ago

Quick update---we have gone through all the proposed networks and will provide some additional detailed feedback shortly. I think most of the networks can be used, although ideally the proposer of any network will provide an ONNX format version of it soon. We have done some conversion for the proposed networks and will share that soon, but there are a variety of intricacies of ONNX that need to be considered (e.g., precision of weights, additional layers added in conversion, etc.), so we would prefer if anyone proposing a network provides it in ONNX to avoid any unintentional changes as you're the ones most familiar with your own networks. We will do this for our MNIST networks discussed above.

Given our time constraints and needing to get everything finalized and distributed very soon, I would suggest we only focus on MNIST for this iteration (and skip e.g. some of the CIFAR proposals). If anyone feels strongly otherwise, please let us know. This is in part because we also need to finalize the other inputs (e.g., specific images for local robustness, local robustness bounds / specifications, etc.) for a specific verification problem quickly as well.

Additionally, from the proposed networks, I think only a couple of tools may currently support residual networks, so I would suggest either excluding those for this iteration, or focusing on them later if we have time, by the participants who currently support ResNets (I think only ERAN and MIPVerify from the discussion).

We can discuss the submission of results more, but for planning purposes, we're anticipating participants will submit a Dockerfile and a script to execute everything in the category and produce a LaTeX table with runtimes, verification results, etc. If we have time, we will then run these on a common infrastructure, and otherwise we will just report the characteristics of the machine used by each participant.

vtjeng commented 4 years ago

A question about submission formats: MIPVerify can use a variety of back-end solvers, both proprietary and not, including Gurobi. (The performance with Gurobi is significantly faster than that with open-source solvers such as Cbc).

I would like to submit a Dockerfile so that the work is completely reproducible, but don't know whether this can be done when using the Gurobi solver --- does anyone else have a setup involving a Dockerfile that works with Gurobi?

ttj commented 4 years ago

A question about submission formats: MIPVerify can use a variety of back-end solvers, both proprietary and not, including Gurobi. (The performance with Gurobi is significantly faster than that with open-source solvers such as Cbc).

I would like to submit a Dockerfile so that the work is completely reproducible, but don't know whether this can be done when using the Gurobi solver --- does anyone else have a setup involving a Dockerfile that works with Gurobi?

I think others also use Gurobi, so they can likely respond in more detail, but this should be possible just by not including a license; the license can then be installed manually by the user. We do similar things with NNV for its Docker-based runs, given it uses MATLAB. We can discuss this in more depth as we get to that stage; e.g., if the binaries/installation files also aren't openly available, some setup instructions could be provided. From a quick search (this looks a little old):

https://github.com/SG2B/gurobi

GgnDpSngh commented 4 years ago

Hi all,

We are happy to provide 2 MNIST and 2 CIFAR10 Convolutional networks in the ONNX format here:

https://github.com/eth-sri/colt/tree/master/trained_models/onnx

The epsilon values of the L_oo ball for the MNIST networks are 0.1 and 0.3, while those for the CIFAR10 networks are 2/255 and 8/255. The network names contain the epsilon values. All networks expect input images to first be normalized to the range [0,1]. Next, the standard mean and deviation should be applied to the normalized images before passing them to the networks. The mean and standard deviation for MNIST and CIFAR10 can be found here:

https://github.com/eth-sri/colt/blob/acea4093d0eebf84b49a7dd68ca9a28a08f86e67/code/loaders.py#L7
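A small sketch of the described preprocessing order (the actual mean/std values are in the linked loaders.py and are not repeated here to avoid transcription errors):

```python
import numpy as np

def preprocess(image_uint8, mean, std):
    # Assumes raw uint8 pixels; first scale to [0, 1], then apply the
    # per-channel mean/std normalization from loaders.py.
    x = image_uint8.astype(np.float32) / 255.0
    return (x - mean) / std
```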

Let me know if anyone has any trouble running the networks.

Cheers, Gagandeep Singh

stanleybak commented 4 years ago

@GgnDpSngh which input images are intended for the networks?

GgnDpSngh commented 4 years ago

The first 100 images of the MNIST and CIFAR10 test set should be fine so that there is a good mix of easy, medium, and hard instances for verification. We provide them here: https://github.com/eth-sri/eran/tree/master/data

We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year so that if anyone gets better results they can add their results to the result table. Similar to this: https://github.com/MadryLab/cifar10_challenge

ttj commented 4 years ago

We can filter out images that are not correctly classified. Also, I was wondering if we should have the challenge live throughout the year so that if anyone gets better results they can add their results to the result table. Similar to this: https://github.com/MadryLab/cifar10_challenge

Thanks very much for providing the networks, specs, and inputs!

Regarding the continual challenge aspect: I think that's an interesting idea to consider, let's bring this up at the workshop so we can discuss in more detail how that might work. I think as long as people submit Dockerfiles and scripts to re-generate everything, something like that could be feasible without too much effort. Part of what we'll discuss in the workshop is how this first initiative went, what changes could be made going forward, etc.

Neelanjana314 commented 4 years ago

The first 100 images of the MNIST and CIFAR10 test set should be fine so that there is a good mix of easy, medium, and hard instances for verification. We provide them here: https://github.com/eth-sri/eran/tree/master/data

Hi @GgnDpSngh, thanks for the update on the CNN networks. Could you please mention whether all 100 images are correctly classified by the networks, or what the prediction rates (out of 100) are for each network?

GgnDpSngh commented 4 years ago

Hi @Neelanjana314,

The number of correctly classified images for the networks are as follows:

mnist_0.1.onnx: 100/100
mnist_0.3.onnx: 99/100
cifar10_2_255.onnx: 82/100
cifar10_8_255.onnx: 56/100

I am attaching the files containing the classifications of the images by each network here for everyone's reference:

classification_cifar10_2_255.txt
classification_cifar10_8_255.txt
classification_mnist_0.3.txt
classification_mnist_0.1.txt

Cheers, Gagandeep Singh

Neelanjana314 commented 4 years ago

Hi @Neelanjana314,

The number of correctly classified images for the networks are as follows:

mnist_0.1.onnx: 100/100 mnist_0.3.onnx: 99/100 cifar10_2_255.onnx: 82/100 cifar10_8_255.onnx: 56/100

I am attaching the files containing the classification for the images by each network here for everyone's reference. classification_cifar10_2_255.txt classification_cifar10_8_255.txt classification_mnist_0.3.txt classification_mnist_0.1.txt

Cheers, Gagandeep Singh

Thanks @GgnDpSngh

Neelanjana314 commented 4 years ago

Hi @Neelanjana314, The number of correctly classified images for the networks are as follows: mnist_0.1.onnx: 100/100 mnist_0.3.onnx: 99/100 cifar10_2_255.onnx: 82/100 cifar10_8_255.onnx: 56/100 I am attaching the files containing the classification for the images by each network here for everyone's reference. classification_cifar10_2_255.txt classification_cifar10_8_255.txt classification_mnist_0.3.txt classification_mnist_0.1.txt Cheers, Gagandeep Singh

Thanks @GgnDpSngh

Hi @GgnDpSngh, the models provided were trained on a GPU and, I guess, converted to ONNX networks using the GPU as well. If possible, could you please do the conversion without the GPU variables?

Also, could you please mention the ONNX version, opset version, and PyTorch version used to create and translate these models?

Thanks Neelanjana

stanleybak commented 4 years ago

Hi @GgnDpSngh,

The model structure looks okay to me, except that after the convolutional layers, right before the fully connected layer, there's a strange structure (between nodes 15 and 23 in the image below):

(attached image: mnist_0.1, the ONNX graph of the network)

Any idea what this is? It looks like some sort of reshaping... but that should only need one node. Is this the GPU variable that @Neelanjana314 was referring to?

FlorianJaeckle commented 4 years ago

Hi all:

Sorry for the late response. We were working on NeurIPS submissions. If it is not too late, we would like to provide 3 CIFAR-10 convolutional networks and verification properties specifically found for these networks.

Models:  The provided models have different network architectures and are of different sizes. Specifically, there is a base model with 2 convolutional layers followed by 2 fully connected layers. The other two are larger models: a wide model that has the same number of layers as the base model but more hidden units in each layer and a deep model that has more layers but a similar number of hidden units in each convolutional layer to the base model. All three models are trained robustly using the method provided by Kolter and Wong [1]. All three models are available in both PyTorch and ONNX format.

Verification properties: We consider the following verification properties: given an image x for which the model correctly predicts the label y, we want to verify that the trained network will not make a mistake by labelling a slightly perturbed image as y’ (y’ \neq y). The label y’ is randomly chosen at the beginning and the allowed perturbation is determined by an epsilon value under infinity norm. The epsilon values are found via binary searches with predefined timeouts. Overall, on the base model, we collected more than 1500 verification properties with three difficulty levels and a one hour timeout. We also collected 300 properties for the wide model and 250 properties for the deep model using 2-hour timeouts. 
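A rough sketch of the kind of binary search described (the actual procedure used to build the dataset may differ, e.g., in bounds and timeout handling):

```python
def search_epsilon(verifies, lo=0.0, hi=1.0, tol=1e-3):
    # verifies(eps) -> True if the 1-vs-1 property is proven within the
    # per-call timeout, False if it is falsified or times out.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if verifies(mid):
            lo = mid   # still provably robust at mid; try a larger radius
        else:
            hi = mid   # not provable at mid; shrink the radius
    return lo          # largest epsilon at which the property was proven
```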

The models and the pandas tables with all verification properties can be found at https://github.com/oval-group/GNN_branching/tree/master/onnx_models and https://github.com/oval-group/GNN_branching/tree/master/cifar_exp respectively. These verification datasets have already been used in two published works [2,3].

[1] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning, 2018.

[2] Jingyue Lu and M. Pawan Kumar. Neural Network Branching for Neural Network Verification. International Conference on Learning Representations, 2020.

[3] Rudy Bunel and Alessandro De Palma and Alban Desmaison and Krishnamurthy Dvijotham and Pushmeet Kohli and Philip H. S. Torr and M. Pawan Kumar. Lagrangian Decomposition for Neural Network Verification.  Uncertainty in Artificial Intelligence, 2020.

Cheers, Florian Jaeckle (Oval)

GgnDpSngh commented 4 years ago

Hi @GgnDpSngh,

The model structure looks okay to me except after the convolutional layers right before the fully connected layer there's a strange structure (between nodes 15 and 23 in the image below):

(attached image: mnist_0.1, the ONNX graph of the network)

Any idea what this is? It looks like some sort of reshaping... but that should only need one node. Is this the GPU variable that @Neelanjana314 was referring to?

Hi @stanleybak, the reshape node is for flattening the input, that is, converting the [c,h,w] output of the last convolutional layer into a c \times h \times w vector for the first fully connected layer.

Cheers, Gagandeep Singh

stanleybak commented 4 years ago

the reshape node is for flattening the input, that is, converting the [c,h,w] input from the last convolutional layer into c \times h \times w vector for the first fully connected network.

@GgnDpSngh Why isn't reshaping a single operation? It looks like it's taking 8 nodes or so to do this. See this zoomed in view of what I mean:

(attached image: zoomed-in view of the reshape subgraph)

GgnDpSngh commented 4 years ago

the reshape node is for flattening the input, that is, converting the [c,h,w] input from the last convolutional layer into c \times h \times w vector for the first fully connected network.

@GgnDpSngh Why isn't reshaping a single operation? It looks like it's taking 8 nodes or so to do this. See this zoomed in view of what I mean:

(attached image: zoomed-in view of the reshape subgraph)

@stanleybak The code corresponding to this complicated pattern is here: https://github.com/eth-sri/colt/blob/20f30b073558ae80e5e726515998c1f31d48b6c6/code/layers.py#L99

This seems to be an inherent feature of the conversion from PyTorch models to ONNX: https://github.com/pytorch/pytorch/issues/20453#issue-443611819
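For reference, a minimal sketch of the kind of dynamic flattening that triggers this export pattern (assuming the linked layers.py does something equivalent):

```python
import torch
import torch.nn as nn

class Flatten(nn.Module):
    # Because the target shape is computed from the input tensor at runtime,
    # the ONNX exporter emits a Shape -> Gather -> Unsqueeze -> Concat ->
    # Reshape subgraph instead of a single Flatten/Reshape node.
    def forward(self, x):
        return x.view(x.size(0), -1)
```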

Cheers,

stanleybak commented 4 years ago

The number of correctly classified images for the networks are as follows: mnist_0.1.onnx: 100/100 mnist_0.3.onnx: 99/100 cifar10_2_255.onnx: 82/100 cifar10_8_255.onnx: 56/100

I was able to reproduce these results, at least for classification. @GgnDpSngh how should these be analyzed? For example, for the cifar10_8 network, I first check if the classification is correct (throwing away images that classify incorrectly). If it's correct, then I try to see if it classifies robustly within an l_infinity ball with epsilon=8/255, and report safe, unsafe, or timeout for each image? What timeout do we use? Is the l_infinity spec the correct one to use (you also mentioned brightening and rotations as possible specs)?

Also @ttj, are the benchmarks finalized now? There was some mention of only using MNIST and not CIFAR. I wouldn't want to spend time analyzing the CIFAR ones if we're not using them. A definitive list of benchmarks would be helpful at this point. @GgnDpSngh posted 4 networks in this category, and @FlorianJaeckle provided some as well. Which ones do we want everyone to analyze?

ttj commented 4 years ago

Also @ttj, are the benchmarks finalized now? There was some mention of only using MNIST and not CIFAR. I wouldn't want to spend time analyzing the CIFAR ones if we're not using them. A definitive list of benchmarks would be helpful at this point. @GgnDpSngh posted 4 networks in this category, and @FlorianJaeckle provided some as well. Which ones do we want everyone to analyze?

Thanks @stanleybak, we'll pull out all the examples into the repository today/tomorrow. I originally had suggested just MNIST as not many CIFAR examples had been provided and in the interest of time. I suggest everyone complete MNIST first, then proceed to CIFAR in case time constraints become an issue.

Neelanjana314 commented 4 years ago

The number of correctly classified images for the networks are as follows: mnist_0.1.onnx: 100/100 mnist_0.3.onnx: 99/100 cifar10_2_255.onnx: 82/100 cifar10_8_255.onnx: 56/100

I was able to reproduce these results, at least with classification.

Hi @stanleybak, @GgnDpSngh, could either of you please provide me with the layerwise outputs of the classification network corresponding to mnist_0.1.onnx for the 2nd image of the list? Currently I am having some issues with the import in MATLAB.

Thanks Neelanjana

GgnDpSngh commented 4 years ago

The number of correctly classified images for the networks are as follows: mnist_0.1.onnx: 100/100 mnist_0.3.onnx: 99/100 cifar10_2_255.onnx: 82/100 cifar10_8_255.onnx: 56/100

I was able to reproduce these results, at least for classification. @GgnDpSngh how should these be analyzed? For example, for the cifar10_8 network, I first check if the classification is correct (throwing away images that classify incorrectly). If it's correct, then I try to see if it classifies robustly within an l_infinity ball with epsilon=8/255, and report safe, unsafe, or timeout for each image? What timeout do we use? Is the l_infinity spec the correct one to use (you also mentioned brightening and rotations as possible specs)?

Also @ttj, are the benchmarks finalized now? There was some mention of only using MNIST and not CIFAR. I wouldn't want to spend time analyzing the CIFAR ones if we're not using them. A definitive list of benchmarks would be helpful at this point. @GgnDpSngh posted 4 networks in this category, and @FlorianJaeckle provided some as well. Which ones do we want everyone to analyze?

Hi @stanleybak, yes, for the analysis we should discard the incorrectly classified images and only run the analysis for the L_oo-norm-based regions for the networks. Since there are many images, I would suggest a timeout between 1 and 5 minutes, depending on the size of the network.
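To make the agreed protocol concrete, a hypothetical driver loop (all names here are placeholders standing in for each tool's own interface, not an agreed API):

```python
def run_category(network, images, labels, predict, verify_linf,
                 eps=8/255, timeout_s=300):
    # Skip incorrectly classified images, then check L_oo robustness and
    # record "safe", "unsafe", or "timeout" per image.
    results = []
    for idx, (x, y) in enumerate(zip(images, labels)):
        if predict(network, x) != y:
            continue
        results.append((idx, verify_linf(network, x, y, eps, timeout_s)))
    return results
```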

Cheers, Gagandeep Singh

stanleybak commented 4 years ago

Hi @stanleybak, @GgnDpSngh could any of you please provide me with the layerwise outputs for the classification network corresponding to mnist_0.1.onnx for the 2nd image of the list. Currently I am having some issues with the importing in matlab.

@Neelanjana314 I've attached a file with the intermediate values for the network. The node labels are the ones in the onnx graph:

(attached image: mnist_0.1 ONNX graph with node labels)

intermediate_outputs.txt
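In case it helps others reproduce this, a rough sketch (assuming the onnx and onnxruntime Python packages; may need adjustment for this particular model) of dumping intermediate tensors from the ONNX graph:

```python
import numpy as np
import onnx
import onnxruntime as ort

model = onnx.load("mnist_0.1.onnx")
existing = {o.name for o in model.graph.output}
for node in model.graph.node:
    for name in node.output:
        if name not in existing:
            # Expose every intermediate tensor as an additional graph output.
            model.graph.output.extend([onnx.ValueInfoProto(name=name)])
onnx.save(model, "mnist_0.1_debug.onnx")

sess = ort.InferenceSession("mnist_0.1_debug.onnx", providers=["CPUExecutionProvider"])
x = np.zeros((1, 1, 28, 28), dtype=np.float32)  # placeholder; use the real preprocessed image
outputs = sess.run(None, {sess.get_inputs()[0].name: x})
```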

Neelanjana314 commented 4 years ago

Hi @stanleybak, @GgnDpSngh could any of you please provide me with the layerwise outputs for the classification network corresponding to mnist_0.1.onnx for the 2nd image of the list. Currently I am having some issues with the importing in matlab.

@Neelanjana314 I've attached a file with the intermediate values for the network. The node labels are the ones in the onnx graph:

(attached image: mnist_0.1 ONNX graph with node labels)

intermediate_outputs.txt

Thanks @stanleybak

stanleybak commented 4 years ago

Overall, on the base model, we collected more than 1500 verification properties with three difficulty levels and a one hour timeout. We also collected 300 properties for the wide model and 250 properties for the deep model using 2-hour timeouts.

@FlorianJaeckle If there are 1500 properties with a one-hour timeout and 550 properties with a 2-hour timeout, the worst-case runtime is about 108 days... is that correct? I assume many of the properties verify quickly, otherwise this would be unreasonable. Is there an interesting, reasonable subset we could expect other people to try? Also, can you explain the property format? I see it's a pickled pandas table... what is the meaning of the 17 columns? Where are the images?

Also @ttj , were we going to use @FlorianJaeckle 's benchmarks this time or focus on the other ones given by @GgnDpSngh or any others?

FlorianJaeckle commented 4 years ago

Thanks @stanleybak for your comments! We've created 2 smaller subsets for each of the three models: one with 20 properties, and another with 100 properties that contains the first subset (https://github.com/oval-group/GNN_branching). In the interest of saving time, using a timeout of 1 hour for all properties should suffice. As the majority of properties can be verified fairly quickly, and because one can solve different properties in parallel, the experiments should run a lot faster than the theoretical upper bound you mentioned. The pandas tables now have only three columns: all images are taken from the CIFAR10 test set, and the Idx column refers to the image index; the Eps value defines the epsilon-sized l_infinity ball around the image; and finally the prop value defines the property we are verifying against, as we are doing 1-vs-1 verification. All properties are UNSAT, meaning that the network is robust around each image. Let me know if you have any further questions!
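A short sketch of reading one of the property tables as described (the file name below is a placeholder; the real tables live under the cifar_exp directory):

```python
import pandas as pd

props = pd.read_pickle("base_easy.pkl")  # placeholder name

for _, row in props.iterrows():
    idx = int(row["Idx"])      # index into the CIFAR10 test set
    eps = float(row["Eps"])    # radius of the l_infinity ball around that image
    target = int(row["prop"])  # the single incorrect label to verify against
    # Property (expected UNSAT): for all x' with ||x' - x_idx||_oo <= eps,
    # score(correct label) > score(target).
```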

Neelanjana314 commented 4 years ago

Also @ttj , were we going to use @FlorianJaeckle 's benchmarks this time or focus on the other ones given by @GgnDpSngh or any others?

@stanleybak please correct me if I am wrong; as per previous discussions, I think we will primarily focus on the MNIST networks and, if time permits, test with CIFAR10 as well.

GgnDpSngh commented 4 years ago

Thanks @stanleybak for your comments! We've created 2 smaller subsets for each of the three models: one with 20 properties, and another with 100 properties that contains the first subset (https://github.com/oval-group/GNN_branching). In the interest of saving time, using a timeout of 1 hour for all properties should suffice. As the majority of properties can be verified fairly quickly, and because one can solve different properties in parallel, the experiments should run a lot faster than the theoretical upper bound you mentioned. The pandas tables now have only three columns: all images are taken from the CIFAR10 test set, and the Idx column refers to the image index; the Eps value defines the epsilon-sized l_infinity ball around the image; and finally the prop value defines the property we are verifying against, as we are doing 1-vs-1 verification. All properties are UNSAT, meaning that the network is robust around each image. Let me know if you have any further questions!

Hi @FlorianJaeckle, I probably missed the earlier conversations, but based on what you wrote, the properties only encode that the score for a random label is not greater than that of the correct label? In the usual robustness setting one verifies against all labels, so the considered properties seem weaker. Is there any particular reason why we don't consider verifying the full robustness of these models?

Cheers, Gagandeep Singh

pat676 commented 4 years ago

Hi @FlorianJaeckle, in my opinion, it is essential that we also provide SAT cases, not just UNSAT. A mix makes a more realistic testing environment and is also more likely to detect bugs in toolkits.

Kind Regards, Patrick

FlorianJaeckle commented 4 years ago

Hi @GgnDpSngh and @pat676 , thank you for your comments and sorry for the late reply! There is no obvious advantage of doing 1 vs 1 verification compared to 1 vs all, but for the sake of comparing verification methods, both should be similarly well suited. Unfortunately, if we change our dataset to 1-vs-all verification then a lot of challenging UNSAT properties will change to very easy SAT ones.

We separated UNSAT and SAT properties because in our papers we mostly wanted to test our lower-bounding strategies, but we can add suitable SAT properties for the three models shortly, if desired.

harkiratbehl commented 4 years ago

Hi all,

We are happy to provide 2 MNIST and 2 CIFAR10 Convolutional networks in the ONNX format here:

https://github.com/eth-sri/colt/tree/master/trained_models/onnx

The epsilon values of the L_oo ball for the MNIST networks are 0.1 and 0.3, while those for the CIFAR10 networks are 2/255 and 8/255. The network names contain the epsilon values. All networks expect input images to first be normalized to the range [0,1]. Next, the standard mean and deviation should be applied to the normalized images before passing them to the networks. The mean and standard deviation for MNIST and CIFAR10 can be found here:

https://github.com/eth-sri/colt/blob/acea4093d0eebf84b49a7dd68ca9a28a08f86e67/code/loaders.py#L7

Let me know if anyone has any trouble running the networks.

Cheers, Gagandeep Singh

Hi @GgnDpSngh, it is difficult to import ONNX models into PyTorch (https://github.com/pytorch/pytorch/issues/21683). I can see that your original code is in PyTorch. Are the PyTorch models in https://github.com/eth-sri/colt/tree/master/trained_models the same as the ONNX models, so that we can use the PyTorch models directly?

GgnDpSngh commented 4 years ago

Hi @harkiratbehl, I guess @ttj can reply to this better; I am not sure if a non-ONNX format is allowed at the moment?

Cheers, Gagandeep Singh