verivital / vnn-comp


Category Proposal: Piecewise Linear Activations (ReLU) #2

Closed: ttj closed this issue 4 years ago

ttj commented 4 years ago

Networks consisting of piecewise linear activations such as ReLUs

Representative benchmark: ACAS-Xu

Benchmark details: see appendix section 6: https://arxiv.org/abs/1702.01135

Specifications: safety (output lies in specific range based on a set of inputs)

Questions: keep to only ReLUs, or allow other linear and piecewise linear activations?

stanleybak commented 4 years ago

Should the specifications be a single set? Some specs, like "Clear of conflict is minimal", require encoding as a disjunction (unsafe if left > clear or right > clear, etc.), which I'm not sure all tools will support. Perhaps it makes sense to provide both forms if not all tools can handle disjunctions.

alessiolomuscio commented 4 years ago

Benchmarks: ACAS is low dimensional. We also need some high dimensional benchmarks.

Activations: I think just ReLUs is sufficient.

Specifications: safety is good but I think we ought to have local robustness here as well.

ttj commented 4 years ago

Should the specifications be a single set? Some specs, like "Clear of conflict is minimal", require encoding as a disjunction (unsafe if left > clear or right > clear, etc.), which I'm not sure all tools will support. Perhaps it makes sense to provide both forms if not all tools can handle disjunctions.

Thanks for the feedback. Yes, we will need to handle this, e.g., by providing a list of the sets in the disjunction. Tools that don't support disjunctions can then check each of these sets in separate calls (as sketched below), whereas tools that support disjunctions internally can handle them however they prefer.
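For illustration, here is a minimal sketch of how a tool without native disjunction support could split such a property into one query per disjunct; the `verify` interface and the constraint strings are hypothetical, not part of any particular tool.

```python
# Hypothetical sketch: check a disjunctive unsafe specification by issuing
# one verification query per disjunct. verify(network, input_set, constraint)
# stands in for a single-query tool interface and is assumed to return
# "safe" or "unsafe" for that one constraint.

def check_disjunctive_spec(network, input_set, unsafe_disjuncts, verify):
    """The property is violated if any single disjunct is reachable,
    e.g. unsafe_disjuncts = ["left > clear", "right > clear", ...]."""
    for constraint in unsafe_disjuncts:
        if verify(network, input_set, constraint) == "unsafe":
            return "unsafe"  # one reachable disjunct already violates the spec
    return "safe"            # no disjunct is reachable
```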

ttj commented 4 years ago

Benchmarks: ACAS is low dimensional. We also need some high dimensional benchmarks.

Activations: I think just ReLUs is sufficient.

Specifications: safety is good but I think we ought to have local robustness here as well.

Thanks for the feedback. While I agree the ACAS networks are small, if one considers all the properties, the full range of input sets, and all the networks, they are nontrivial to analyze (e.g., I believe state-of-the-art tools still need hours of analysis).

We certainly welcome other benchmarks if there are suggestions for larger ones.

What I was thinking (though open to revision) is that the CNN category would cover the larger networks and robustness specifications (and explicitly cover convolutional layers), but if there are networks with only ReLUs for MNIST/CIFAR/etc., we could probably add those as benchmarks for this category if that is what you're thinking.

stanleybak commented 4 years ago

Yes, I think there are some MNIST networks with ReLU / fully connected layers. Local robustness could make sense here.

alessiolomuscio commented 4 years ago

Benchmarks: ACAS is low dimensional. We also need some high dimensional benchmarks. Activations: I think just ReLUs is sufficient. Specifications: safety is good but I think we ought to have local robustness here as well.

Thanks for the feedback. While I agree the ACAS networks are small, if one considers all the properties, the full range of input sets, and all the networks, they are nontrivial to analyze (e.g., I believe state-of-the-art tools still need hours of analysis).

We certainly welcome other benchmarks if there are suggestions for larger ones.

What I was thinking (though open to revision) is that the CNN category would cover the larger networks and robustness specifications (and explicitly cover convolutional layers), but if there are networks with only ReLUs for MNIST/CIFAR/etc., we could probably add those as benchmarks for this category if that is what you're thinking.

To my mind the architectures and the activations are orthogonal to the size of the models. Indeed I would have thought we certainly wanted to have a set of models in each category ranging from small/manageable to large/undoable.

In addition I do think we need to cover extensively models with high dimensional inputs, including in a ReLU category. This is what the AI/ML community use and I think we want to reach out to them and show that our benchmarks cover the sort of models they themselves use (at least in terms of architecture, if not quite of size just yet) and not just case studies normally explored in verification.

ttj commented 4 years ago

To my mind the architectures and the activations are orthogonal to the size of the models. Indeed I would have thought we certainly wanted to have a set of models in each category ranging from small/manageable to large/undoable.

In addition I do think we need to cover extensively models with high dimensional inputs, including in a ReLU category. This is what the AI/ML community use and I think we want to reach out to them and show that our benchmarks cover the sort of models they themselves use (at least in terms of architecture, if not quite of size just yet) and not just case studies normally explored in verification.

Yes, these are good points, and I agree. I had not really thought of having classification without convolutional layers, which is why I had primarily put ACAS-Xu as the representative benchmark. But assuming that's possible and architectures can be provided, we could certainly have parameterized examples scaling the size/architecture of the networks within this category as well, e.g., for MNIST and potentially other data sets.

With respect to high-dimensional inputs, I presume you are thinking of images that would be covered with e.g. MNIST/other data sets in this category, but please let us know if you have other examples in mind that might make appropriate benchmarks.

alessiolomuscio commented 4 years ago

So maybe we ought to have 2 main categories only: ReLU and sigmoids. The present CNN could sit as a subcategory of the ReLU one. In the ReLU category we could have high (vision, classifiers) and low (Acas, ...) dimensionality inputs, some fully connected, some convolutional, etc.

Some existing tools may not support all the benchmarks, but I think that's fine.

souradeep-111 commented 4 years ago

This is to follow up on Taylor's request. I would like to sign up for this category with Sherlock. https://github.com/souradeep-111/sherlock

ttj commented 4 years ago

So maybe we ought to have 2 main categories only: ReLU and sigmoids. The present CNN could sit as a subcategory of the ReLU one. In the ReLU category we could have high (vision, classifiers) and low (Acas, ...) dimensionality inputs, some fully connected, some convolutional, etc.

Some existing tools may not support all the benchmarks, but I think that's fine.

Thanks for the feedback! Let's see what other feedback comes in as people start to reply with the categories they're interested in; based on the level of participation we can then decide whether merging makes sense or whether keeping the categories split is preferable.

stanleybak commented 4 years ago

I'd like the nnenum tool to participate in this category. Hopefully we can get a few benchmarks in addition to ACASXu that are higher input dimension, but not just vision ones.

changliuliu commented 4 years ago

So maybe we ought to have 2 main categories only: ReLU and sigmoids. The present CNN could sit as a subcategory of the ReLU one. In the ReLU category we could have high (vision, classifiers) and low (Acas, ...) dimensionality inputs, some fully connected, some convolutional, etc. Some existing tools may not support all the benchmarks, but I think that's fine.

Thanks for the feedback! Let's see what other feedback comes in as people start to reply with the categories they're interested in; based on the level of participation we can then decide whether merging makes sense or whether keeping the categories split is preferable.

I agree with Alessio's suggestion. If we are going to bring big models (e.g., MNIST) to this category, it makes sense to reduce to two categories regarding network structure: piecewise linear (including ReLU, convolution, max pooling, etc.) and smooth nonlinear (sigmoid and tanh). Then orthogonally for each category, we can have big models (e.g., MNIST) and small models (e.g., ACAS). We may see how other tool authors will respond.

vtjeng commented 4 years ago

So maybe we ought to have 2 main categories only: ReLU and sigmoids. The present CNN could sit as a subcategory of the ReLU one. In the ReLU category we could have high (vision, classifiers) and low (Acas, ...) dimensionality inputs, some fully connected, some convolutional, etc. Some existing tools may not support all the benchmarks, but I think that's fine.

Thanks for the feedback! Let's see what other feedback comes in as people start to reply with the categories they're interested in; based on the level of participation we can then decide whether merging makes sense or whether keeping the categories split is preferable.

I agree with Alessio's suggestion. If we are going to bring big models (e.g., MNIST) to this category, it makes sense to reduce to two categories regarding network structure: piecewise linear (including ReLU, convolution, max pooling, etc.) and smooth nonlinear (sigmoid and tanh). Then orthogonally for each category, we can have big models (e.g., MNIST) and small models (e.g., ACAS). We may see how other tool authors will respond.

I'm generally on board with Alessio and Changliu's suggestions, and I like what @changliuliu pointed out about orthogonality. To expand a bit more on it, here are four orthogonal properties of categories that I've identified.

  1. Type of non-linearity used: piecewise linear (ReLU, max pooling) vs. non-piecewise linear (sigmoid, tanh)
  2. Type of linear layers used: fully connected layers, convolutional layers, average pooling, and skip connections are all linear transforms (though not every tool may support a particular linear layer natively).
  3. Input dataset, which affects the dimensionality of the input and the model complexity required to obtain non-trivial results: CIFAR10, MNIST, ACAS, ...
  4. Specifications: safety (as defined above), local robustness. (I'm sure I'm missing something else here).

Note that I've distinguished the type of non-linearity used from the type of linear layers used in the network structure, since convolutional layers can co-exist with non-piecewise linear activations (e.g., LeNet is tanh + convolution).

We don't have to have a category covering every possible combination of properties (non-linearity, input dataset, specification), but we would just note what they are for each category.

To avoid an explosion in the number of categories, rather than having additional categories for property 2 (the type of linear layers), we could just note in the overall summary what each tool is able to support natively.

vtjeng commented 4 years ago

Separately: I'd like to enter the MIPVerify tool for this category as it is currently described.

More details on the tool:

GgnDpSngh commented 4 years ago

I would like to enter our tool ERAN in this category: https://github.com/eth-sri/eran

Cheers, Gagandeep Singh

guykatzz commented 4 years ago

Hi, please add Marabou for this category, too.

Regards, Guy

FlorianJaeckle commented 4 years ago

The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper "Branch and Bound for Piecewise Linear Neural Network Verification" and developments thereof.

Cheers, Florian Jaeckle

ttj commented 4 years ago

Piecewise/ReLU Category Participants:

Oval, Marabou, ERAN, MIPVerify, nnenum, NNV, Sherlock

If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks soon (by about May 15).

haithamkhedr commented 4 years ago

Hi, please add PeregriNN for this category. We support fully connected layers with ReLU activations.

Regards, Haitham

pat676 commented 4 years ago

Hi,

We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/.

The toolkit supports:

Regards, Patrick Henriksen

ttj commented 4 years ago

Finalized Piecewise/ReLU Category Participants:

Oval, Marabou, ERAN, MIPVerify, nnenum, NNV, Sherlock, VeriNet, PeregriNN, Venus

For benchmarks, as indicated earlier, we will decide on them as a group, and anyone may propose some. For this category, by default and to move things along, I suggest we use the ACAS-Xu networks as one benchmark set and decide which of the networks and properties to include, if not all of them.

There has been earlier discussion regarding some classification networks in this category as well. If anyone has some and can provide details (e.g., links to models) that would fall into this category and not the CNN category (e.g., without convolutional layers), please speak up. Any other benchmarks are of course welcome, but details need to be provided soon (~May 15) so we may finalize things to give at least a month or so for everyone to look at them.

If we consider classification, as in the CNN category, there is a lot of parameterization (on the network architecture, the data set, the specification, etc.), so those are the details we will want to discuss next and finalize, as well as how all of these aspects will be specified to ease interoperability between the tools.

pkouvaros commented 4 years ago

Hi

We would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category.

Best

Panagiotis

haithamkhedr commented 4 years ago

I suggest we also consider adversarial robustness for MNIST fully connected NNs with ReLU activations. Some trained MNIST models can be found here.

stanleybak commented 4 years ago

For ACAS Xu, I suggest a subcategory of the following 10 network/property combinations that I found to be difficult.

ACASXU-HARD:

  1. net 4-6, prop 1
  2. net 4-8, prop 1
  3. net 3-3, prop 2
  4. net 4-2, prop 2
  5. net 4-9, prop 2
  6. net 5-3, prop 2
  7. net 3-6, prop 3
  8. net 5-1, prop 3
  9. net 1-9, prop 7
  10. net 3-3, prop 9

edit: fixed typo in prop 7 to net 1-9... not 1-7

pat676 commented 4 years ago

I agree with haithamkhedr that we should consider adversarial robustness benchmarks for MNIST networks; however, we should also consider networks of different depths, as some toolkits seem to handle deeper networks better than others. I can provide 3 MNIST fully connected networks with 2, 4 and 6 fully-connected layers and 256 nodes in each layer.

ttj commented 4 years ago

I agree with haithamkhedr that we should consider adversarial robustness benchmarks for MNIST networks; however, we should also consider networks of different depths, as some toolkits seem to handle deeper networks better than others. I can provide 3 MNIST fully connected networks with 2, 4 and 6 fully-connected layers and 256 nodes in each layer.

Thanks! Can you please provide links so participants can check them out?

pat676 commented 4 years ago

Thanks! Can you please provide links so participants can check them out?

I've uploaded the models in nnet format to a Dropbox folder.

ttj commented 4 years ago

Thanks! Please see the discussion in the CNN category regarding the networks and plans for classification, as they will partly apply here:

https://github.com/verivital/vnn-comp/issues/3#issuecomment-636199101

We have gone through these networks, and I think we will be able to use them; however, I would request, as in the CNN case, that the proposers provide these in ONNX if possible. We also have some feedforward networks that we will share soon in ONNX. The reason is that the conversion process to ONNX, depending on how it is done, can introduce some changes in the networks (e.g., in the precision of the weights), which the proposer of the network should check for consistency. We can provide the conversions we've created if there's a preference for that, but we don't want to introduce problems.

For ACAS-Xu, given there hasn't been a lot of discussion on which properties/networks to focus on, I would suggest using all of them, with timing, verification, etc. results generated as a table for each participant. If we do that, we need to decide on specific timeouts, given that some of these can take quite a long time. I would propose ~5 min for each problem instance in ACAS-Xu, where an instance consists of the network, the property, and the input set.

stanleybak commented 4 years ago

Running all the networks / properties with a 5 min timeout sounds fine to me. We could use a larger timeout for the harder ones I mentioned above "ACASXU-HARD" too.

pat676 commented 4 years ago

Thanks! I agree that using the ONNX format is a good idea; I've converted the MNIST networks to ONNX and uploaded them to the same Dropbox folder as before.

ttj commented 4 years ago

Running all the networks / properties with a 5 min timeout sounds fine to me. We could use a larger timeout for the harder ones I mentioned above "ACASXU-HARD" too.

Thanks for the feedback! What timeout would you suggest for the hard ACAS-Xu examples?

ttj commented 4 years ago

Thanks! I agree that using the ONNX format is a good idea; I've converted the MNIST networks to ONNX and uploaded them to the same Dropbox folder as before.

Perfect, thanks very much, we'll test these in NNV shortly and the other participants also should look at them.

stanleybak commented 4 years ago

What timeout would you suggest for the hard ACAS-Xu examples?

Six hours.

stanleybak commented 4 years ago

Thanks! Can you please provide links so participants can check them out?

I've uploaded the models in nnet format to a Dropbox folder.

@pat676 do you also have the images and correct labels available? How would you propose to report the results? A large table would be okay, but a visual comparison is probably better. For each value of epsilon, there are a number of images (some safe, some unsafe), each with a different runtime up to some timeout such as one hour. The safe and unsafe cases could potentially be split into two figures/tables, when known. Small times are probably less important than larger times (for example, 0.01 second vs. 0.1 second is not really important, in my opinion). Do we want to use all the images, or only the ones with larger runtimes? Alternatively, you could fix the image and increase epsilon until a timeout occurs.

ttj commented 4 years ago

ACAS-Xu ONNX models are available here:

https://github.com/dieman95/AcasXu/tree/master/networks/onnx

Given that many tools use these already, if you already have them in your tool in some other form, e.g., from another source (such as https://github.com/guykatzz/ReluplexCav2017/tree/master/nnet), that should be fine for this iteration, but please let us know of any issues if you choose to use the ONNX models (I think @stanleybak has already imported these for his tool). We'll share some fully connected MNIST classifiers soon (they should be similar to those already shared by @pat676).

After everyone has had some time to look over the various models to see if there are any problems, we'll centralize things into this repository.
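As one quick way to check an imported model, a minimal onnxruntime sketch along the following lines could be used; the filename is a placeholder and the input is an arbitrary point, so this only confirms that the model loads and evaluates.

```python
# Minimal sketch: load one of the ACAS-Xu ONNX networks and evaluate it on a
# single placeholder input to confirm the model imports and runs cleanly.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("ACASXU_run2a_1_1_batch_2000.onnx")  # placeholder filename
inp = session.get_inputs()[0]

# build a zero input matching the declared shape (dynamic dims become 1);
# float32 inputs are assumed here
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
x = np.zeros(shape, dtype=np.float32)

outputs = session.run(None, {inp.name: x})
print(outputs[0])  # the five advisory scores
```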

stanleybak commented 4 years ago

I sent an e-mail about this, but since you mentioned it here, Taylor: there is a question about the intended precision of the ACAS Xu networks. The ONNX files use single-precision floating point, whereas the original Reluplex files are ASCII text and include scaling terms that look like double precision. For example, the scaling term "7.5188840201005975" would lose precision if stored in a single-precision float (the 32-bit version gets rounded to "7.518884181976318359375", compared with "7.5188840201005975316661533724982" for the 64-bit version).

This will slightly affect the output of the network, and maybe it doesn't matter for the properties or this iteration. I guess we could also ask @guykatzz what the intended precision for the networks should be.
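For reference, the rounding described above can be checked directly in numpy; this is just a precision illustration, not part of the benchmark tooling.

```python
import numpy as np

x = 7.5188840201005975  # one of the ACAS-Xu scaling terms

print(np.float64(x))  # 7.5188840201005975 (double precision, as in the Reluplex text files)
print(np.float32(x))  # 7.5188842 (nearest single-precision value, 7.518884181976318359375 exactly)
```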

stanleybak commented 4 years ago

I'll also add that people should make sure they use input / output scaling when using ACASXu, which is not in the ONNX files.

The code I use to do this for input scaling is:

# scaling constants from the original Reluplex / .nnet files; the sixth mean
# (7.5188840201005975) is the output scaling mean, used further below
means_for_scaling = [19791.091, 0.0, 0.0, 650.0, 600.0, 7.5188840201005975]
range_for_scaling = [60261.0, 6.28318530718, 6.28318530718, 1100.0, 1200.0]

# normalize each of the five network inputs
for i in range(5):
    input[i] = (input[i] - means_for_scaling[i]) / range_for_scaling[i]

Output scaling doesn't matter since the advisories use relative values of outputs, except for property 1, which checks an absolute limit on the output.

The code I use to construct a linear constraint for property 1 is this:

# unsafe if COC >= 1500

# Output scaling is 373.94992 with a bias of 7.518884
output_scaling_mean = 7.5188840201005975
output_scaling_range = 373.94992

# (1500 - 7.518884) / 373.94992 = 3.991125
threshold = (1500 - output_scaling_mean) / output_scaling_range
# spec is unsafe if output[0] >= threshold

The threshold value of 3.991125 is consistent with what Marabou uses for property 1: https://github.com/guykatzz/Marabou/blob/master/resources/properties/acas_property_1.txt

pat676 commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Regarding the reporting, a plot of the number of cases solved as a function of time may be the best way to visualise the results. In my opinion, a good granularity would be achieved by plotting safe cases, unsafe cases and each network separately.

I agree that images with long runtimes are more interesting than short runtimes; however, for several use-cases, the verification times for "easy" images are also important. For example, in adversarial training, we may want to use a verification algorithm to locate a large number of unsafe cases but don't really care about precision. The visualisation mentioned above should make it clear whether toolkits perform well for only "easy" or "difficult" cases, so in my opinion, we should include both.
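To make that concrete, a minimal matplotlib sketch of such a "cases solved vs. time" plot could look like the following; the tool names and runtimes are placeholders, and per-network or safe/unsafe splits would simply be separate curves or separate figures.

```python
# Sketch of a "number of cases solved as a function of time" plot.
# `results` maps a tool name to the runtimes (seconds) of the instances it
# solved; timed-out instances are simply not listed.
import numpy as np
import matplotlib.pyplot as plt

results = {
    "tool_A": [0.5, 1.2, 3.0, 10.4, 55.0],  # placeholder runtimes
    "tool_B": [0.8, 2.5, 40.0],
}

for tool, runtimes in results.items():
    times = np.sort(runtimes)
    solved = np.arange(1, len(times) + 1)
    plt.step(times, solved, where="post", label=tool)

plt.xscale("log")
plt.xlabel("time (s)")
plt.ylabel("instances solved")
plt.legend()
plt.savefig("solved_vs_time.png")
```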

Neelanjana314 commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Hi @pat676, could you please set the OpsetVersion to 8 or 9 when you create the .onnx files?

pat676 commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Hi @pat676, could you please set the OpsetVersion to 8 or 9 when you create the .onnx files?

Hi, and thanks for the feedback; I've updated the OpsetVersion of all ONNX files to 9.

Neelanjana314 commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Hi @pat676, could you please set the OpsetVersion to 8 or 9 when you create the .onnx files?

Hi, and thanks for the feedback; I've updated the OpsetVersion of all ONNX files to 9.

Thanks @pat676 for updating. Also, could you please set the input size to [1 784 1] when generating the .onnx files?

pat676 commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Hi @pat676, could you please set the OpsetVersion to 8 or 9 when you create the .onnx files?

Hi, and thanks for the feedback; I've updated the OpsetVersion of all ONNX files to 9.

Thanks @pat676 for updating. Also, could you please set the input size to [1 784 1] when generating the .onnx files?

@Neelanjana314 I've updated all models to the new input size.
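For reference, an export along those lines from PyTorch could look like the sketch below; the architecture and filename are placeholders, not the actual benchmark networks.

```python
# Sketch: export a fully connected MNIST ReLU network to ONNX with opset 9
# and an input shape of (1, 784, 1), per the requests above.
import torch
import torch.nn as nn

model = nn.Sequential(        # placeholder 2-layer network, 256 nodes per layer
    nn.Flatten(start_dim=1),  # (1, 784, 1) -> (1, 784)
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

dummy_input = torch.zeros(1, 784, 1)
torch.onnx.export(model, dummy_input, "mnist_relu_2_256.onnx", opset_version=9)
```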

haithamkhedr commented 4 years ago

I've uploaded 50 images and their corresponding labels drawn from the MNIST test-set in csv format; all of the images are classified correctly by all three networks. I believe this should constitute a good dataset for the benchmarks, but let me know if I should upload more.

Hi @pat676, could you please set the OpsetVersion to 8 or 9 when you create the .onnx files?

Hi, and thanks for the feedback; I've updated the OpsetVersion of all ONNX files to 9.

Thanks @pat676 for updating. Also, could you please set the input size to [1 784 1] when generating the .onnx files?

@Neelanjana314 I've updated all models to the new input size.

@pat676 Where can we find those models?

ttj commented 4 years ago

@pat676 Where can we find those models?

@haithamkhedr there's a Dropbox folder link above earlier in the discussion, copied here:

https://www.dropbox.com/sh/d6q4pbycexizz41/AAAcYN2R6mG2_kQxqZ3Y3i7ra?dl=0

stanleybak commented 4 years ago

So far I see ACASXu properties and L-infinity norm properties for images. Are there local robustness properties we want to do as well (L-infinity norm perturbation, but on a smaller number of pixels)?

Also: has the list of benchmarks been finalized?

ttj commented 4 years ago

So far I see ACASXu properties and L-infinity norm properties for images. Are there local robustness properties we want to do as well (L-infinity norm perturbation, but on a smaller number of pixels)?

Also: has the list of benchmarks been finalized?

Thanks @stanleybak, I think that summarizes it well (MNIST and ACAS-Xu). We've gone through the MNIST networks proposed by @pat676 and I think we're able to use them (@Neelanjana314 can confirm), as well as ACAS-Xu, so I think this will be everything given our time constraints. We'll pull things out of the issues and into the main repository shortly unless there are any other final suggestions/concerns.
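For clarity on what the L-infinity image properties amount to, here is a small sketch of constructing the per-pixel input bounds for one image; the CSV path and epsilon are placeholders, and pixel values are assumed to lie in [0, 255] as in the provided files.

```python
# Sketch: L-infinity ball around an MNIST image, given as per-pixel bounds.
import numpy as np

image = np.loadtxt("image1.csv", delimiter=",")  # placeholder path, 784 pixel values
epsilon = 2.0                                    # placeholder perturbation radius

lower = np.clip(image - epsilon, 0.0, 255.0)
upper = np.clip(image + epsilon, 0.0, 255.0)

# robustness property: for every x with lower <= x <= upper, the network's
# top-scoring class must equal the image's correct label
```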

Neelanjana314 commented 4 years ago

So far I see ACASXu properties and L-infinity norm properties for images. Are there local robustness properties we want to do as well (L-infinity norm perturbation, but on a smaller number of pixels)? Also: has the list of benchmarks been finalized?

Thanks @stanleybak, I think that summarizes it well (MNIST and ACAS-Xu). We've gone through the MNIST networks proposed by @pat676 and I think we're able to use them (@Neelanjana314 can confirm), as well as ACAS-Xu, so I think this will be everything given our time constraints. We'll pull things out of the issues and into the main repository shortly unless there are any other final suggestions/concerns.

Thanks @ttj, and yes, we're able to use the networks proposed by @pat676, as well as ACAS-Xu.

stanleybak commented 4 years ago

I can provide 3 MNIST fully connected networks with 2, 4 and 6 fully-connected layers and 256 nodes in each layer.

When I load the networks from @pat676 I see some strange structure at the start of the network before the fully connected layers:

(screenshot of the ONNX model graph showing extra operations before the fully connected layers)

In particular, it looks like the start of the computation is node "23"... what's going on before that?

pat676 commented 4 years ago

I can provide 3 MNIST fully connected networks with 2, 4 and 6 fully-connected layers and 256 nodes in each layer.

When I load the networks from @pat676 I see some strange structure at the start of the network before the fully connected layers:

(screenshot of the ONNX model graph showing extra operations before the fully connected layers)

In particular, it looks like the start of the computation is node "23"... what's going on before that?

Hi @stanleybak,

It looks like that comes from the part of our code designed to handle non-standard input dimensions; I didn't expect it to show up in the ONNX model.

I've modified the code to get rid of most of these operations; the only thing left is one flatten operation, since I was asked to use inputs of dim (1, 784, 1) while the PyTorch model requires inputs of the form (1, 784).

Let me know if this looks okay to you.

stanleybak commented 4 years ago

Let me know if this looks okay to you.

Yes, the network is great now, @pat676.

One more weird thing that doesn't really need to be fixed is that the input images look like they want to be lists of integers, but instead they are floating point values close to integers. For example, in image1 you have entries like: 84.00000259280205,185.00000417232513,159.0000057220459

I'm not sure if this was on purpose. My guess is that you started with a single-precision floating-point value in the range [0, 1], multiplied it by 255, and then forgot to round.
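If the near-integer values are unintended, a minimal clean-up when loading the CSVs (assuming 8-bit integer pixel values were intended) could be:

```python
import numpy as np

# round the slightly-off pixel values back to the intended integers
pixels = np.loadtxt("image1.csv", delimiter=",")  # e.g. 84.00000259280205, 185.00000417232513, ...
pixels = np.rint(pixels).astype(np.uint8)         # -> 84, 185, ...
```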