verivital / vnn-comp


Category Proposal: Convolutional Neural Networks #3

Open ttj opened 4 years ago

ttj commented 4 years ago

Convolutional layers, presumably in networks for classification

Representative benchmark(s): MNIST, CIFAR classifiers

Specifications: robustness

Questions: any further restrictions or should this be broken out further, e.g., based on other layers (batch normalization, max pool, etc.)?

Allow batch normalization?

Allow max pool, avg pool, etc.?

ttj commented 4 years ago

Hi @harkiratbehl , I guess the @ttj can reply to this better, I am not sure if non-onnx format is allowed atm?

This is fine for this iteration, and one of the reasons we're doing this is to try to sort out issues like this regarding model interchange, etc.

pat676 commented 4 years ago

@FlorianJaeckle did you manage to add satisfiable properties? Since most networks will normally have counterexamples and finding these is important in various ways, at Imperial we do feel they should be added to the mix.

Thanks.

harkiratbehl commented 4 years ago

Hi @harkiratbehl , I guess the @ttj can reply to this better, I am not sure if non-onnx format is allowed atm?

This is fine for this iteration, and one of the reasons we're doing this is to try to sort out issues like this regarding model interchange, etc.

Hi @GgnDpSngh,

It seems like we can use the Pytorch version of the architecture. Can you please confirm whether the Pytorch models in https://github.com/eth-sri/colt/tree/master/trained_models are the same as the models in the onnx folder https://github.com/eth-sri/colt/tree/master/trained_models/onnx?

Regards Harkirat

GgnDpSngh commented 4 years ago

@harkiratbehl yes that is true.

Cheers, Gagandeep Singh
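For anyone who wants to double-check this equivalence themselves, a minimal sketch that compares the parameter shapes of the two exports, assuming the .pt checkpoint holds a plain state_dict (the file names below are placeholders for the actual COLT checkpoints):

```python
# Minimal sketch: compare parameter shapes of a PyTorch checkpoint and its ONNX export.
# File names are placeholders; the state_dict assumption may need adjusting per checkpoint.
import torch
import onnx
from onnx import numpy_helper

state = torch.load("trained_models/cifar_2_255.pt", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:   # some checkpoints nest the weights
    state = state["state_dict"]

onnx_model = onnx.load("trained_models/onnx/cifar_2_255.onnx")
onnx_shapes = sorted(numpy_helper.to_array(t).shape for t in onnx_model.graph.initializer)
torch_shapes = sorted(tuple(v.shape) for v in state.values())

print("onnx parameter shapes :", onnx_shapes)
print("torch parameter shapes:", torch_shapes)
```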

stanleybak commented 4 years ago

The pandas tables now have three columns only: all images are taken from the cifar20 test set and the Idx column refers to the image index

@FlorianJaeckle did you mean cifar-10? Where are the images for cifar-20? Is there an easy way to download the 100 images from the dataset with the corresponding indices in your benchmarks? Is there any normalization applied to the inputs? Are the network inputs expected to be in the range [0, 1] or something else?
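For reference, a minimal sketch of pulling specific CIFAR-10 test-set images by index via torchvision (the index values and root path below are placeholders; the benchmark Idx values would go in their place):

```python
# Minimal sketch: fetch specific CIFAR-10 test-set images by index.
# The indices below are placeholders; use the Idx values from the benchmark tables.
import torchvision

test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)

for idx in [0, 7, 42]:
    img, label = test_set[idx]   # img is a PIL image (RGB), pixel values 0..255
    print(idx, label, img.size)
```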

stanleybak commented 4 years ago

I would suggest having a timeout between 1-5 minutes depending on the size of the network.

@GgnDpSngh to be more specific, this would be 1 minute for the MNIST networks and 5 for the CIFAR ones?

vtjeng commented 4 years ago

@ttj with the Gurobi solver, I can control how many threads it uses on the machine. Would you prefer the comparison to be made single-threaded? (Switching it between one and four threads makes a significant difference on the longer solve times).
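For reference, a minimal sketch of pinning the thread count through the gurobipy interface (the toy model is only there to make the snippet runnable):

```python
# Minimal sketch: fixing Gurobi's thread count via the Threads parameter.
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("toy")
m.Params.Threads = 1          # single-threaded; set to 4 (etc.) for multi-threaded runs

x = m.addVar(ub=1.0, name="x")
m.setObjective(x, GRB.MAXIMIZE)
m.optimize()
print("objective:", m.ObjVal)
```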

ttj commented 4 years ago

@ttj with the Gurobi solver, I can control how many threads it uses on the machine. Would you prefer the comparison to be made single-threaded? (Switching it between one and four threads makes a significant difference on the longer solve times).

Parallel/concurrent is fine to use. If a method has been developed to support this, it might as well take advantage of it. There are some tradeoffs here with respect to "fairness" and single- vs. multi-threaded performance, but as we aren't doing any detailed performance evaluation this iteration, I don't see this as problematic.

vtjeng commented 4 years ago

@stanleybak I think mnist_0.1 is the smallest, mnist_0.3 and cifar_8_255 have the same structure with wider layers, and cifar_2_255 is the largest. I had been running my solver with a 15-minute timeout, but we could certainly agree on a lower timeout.

Also, what timeout did you end up selecting for @FlorianJaeckle 's dataset? I can align my timeouts with yours, at least.

@GgnDpSngh how did we want to handle cases where the image is incorrectly classified? I currently report this as (trivially) SAT.

GgnDpSngh commented 4 years ago

@stanleybak mnist_0.1 is easy to prove, so 1 minute should be fine; for the rest, 5 minutes should be good.

@vtjeng we do not want to handle cases where the image is incorrectly classified.

Cheers, Gagandeep Singh

stanleybak commented 4 years ago

Also, what timeout did you end up selecting for @FlorianJaeckle 's dataset?

I was going to do the "1 hour for all properties", although I'm hoping they verify faster than that. I haven't run the analysis yet for those ones.

stanleybak commented 4 years ago

@FlorianJaeckle just to confirm, for the images, should we be using the CIFAR-10 test batch? The dataset (http://www.cs.toronto.edu/~kriz/cifar.html) also includes 5 data batches.

There was also some strange behavior in the deep100 table, where the index and prop columns were sometimes stored as floating-point numbers instead of ints, but that's a small issue.

Also for others: are you just analyzing the 100 image versions of this, or both the 100 and 20 image subsets?
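For the float-column issue mentioned above, a minimal sketch of a fix on the consumer side (the file name and the column names "Idx"/"prop" are assumptions based on this thread):

```python
# Minimal sketch: coerce index/property columns back to integers after loading the table.
# File name and column names are assumptions; adjust to the actual benchmark files.
import pandas as pd

df = pd.read_pickle("deep100.pkl")            # or pd.read_csv(...), depending on the format shipped
df[["Idx", "prop"]] = df[["Idx", "prop"]].astype(int)
print(df.dtypes)
```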

stanleybak commented 4 years ago

@FlorianJaeckle I'm having trouble running the networks on the cifar dataset. The produced labels don't match the expected cifar label. Are we supposed to scale the pixel values to [0,1]? Is there normalization we're supposed to be using for each channel? Are the channel orders R,G,B? Is the classification label the max or the min? Can I confirm the images are the test batch and not one of the other batches?

Was anyone else able to execute these networks?

FlorianJaeckle commented 4 years ago

Hi @stanleybak, we do use the test batch of the CIFAR-10 dataset. We normalise the images with the following parameters: mean=[0.485, 0.456, 0.406], std=[0.225, 0.225, 0.225]. Apologies, I realise I should have mentioned this before. The channel order is indeed R,G,B. The predicted classification label is the max. I hope this solves your issue.
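A minimal sketch of that preprocessing, assuming (as the quoted mean/std values suggest) that pixels are first scaled to [0, 1] and the array is laid out channels-first:

```python
# Minimal sketch of the normalisation described above (R,G,B channel order).
# Assumes pixels are already scaled to [0, 1] and the array has shape (3, 32, 32).
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.225, 0.225, 0.225]).reshape(3, 1, 1)

def normalize(img):
    return (img - MEAN) / STD

x = np.random.rand(3, 32, 32)   # stand-in for a CIFAR-10 test image
print(normalize(x).shape)
```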

stanleybak commented 4 years ago

Hi @FlorianJaeckle,

That helped! Now things are classifying correctly. One more question: is the epsilon perturbation applied before normalization, or after?

FlorianJaeckle commented 4 years ago

@stanleybak, it's applied after normalisation.

stanleybak commented 4 years ago

@FlorianJaeckle are they all unsat in the 100 instance files? I'm getting some sat instances and want to make sure I'm running things correctly.

vtjeng commented 4 years ago

@FlorianJaeckle, I've generally seen the epsilon perturbation being applied before normalization (that is, if x is the input and e is the perturbation, we have the perturbed input as Normalize(x+e), where e is norm-bounded).

To confirm, in your case, this is Normalize(x)+e instead?

GgnDpSngh commented 4 years ago

Also, is the timeout 1 hour overall or per property? And are we testing 100 or 20 properties for each network?

Cheers,

FlorianJaeckle commented 4 years ago

@stanleybak, all properties are unsat, i.e. there does not exist a counter-example; in other words, the network is robust around the image. Can you give an example of a property where your method returned an adversarial example?

@vtjeng, given an image x and an epsilon value, the input domain for the CNN is [Normalize(x)-e, Normalize(x)+e]

@GgnDpSngh, the timeout is 1 hour per property.
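Putting the last few answers together, a minimal sketch of the input box a verifier would be given (epsilon interpreted in normalised space, no extra clipping; the mean/std values are the ones quoted above):

```python
# Minimal sketch: build the verifier's input box [Normalize(x) - eps, Normalize(x) + eps].
# eps is applied in normalised space, per the discussion above.
import numpy as np

MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.225, 0.225, 0.225]).reshape(3, 1, 1)

def input_bounds(img, eps):
    z = (img - MEAN) / STD        # img: (3, 32, 32) array in [0, 1]
    return z - eps, z + eps

lb, ub = input_bounds(np.random.rand(3, 32, 32), eps=2.0 / 255.0)
print(lb.min(), ub.max())
```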

stanleybak commented 4 years ago

For the normalization, if you want, you can convert the epsilon values to pre-normalized values by multiplying by the std deviation (0.225), since it's the same for all channels. I also assume there's clipping at the limits, which wasn't mentioned before.

@FlorianJaeckle, I think I found my problem. Since there is a target class, is the correct condition to check (1) output[original_label] < output[target_label], or (2) argmax(output) == target_label?

GgnDpSngh commented 4 years ago

@stanleybak since they want the target label not to be chosen for perturbed images, I guess the condition to verify should be that output[original_label] > output[target_label] holds; this is how I have been running ERAN, but maybe @FlorianJaeckle can confirm?

Cheers, Gagandeep Singh

FlorianJaeckle commented 4 years ago

@stanleybak the first condition is the correct one. @GgnDpSngh is right: since we want to check whether the target_label is chosen ahead of the original label, the condition is indeed output[original_label] < output[target_label].

GgnDpSngh commented 4 years ago

@FlorianJaeckle I am confused now: are we verifying that output[original_label] < output[target_label] holds, or the other direction that I mentioned?

Cheers,

joshua-smith4 commented 4 years ago

We are running into a compatibility issue between our tool's requirements and the networks that we are all evaluating on. If you have experience translating between the ONNX format and TensorFlow protocol buffers, we would really appreciate your suggestions for overcoming this obstacle.

Our tool uses a modification of the FGSM algorithm proposed by Ian Goodfellow to perform some "refutation-based" verification prior to the main verification routine as an optimization. FGSM requires access to the gradient of the loss function with respect to the input, and we have written our tool to accept the label of the TensorFlow graph node associated with this gradient as an input. Does anyone know of a way that we could modify the provided networks to gain access to this gradient?

Any help and suggestions would be greatly appreciated. Thanks, Josh
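One possible route, sketched below under the assumption that the networks have been converted to a TensorFlow 1-style frozen graph: import the graph, attach a cross-entropy loss, and let tf.gradients create the gradient node FGSM needs. The file and tensor names "model.pb", "input:0" and "logits:0" are placeholders that depend on how the conversion was done.

```python
# Sketch only: add a d(loss)/d(input) node to an imported frozen graph (TF1-compat API).
# "model.pb", "input:0" and "logits:0" are placeholders for the converted network.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

graph_def = tf.GraphDef()
with open("model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    x = graph.get_tensor_by_name("input:0")
    logits = graph.get_tensor_by_name("logits:0")
    labels = tf.placeholder(tf.int32, shape=[None], name="labels")
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    grad = tf.gradients(loss, x)[0]        # the node name to hand to the FGSM routine
    print("gradient tensor name:", grad.name)
```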

FlorianJaeckle commented 4 years ago

@GgnDpSngh we have formulated our problem as follows: \exists x' \in [x-\epsilon, x+\epsilon] such that: f(x')_{target\_label} > f(x')_{original\_label}, where f(x')_{i} is the score given by the CNN f to the output class i, for a given input x'. We solve this equation by either finding an x' that satisfies it (an adversarial example), or by proving that the equation is UNSAT (and thereby verifying robustness).
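In code, a minimal sketch of that check for a candidate x' (running the ONNX network through onnxruntime; the file name, labels, and input shape below are placeholders):

```python
# Minimal sketch: the property above is SAT for a candidate x_adv iff the target-class
# score exceeds the original-class score. File name and labels are placeholders.
import numpy as np
import onnxruntime as ort

def is_counterexample(model_path, x_adv, original_label, target_label):
    sess = ort.InferenceSession(model_path)
    inp = sess.get_inputs()[0].name
    scores = sess.run(None, {inp: x_adv.astype(np.float32)})[0][0]
    return scores[target_label] > scores[original_label]

# x_adv must already lie in [Normalize(x) - eps, Normalize(x) + eps], e.g. shape (1, 3, 32, 32)
# print(is_counterexample("deep.onnx", x_adv, original_label=3, target_label=5))
```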

GgnDpSngh commented 4 years ago

@FlorianJaeckle alright thanks.

Cheers,

GgnDpSngh commented 4 years ago

So in this category, we are trying the following benchmarks: mnist_0.1.onnx, mnist_0.3.onnx, cifar_2_255.onnx, and cifar_8_255.onnx, for robustness on all the correctly classified images among the first 100 in the test set.

Then for the benchmarks from the oval group, we use base.onnx, wide.onnx, and deep.onnx with the 20 properties in the pandas table. Is that all, or am I missing some benchmarks?

Cheers,

stanleybak commented 4 years ago

@GgnDpSngh These are the ones I've been looking at, although for oval they also have pandas tables with 100 properties that are a superset of the 20 properties. Since the timeout is one hour each, though, maybe we should focus on the 20-property version? Also, do you think it makes sense to use a timeout of 5 minutes for mnist_0.1 as well, just so they all have the same timeout?

GgnDpSngh commented 4 years ago

@stanleybak the deadline is Friday now, so there might be time to run the 100 properties that the oval group suggests, but we wanted to be uniform. It should be fine to use a timeout of 5 minutes for the mnist_0.1 network as well.

Cheers,

stanleybak commented 4 years ago

Does anyone know of a way that we could modify the provided networks to gain access to this gradient?

@joshua-smith4, I made a python3 script to convert .onnx files to tensorflow1 .pb files that might help. It prints some warnings but it seems to produce files that run correctly. The code is here: onnx_to_tensorflow1.py.txt

I'm using python package versions numpy 1.17.4, onnx 1.7.0, onnxruntime 1.3.0, tensorflow 2.2.0
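The linked script isn't reproduced here, but for anyone who wants a starting point, a rough sketch of the same kind of conversion using the onnx-tf package (the file names are placeholders; the result should be sanity-checked against onnxruntime):

```python
# Rough sketch only (not the linked script): convert an .onnx file to a TF graph via onnx-tf.
import onnx
from onnx_tf.backend import prepare

onnx_model = onnx.load("cifar_2_255.onnx")      # placeholder file name
tf_rep = prepare(onnx_model)                    # wrap the ONNX graph as a TensorFlow representation
tf_rep.export_graph("cifar_2_255.pb")           # frozen graph or SavedModel, depending on onnx-tf version
print("inputs:", tf_rep.inputs, "outputs:", tf_rep.outputs)
```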

Neelanjana314 commented 4 years ago

Hi @stanleybak, I got the following images as UNSAT whereas they are marked as UNKNOWN in the table: mnist_0.3: 28, 68, 70, 77; cifar_8_255: 12, 89.

Should I change them to UNSAT in the ggn-all-results.txt?

Thanks Neelanjana

stanleybak commented 4 years ago

@Neelanjana314, Yes please update the table with the classification. The result columns are in the data/*-front.txt files. Also point out if there are any inconsistencies.

ttj commented 4 years ago

@Neelanjana314, Yes please update the table with the classification. The result columns are in the data/*-front.txt files. Also point out if there are any inconsistencies.

Yes, we need to specify the sat/unsat/unknown result of each tool, right? I think it's okay to have a single result column if the tools almost all agree and there are only a few inconsistencies, but if there are more than a few, we may want to present it in another way, e.g., by including a result column for each tool.

stanleybak commented 4 years ago

My hope is that there are few inconsistencies, as inconsistencies indicate soundness issues. They also should be easy to resolve, since the tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so for example green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion.
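If we go the colour route, a minimal sketch of one way to do it with pandas styling (the column names, colours, and output file below are just placeholders):

```python
# Minimal sketch: colour the result cells of a pandas table (green = SAT, red = UNSAT).
# Column names and colours are placeholders.
import pandas as pd

df = pd.DataFrame({"instance": [1, 2, 3], "result": ["SAT", "UNSAT", "UNKNOWN"]})

def color_result(val):
    palette = {"SAT": "background-color: lightgreen", "UNSAT": "background-color: salmon"}
    return palette.get(val, "")

html = df.style.applymap(color_result, subset=["result"]).to_html()
with open("results.html", "w") as f:
    f.write(html)
```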

ttj commented 4 years ago

My hope is that there are few inconsistencies, as inconsistencies indicate soundness issues. They also should be easy to resolve, since the tool that produces a SAT result can provide a concrete unsafe input which we can all check. If it's just a few cases, maybe we can mark the result as INCONSIS or something similar until it's resolved. Another idea is to use colors along with the times, so for example green would be SAT and red UNSAT. Adding an extra column for each tool when they're in agreement 95% of the time would be a mess, in my opinion.

Thanks! Yes, I agree that for presentation we should not include a result column for each tool as long as they match, since it's not necessary and clutters the presentation; but I like your solution from #8 of having it available so we can easily check across tools.