stanleybak / vnncomp2023

Fourth edition of VNN COMP (2023)

Benchmark discussion #2

Open stanleybak opened 1 year ago

stanleybak commented 1 year ago

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
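To make this concrete, a generation script could look roughly like the sketch below (everything here, including the network file name, the epsilon, the 784-dimensional placeholder input, and the instance count, is illustrative only, not a required structure beyond the seed argument and the instances.csv/vnnlib outputs):

```python
# generate_properties.py -- minimal illustrative sketch of a seeded instance
# generator. All file names and values are placeholders.
import csv
import random
import sys

def write_vnnlib(path, lower, upper, target_class, num_outputs):
    """Write a local-robustness spec: inputs in a box, violated if another class wins."""
    with open(path, "w") as f:
        for i in range(len(lower)):
            f.write(f"(declare-const X_{i} Real)\n")
        for j in range(num_outputs):
            f.write(f"(declare-const Y_{j} Real)\n")
        for i, (lo, hi) in enumerate(zip(lower, upper)):
            f.write(f"(assert (>= X_{i} {lo}))\n")
            f.write(f"(assert (<= X_{i} {hi}))\n")
        # the property is violated if any non-target output can reach the target
        f.write("(assert (or\n")
        for j in range(num_outputs):
            if j != target_class:
                f.write(f"  (and (>= Y_{j} Y_{target_class}))\n")
        f.write("))\n")

def main():
    seed = int(sys.argv[1])        # the seed is the only command line argument
    random.seed(seed)
    epsilon, timeout = 0.01, 300   # benchmark parameters hardcoded in the script
    with open("instances.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for k in range(10):        # e.g. 10 randomly selected inputs
            x = [random.random() for _ in range(784)]   # placeholder input
            lower = [max(0.0, v - epsilon) for v in x]
            upper = [min(1.0, v + epsilon) for v in x]
            spec = f"prop_{k}.vnnlib"
            write_vnnlib(spec, lower, upper, target_class=0, num_outputs=10)
            writer.writerow(["onnx/network.onnx", spec, timeout])

if __name__ == "__main__":
    main()
```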

The purpose of this thread is to present your benchmarks and provide preliminary files for feedback. Participants can then provide comments, for example, suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:

Update: benchmark submission deadline extended to June 2 (was May 29).

regkirov commented 1 year ago

Hello. Last year our Collins Aerospace team provided a CNN benchmark to VNNComp (remaining useful life estimation). This year we are considering contributing as well, with a different benchmark problem, but we first wanted to "sample" the interest in such a benchmark.

We are considering submitting an object detection use case, where the ML model is a YOLO neural network. Given the complexity of YOLO (particularly its output size), this would certainly be a challenging benchmark, but we were wondering: would it be feasible? Internally we have been trying to verify a simple robustness property on a YOLO model (actually, a TinyYOLOv2) with several tools that participated in the 2022 VNNComp, but so far we have gotten no results, mainly crashes, out-of-memory errors, etc.

We would appreciate opinions from 2023 VNNComp participants. Would you be interested in trying a YOLO benchmark? Would it be feasible for the tools, or is it still too complex? We can discuss which particular version of YOLO to consider. Thanks!

ttj commented 1 year ago

@regkirov

Hello. Last year our Collins Aerospace team provided a CNN benchmark to VNNComp (remaining useful life estimation). This year we are considering contributing as well, with a different benchmark problem, but we first wanted to "sample" the interest in such a benchmark.

We are considering submitting an object detection use case, where the ML model is a YOLO neural network. Given the complexity of YOLO (particularly its output size), this would certainly be a challenging benchmark, but we were wondering: would it be feasible? Internally we have been trying to verify a simple robustness property on a YOLO model (actually, a TinyYOLOv2) with several tools that participated in the 2022 VNNComp, but so far we have gotten no results, mainly crashes, out-of-memory errors, etc.

We would appreciate opinions from 2023 VNNComp participants. Would you be interested in trying a YOLO benchmark? Would it be feasible for the tools, or is it still too complex? We can discuss which particular version of YOLO to consider. Thanks!

Thanks! Personally, I think this would be great. At the very least, we could possibly reduce the input set size to a very low volume / very small perturbation to alleviate some of the OOM problems.

I am not aware of much, if any, work on object detection/localization as of yet (we did a little of it with NNV a couple of years ago). The closest related work is on segmentation, but again, there are few results there so far.

At any rate, I am supportive of this benchmark, and having it would incentivize the community to work on object detection/localization, as well as any model support necessary to handle it.

AdrienBenamira commented 1 year ago

Dear all,

Our lab has developed a neural network architecture, called the Truth Table Net (TTnet), that is designed to be easy to verify (more information can be found in this paper: https://arxiv.org/pdf/2208.08609.pdf).

Mainly, the CNN layers can be encoded into CNF formulas.

We would like to propose two benchmarks:

The first benchmark will focus on the robustness of neural networks to noise perturbations. Specifically, we propose to use the MNIST, CIFAR10, and ImageNet datasets to evaluate the robustness of different neural network models. We believe this benchmark will provide valuable insights into how different models behave in the presence of noise and other types of perturbations.

The second benchmark will focus on fairness verification using the Adult dataset. Our goal is to evaluate how well different models perform in terms of fairness, particularly with respect to race and gender.

For each model, we will provide the truth-table-equivalent form, CNF formulas, and .onnx files if needed, along with a script to generate benchmark instances based on a random seed.

We would appreciate opinions from 2023 VNNComp participants. Would you be interested in trying a Truth Table Net benchmark?

Thanks!

stanleybak commented 1 year ago

@AdrienBenamira This sounds great. How are you verifying fairness (what is the spec?)?

the truth-table-equivalent form, CNF formulas, and .onnx files

If the input is not ONNX + VNNLIB, unfortunately, the automated tool evaluation scripts will not work, as tools need a common input format. VNNLIB is similar to CNF; any chance you could encode the properties into VNNLIB specification files?

AdrienBenamira commented 1 year ago

Thank you for your answer :)

How are you verifying fairness (what is the spec?)?

The definition I use is (P3) in this paper: https://www.comp.nus.edu.sg/~teodorab/papers/NPAQ.pdf

But we need to count the number of inputs that satisfy (P3) using model counting, not only find a counterexample with a SAT/MILP solver.
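To make the counting aspect concrete, here is a rough Python sketch of the quantity we care about (an empirical estimate with placeholder names, not the exact model-counting procedure from the paper):

```python
# Rough Monte Carlo sketch of the counting view of a (P3)-style fairness
# property: estimate the fraction of inputs whose prediction changes when
# only the sensitive attribute is flipped. All names are placeholders.
import numpy as np

def fairness_violation_rate(predict, inputs, sensitive_idx):
    """predict maps a batch of inputs to class labels; inputs is a 2-D array."""
    flipped = inputs.copy()
    flipped[:, sensitive_idx] = 1 - flipped[:, sensitive_idx]  # binary attribute
    return np.mean(predict(inputs) != predict(flipped))

# e.g. rate = fairness_violation_rate(model_predict, adult_test_inputs, 9)
```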

VNNLIB is similar to CNF; any chance you could encode the properties into VNNLIB specification files?

I can try. So if I can provide a TTnet benchmark in VNNLIB + ONNX, will the TTnet benchmark be integrated into the competition?

Also, can I propose a verification tool for the VNN competition solely for our benchmark?

ttj commented 1 year ago

@AdrienBenamira

VNNLIB is similar to CNF; any chance you could encode the properties into VNNLIB specification files?

I can try. So if I can provide a TTnet benchmark in VNNLIB + ONNX, will the TTnet benchmark be integrated into the competition?

Yes, if it is nominated as a benchmark; please see the rules doc (linked below and in the other issue #1). Ideally, it will be relatively easy for others to consider it, and having it in VNNLIB/ONNX (required) will make things simpler. Several past benchmarks did some non-standard things (the database indexing one, last year's U-net one, etc.), where the spec can also be transformed by the network. I would encourage you to look at last year's benchmarks for reference.

Also, can I propose a verification tool for the VNN competition solely for our benchmark?

Yes. It probably will not be competitive for scoring, but the goal of the competition is to encourage participation and foster the community.

https://docs.google.com/document/d/1oF2pJd0S2mMIj5zf3EVbpCsHDNs8Nb4C1EcqQyyHA40/edit

naizhengtan commented 1 year ago

Hi @stanleybak, just to double-check that I understand the rules correctly: if we only want to propose a benchmark without a verifier, we can do so this year without teaming up with a verifier team. Is that right?

stanleybak commented 1 year ago

Yes, that's fine. A tool author needs to "nominate" your benchmark for it to be counted in scoring, but last year all benchmarks were nominated, so I don't anticipate that being a problem (unless you propose several benchmarks).

stanleybak commented 1 year ago

I want to propose a benchmark related to last year's VGGNET one (link). I'd like to modify it by increasing the number of inputs, up to full L-inf norm perturbations. This had issues with the parser last year.

AdrienBenamira commented 1 year ago

Thank you once again for organizing everything. The benchmark that we would like to submit to VNN2023 contains intellectual property belonging to our university. Therefore, we would like to inquire whether the code we would submit for the competition can contain licensing information in the header.

stanleybak commented 1 year ago

@AdrienBenamira Is it possible to make a non-IP version of the benchmark? Something like a network with a similar architecture / size / spec so that verification tools will have similar performance and the results would be usable for your restricted benchmark. That may be simpler than figuring out the licensing process.

naizhengtan commented 1 year ago

I want to propose a benchmark, NN4Sys (neural networks for computer systems). We will add a new application---Neural Adaptive Video---this year, along with two from last year.

stanleybak commented 1 year ago

Hi all, the organizers have extended the benchmark submission deadline to May 29 to give groups a little bit more time.

feiyang-cai commented 1 year ago

Hello! We are considering contributing a benchmark for conditional generative adversarial networks (cGANs).

The objective of this cGAN is to generate camera images that contain a vehicle obstacle located at a specific distance in front of the ego vehicle, where the distance is controlled by the input distance condition.

Here we attach some generated images as well as the architecture of the cGAN (including both generator and discriminator).

[Figure: sample generated images]

[Figure: cGAN architecture]

The generator takes two inputs: 1) a distance condition (1-d scalar) and 2) a noise vector that controls the environment (4-d vector). The output of the generator is a generated image.

The discriminator takes the generated image as input and outputs two values: 1) a real/fake score (1-d scalar) and 2) a predicted distance (1-d scalar).

For verification, we could combine these two components together and set proper verification specifications for input distance, input noise, and predicted distance.
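As a rough sketch of the composition we have in mind (placeholder PyTorch modules, not the actual benchmark code; it assumes the discriminator returns a (score, distance) pair):

```python
# Sketch of composing the generator and discriminator into one network for
# verification. Placeholder modules; not the actual benchmark code.
import torch
import torch.nn as nn

class CGanPipeline(nn.Module):
    def __init__(self, generator: nn.Module, discriminator: nn.Module):
        super().__init__()
        self.generator = generator
        self.discriminator = discriminator

    def forward(self, x):
        # x = [distance condition (1-d), noise (4-d)] -> a 5-d input vector
        image = self.generator(x)
        score, distance = self.discriminator(image)     # each 1-d
        return torch.cat([score, distance], dim=1)      # 2-d output

# The composed model could then be exported to ONNX, e.g.:
# torch.onnx.export(CGanPipeline(gen, disc), torch.zeros(1, 5), "cgan.onnx")
```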

We could also offer several different models with varying architectures (CNN and Transformer) and image sizes (32x32, 64x64) to provide a range of difficulty levels.

Would you be interested in such a cGAN benchmark? We would greatly appreciate any feedback, opinions, and suggestions from both competition organizers and participants. Thank you!

ChristopherBrix commented 1 year ago

The benchmark submission is finally possible on the website: https://vnncomp.christopher-brix.de/

All accounts have been activated, please try to submit your benchmark and let me know if there are any issues.

merascu commented 1 year ago

Hi all.

We want to contribute verification benchmarks suitable for binarized neural network robustness verification. They include layers such as binarized convolution, max pooling, batch normalization, and fully connected layers (no ReLUs).

Our models come from the classification of traffic signs (we used German, Belgian, and Chinese datasets). We obtained accuracies ranging from 96% on the German dataset to around 80% on the Chinese one. The corresponding paper is under review at ICANN 2023 [1] and on arXiv [2].

As far as we could see from last year's competition, only CNNs were handled, not binarized CNNs. We would like to check:

Thank you!

[1] https://e-nns.org/icann2023/ [2] https://arxiv.org/abs/2303.15005

stanleybak commented 1 year ago

@merascu This sounds neat. I think there would be interest. What are the execution semantics? Does the binarization happen internally in the network or is it required that the inputs are binary? We may need to extend the VNNLIB spec format that we support in order to work with these.

merascu commented 1 year ago

@stanleybak

To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:

  • It must contain a generate_properties.py file which accepts the seed as the only command line parameter.

Just to make sure: should other arguments, such as epsilon (for robustness properties), the onnx model name, the random seed, and the number of images, be hardcoded in the script? We checked last year's benchmarks and found [1] for CIFAR, which we were thinking of modifying for our benchmark. That script has more arguments.

Thank you!

[1] https://github.com/ChristopherBrix/vnncomp2022_benchmarks/blob/main/benchmarks/cifar2020/src/generate_specs.py

merascu commented 1 year ago

@stanleybak

@merascu This sounds neat. I think there would be interest. What are the execution semantics? Does the binarization happen internally in the network or is it required that the inputs are binary? We may need to extend the VNNLIB spec format that we support in order to work with these.

The inputs to the models are not binary; they represent the pixels of the input images. The binarization is done internally by functions from the Larq library [1].

[1] https://larq.dev
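For illustration, a binarized block in Larq looks roughly like the following (a simplified sketch, not our exact architecture):

```python
# Simplified sketch of a binarized Larq model (not the exact benchmark model):
# inputs stay real-valued pixels; weights and activations are binarized
# inside the layers via the straight-through sign estimator.
import larq as lq
import tensorflow as tf

kwargs = dict(input_quantizer="ste_sign",
              kernel_quantizer="ste_sign",
              kernel_constraint="weight_clip")

model = tf.keras.Sequential([
    # first layer keeps real-valued inputs, only its weights are binarized
    lq.layers.QuantConv2D(32, 3, kernel_quantizer="ste_sign",
                          kernel_constraint="weight_clip",
                          use_bias=False, input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.BatchNormalization(scale=False),
    lq.layers.QuantConv2D(64, 3, use_bias=False, **kwargs),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.BatchNormalization(scale=False),
    tf.keras.layers.Flatten(),
    lq.layers.QuantDense(43, use_bias=False, **kwargs),  # e.g. 43 GTSRB classes
])
```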

feiyang-cai commented 1 year ago

The benchmark submission is finally possible on the website: https://vnncomp.christopher-brix.de/

All accounts have been activated, please try to submit your benchmark and let me know if there are any issues.

Hi @ChristopherBrix,

Could you please activate my account? My account is "feiyang.cai@stonybrook.edu". Thank you!

stanleybak commented 1 year ago

The inputs to the models are not binary; they represent the pixels of the input images. The binarization is done internally by functions from the Larq library [1].

@merascu can you successfully export the files to ONNX? If so, it may be easy to get something into the right format. I don't think any tools support the layers you need at the moment, but having the benchmark in place may still be valuable.

ChristopherBrix commented 1 year ago

All new accounts are activated.

ChristopherBrix commented 1 year ago

My university's web server is currently offline, so the submission website is not reachable. If they don't fix it today, I'll set it up somewhere else. Sorry for the inconvenience!

merascu commented 1 year ago

@stanleybak Yes, we were able to export to ONNX format. We will add the benchmark.

Later edit! The benchmark is here: https://github.com/apostovan21/vnncomp2023

ChristopherBrix commented 1 year ago

The website is online again. Let me know if there are any other issues.

Rajpreet2206 commented 1 year ago

https://vnncomp.christopher-brix.de/

NOT WORKING !!!

ChristopherBrix commented 1 year ago

Thanks for the heads up, it's fixed now. I've changed the setup so the web server restarts automatically, hopefully this will avoid further downtime.

regkirov commented 1 year ago

Hi @ChristopherBrix Could you please activate the account "dmitrii.kirov@collins.com"? Thanks!

ChristopherBrix commented 1 year ago

Done

pomodoromjy commented 1 year ago

Hi all

We've submitted the CCTSDB-YOLO benchmark, which addresses a patch-level object detection problem related to autonomous driving. We would greatly appreciate any feedback and discussion from everyone.

The link is https://github.com/pomodoromjy/vnncomp-2023-CCTSDB-YOLO

Thank you!

stanleybak commented 1 year ago

By popular request, we moved the deadline from Monday to Friday June 2. No further extensions for benchmark submission will be possible.

HanjiangHu commented 1 year ago

Hi all,

We've submitted the MetaRoom benchmark, which is for the robustness verification of classification models against camera motion perturbation in robotics applications. The link is https://github.com/HanjiangHu/metaroom_vnn_comp2023. Any feedback and discussions are highly appreciated!

Thank you!

apostovan21 commented 1 year ago

Hi @ChristopherBrix, I created an account a few days ago and it hasn't been activated yet. Could you please check?

ChristopherBrix commented 1 year ago

Weird, I thought I had activated all of them, but now I see yours. Please check; it should work now.

regkirov commented 1 year ago

Hi All,

Our team at Collins Aerospace Applied Research & Technology has submitted a benchmark for an object detection neural network (we used YOLOv5 nano to limit the complexity). All activation functions are Leaky ReLU. The application is maritime search and rescue with unmanned aerial vehicles.

Here is the link: https://github.com/loonwerks/vnncomp2023

We are glad to participate in the competition again. We would appreciate any feedback that helps us improve the benchmark.

Cheers

jferlez commented 1 year ago

Hi everyone,

I would like to re-propose the tllverifybench benchmark that was included in last year's competition.

Here is the link: https://github.com/jferlez/TLLVerifyBench

The benchmark is unchanged from last year except for a slight modification to the way properties are generated: safe and unsafe properties are now equally probable. (Last year was skewed towards unsafe properties, i.e., those for which a counterexample should be returned.)

James

Z-Haoruo commented 1 year ago

Hi all,

Our team, with members from the Georgia Institute of Technology and Los Alamos National Laboratory, has submitted a benchmark named ml4acopf for the robustness verification of machine learning regression models for the AC optimal power flow (ACOPF) problem. Our benchmark includes various activation functions, such as ReLU, sigmoid, and trigonometric functions.

You can access the benchmark at the following link: https://github.com/AI4OPT/ml4acopf_benchmark/

We greatly appreciate any feedback or discussion.

Thank you!

wu-haoze commented 1 year ago

Hi all,

We would like to propose a benchmark set dist-shift for verifying robustness against distribution shifts beyond norm-bounded perturbations. The benchmarks involve ReLU and sigmoidal activations.

The benchmarks are available at: https://github.com/anwu1219/dist-shift-vnn-comp

Thank you! Andrew

lydialin1212 commented 1 year ago

Hi all,

We’ve submitted the NN4Sys benchmark for the verification of neural networks for computer systems. The benchmark contains three applications: Learned Video Stream, Learned Index, and Learned Cardinality.

Here is the link: https://github.com/Khoury-srg/VNNComp23_NN4Sys

Any feedback and discussions are much appreciated!

Thank you!

shizhouxing commented 1 year ago

Hi All,

We are proposing a benchmark on verifying Vision Transformers (ViTs): https://github.com/shizhouxing/ViT_vnncomp2023

Thanks!

xiangruzh commented 1 year ago

Hi all,

We have submitted a YOLO benchmark for verifying robustness of an object detection model. Our model is based on TinyYOLOv2, with modifications to the backbone.

The link is: https://github.com/xiangruzh/Yolo-Benchmark

Thank you!

apostovan21 commented 1 year ago

Hi all,

We have submitted benchmarks suitable for binarized neural network robustness verification. They include layers such as binarized convolution, max pooling, batch normalization, and fully connected layers (no ReLUs). Our models come from the classification of traffic signs (GTSRB).

The link is: https://github.com/apostovan21/vnncomp2023

Thanks!

haydn-jones commented 1 year ago

@HanjiangHu Can you provide a description of your ProjectionOp somewhere so teams can more easily implement support for that operator?

HanjiangHu commented 1 year ago

@HanjiangHu Can you provide a description of your ProjectionOp somewhere so teams can more easily implement support for that operator?

Thanks for pointing this out. We have added a more detailed description of the ProjectionOp and its usage in the README file here in the benchmark repo. Feel free to let me know if there are any other questions.

mldiego commented 1 year ago

@HanjiangHu From the models that I looked over, the custom Projection OP is the first "layer" in the networks.

I don't know if it is too late for this, but could you update the files (both onnx and vnnlib) so that the networks start after the custom ProjectionOp? Meaning the input layer is directly connected to the first Conv layer.

And the same for the vnnlib files: instead of defining the input as a 1-dimensional input, create the vnnlib files after the ProjectionOp, so that the input dimensions X (X_0, X_1, ..., X_N) correspond to the input variables of the first Conv layer.

That would avoid having participants add support for a custom function that will not necessarily be used in any future benchmarks, and it would help more participants analyze the benchmark.

It also seems that the custom projection operation requires loading a dataset and doing some kind of lookup operation to generate the output of that layer? https://github.com/HanjiangHu/metaroom_vnn_comp2023/blob/main/randgen/custom_projection.py#L25

I can't speak for everyone, but when I load those in MATLAB (which is what we use), I cannot get any information about that layer/operation, just that it is a custom operator, so it would be very difficult for us to support this benchmark at the moment.

stanleybak commented 1 year ago

I agree that if the spec is defined after the custom OP, it may be supported by more tools. For it to count as a scored benchmark, we'd want at least two tools that support it.

HanjiangHu commented 1 year ago

@HanjiangHu From the models that I looked over, the custom Projection OP is the first "layer" in the networks.

I don't know if it is too late for this, but could you update the files (both onnx and vnnlib) so that the networks start after the custom ProjectionOp? Meaning the input layer is directly connected to the first Conv layer.

And the same for the vnnlib files: instead of defining the input as a 1-dimensional input, create the vnnlib files after the ProjectionOp, so that the input dimensions X (X_0, X_1, ..., X_N) correspond to the input variables of the first Conv layer.

That would avoid having participants add support for a custom function that will not necessarily be used in any future benchmarks, and it would help more participants analyze the benchmark.

It also seems that the custom projection operation requires loading a dataset and doing some kind of lookup operation to generate the output of that layer? https://github.com/HanjiangHu/metaroom_vnn_comp2023/blob/main/randgen/custom_projection.py#L25

I can't speak for everyone, but when I load those in MATLAB (which is what we use), I cannot get any information about that layer/operation, just that it is a custom operator, so it would be very difficult for us to support this benchmark at the moment.

Sorry for the inconvenience when running with the customized operator. We can define the spec at the pixel level after the projection OP; new ONNX and VNNLIB files will be released shortly.

There might be one concern: since the images are projected from a dense point cloud along a one-axis camera movement, the images lie on a one-dimensional manifold and the pixel values are not independent. So the specs on the pixel values after the projection OP can be regarded as a hyper-rectangular relaxation of the original one-dimensional camera-motion spec before the projection OP. For a fair comparison, we will try to make them equivalent by making the spec tiny enough.
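Schematically, the per-pixel bounds would be obtained along these lines (a rough sketch where project_image stands in for the projection OP; exact bounds over the motion interval would be needed in practice, sampling is shown only for illustration):

```python
# Schematic: turn a one-dimensional camera-motion interval into per-pixel
# bounds by sampling motions and taking element-wise min/max over the
# projected images. `project_image` is a placeholder for the projection OP.
import numpy as np

def pixel_bounds(project_image, motion_lo, motion_hi, num_samples=100):
    motions = np.linspace(motion_lo, motion_hi, num_samples)
    images = np.stack([project_image(m) for m in motions])  # (N, H, W, C)
    lower = images.min(axis=0)   # per-pixel lower bound of the box spec
    upper = images.max(axis=0)   # per-pixel upper bound of the box spec
    return lower, upper          # become the X_i bounds in the .vnnlib file
```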

ChristopherBrix commented 1 year ago

I've pushed all submitted benchmarks to https://github.com/ChristopherBrix/vnncomp2023_benchmarks

Those are

There are some issues:

@anwu1219: dist-shift doesn't work on the automated testing platform. Could you please look into that?

@jferlez: I standardized the format for the benchmark submissions a bit, so your updated benchmark doesn't work anymore. Could you please update its structure to fix that?

Please ping me once the bugs in the benchmarks are resolved!

merascu commented 1 year ago

@ChristopherBrix Thanks for the update. We are not sure which total timeout you are referring to. Which tests exceed this value? Could you please give us a report with the execution times? Thanks!

ChristopherBrix commented 1 year ago

Your benchmark seems to consist of 745 instances with a timeout of 5 minutes each (https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/traffic_signs_recognition/instances.csv). However, the rules state that, per benchmark, the runtime sum of all verification instances may take at most 6 hours.
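For reference, 745 instances × 300 s = 223,500 s, roughly 62 hours, so the instance count and/or per-instance timeout needs to be reduced until the sum is at most 21,600 seconds (for example, 72 instances at 300 s each).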