stanleybak / vnncomp2022


Benchmarks Discussion #2

Open stanleybak opened 2 years ago

stanleybak commented 2 years ago

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
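
For concreteness, a seeded generator for an image-classification benchmark could look roughly like the sketch below. This is only an illustration, not an official template: the file layout, epsilon, number of selected images, and timeout handling are all placeholders.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a seeded instance generator (not an official template).

Assumes an image-classification benchmark with test images stored as .npy files;
paths, epsilon, and the number of instances are placeholders.
"""
import random
import sys

import numpy as np


def write_vnnlib(path, image, label, eps, num_classes=10):
    """Write an L-infinity robustness query: satisfiable iff some other class
    can outscore the true label inside the epsilon ball around the image."""
    pixels = image.flatten()
    with open(path, "w") as f:
        for i in range(len(pixels)):
            f.write(f"(declare-const X_{i} Real)\n")
        for c in range(num_classes):
            f.write(f"(declare-const Y_{c} Real)\n")
        # Input bounds: an epsilon ball around the image, clipped to [0, 1].
        for i, x in enumerate(pixels):
            f.write(f"(assert (>= X_{i} {max(0.0, x - eps)}))\n")
            f.write(f"(assert (<= X_{i} {min(1.0, x + eps)}))\n")
        # Violation condition: some other class outscores the true label.
        f.write("(assert (or\n")
        for c in range(num_classes):
            if c != label:
                f.write(f"  (and (>= Y_{c} Y_{label}))\n")
        f.write("))\n")


if __name__ == "__main__":
    random.seed(int(sys.argv[1]))                    # competition-provided random seed
    for idx in random.sample(range(10000), k=25):    # which test images to include
        image = np.load(f"data/image_{idx}.npy")     # placeholder data layout
        label = int(np.load(f"data/label_{idx}.npy"))
        write_vnnlib(f"vnnlib/img_{idx}.vnnlib", image, label, eps=0.01)
```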

The purpose of this thread is to present your benchmarks and provide preliminary files for feedback. Participants can then provide comments, for example, suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:

Here is a link to last year's github issue where we discussed benchmarks. Here is a link to the folder containing final benchmark files last year.

pomodoromjy commented 2 years ago

Would two onnx network inputs and two onnx network outputs be accepted for the competition?

Motivation

Currently, most networks evaluated in the literature focus on image classification. However, in most practical scenarios, e.g., autonomous driving, people pay more attention to object detection or semantic segmentation. Considering the complexity of object detection, we propose a new simple Unet (four Conv2d layers, each followed by BN and ReLU). We advocate that tools should handle more practical architectures, and the simplified Unet is a first step towards this goal.

Why do we need two onnx network inputs and two onnx network outputs?

In the image classification task, we can easily say whether an image is classified correctly or not. However, in the semantic segmentation task, we need to calculate how many pixels are classified correctly and use a threshold to judge whether the task succeeds. Thus we have to use two onnx network inputs, where one input is the RGB values of the image as usual and the other is the ground truth, which is used for the statistical purpose, i.e., counting how many pixels are classified correctly. For the same reason, we have to design two onnx network outputs, where one is the logits as usual and the other is for the statistical purpose.


Question

It seems impossible to separate these two aspects, since we need to train the semantic segmentation model and we also need to compare the output of the semantic segmentation model with the ground truth. How should we modify our benchmark so that more tools can be applied to it?

trhoangdung commented 2 years ago

Hi Jinyan,

Thank you very much for the questions and comments. Verification of semantic segmentation networks remains challenging for most verification techniques due to the huge output space (the number of classified pixels). We have worked on this problem and on its implementation in NNV. I am aware of another method from ETH Zurich. The metrics we have proposed to evaluate the robustness of a network under adversarial attacks, such as brightening or darkening attacks, are:

1) Robustness value: the percentage of robust pixels under an attack.
2) Robustness sensitivity: the average number of pixels affected (misclassified) if one pixel is attacked.
3) Robustness IoU: a robust IoU, similar to the IoU used in training SSNs.

Important note: robustness is different from accuracy. We don't use ground truth segmented images with 100% accuracy for the statistical evaluation of robustness. The reason is that your trained network cannot achieve 100% accuracy, so using user-provided ground truth segmented images to evaluate robustness is inappropriate. The solution is: for one image, we execute the network to get a ground truth output segmented image. Under a bounded attack on the input image, we compute all possible classes of all pixels in the segmented output. Then we compare this with the ground truth image to compute the robustness value, robustness sensitivity, and robustness IoU.
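
For readers unfamiliar with these metrics, here is a rough and much simplified sketch of how the first and third could be computed from a verifier's per-pixel output; the exact definitions are in the linked analysis, the per-pixel reachable-class sets are assumed to come from some verification tool, and robustness sensitivity is omitted because it requires per-pixel attack experiments.

```python
import numpy as np


def robustness_summary(reachable_classes, baseline):
    """Rough sketch only. `reachable_classes[p]` is the set of classes pixel p can
    take under the bounded attack; `baseline[p]` is the class predicted on the
    unperturbed image (used as the "ground truth", as described above)."""
    n = len(baseline)
    robust = np.array([reachable_classes[p] == {baseline[p]} for p in range(n)])

    # Robustness value: percentage of pixels whose class provably cannot change.
    robustness_value = 100.0 * robust.mean()

    # Robust IoU (one possible reading): per class, count only provably robust
    # pixels in the intersection, and all possibly-predicted pixels in the union.
    ious = {}
    for c in sorted(set(baseline)):
        maybe_c = np.array([c in reachable_classes[p] for p in range(n)])
        true_c = np.asarray(baseline) == c
        union = (maybe_c | true_c).sum()
        ious[c] = float((maybe_c & true_c & robust).sum() / union) if union else 1.0

    return robustness_value, ious
```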

For the details, you can find our analysis here. (https://drive.google.com/file/d/1eiflLYjRtg4G0s4TglyCq21Z30OEaKFj/view)

Another note: for upsampling, please use dilated or transposed convolution; don't use max-unpooling.

Tran

pomodoromjy commented 2 years ago

Thank you, Tran.

We have one more question. Do we still need two onnx network inputs if we need to compare the output with the ground truth inside the onnx network? More specifically, we take the original image as input and execute the network to get a ground truth output. Then we replace the user-provided ground truth with this new ground truth. How can we merge the new ground truth and the disturbed images into one input?

trhoangdung commented 2 years ago

Hi Jinyan,

When you have a single image and attack it with some bounded unknown attack, for example brightening some pixels of the image, then what you get is NOT A SINGLE DISTURBED IMAGE but an infinite (yet bounded) set containing all possible disturbed images. Consequently, a pixel may be classified into many classes. Robustness verification checks whether the class of that pixel can change under the attack.

Therefore, what we need to provide to a verification algorithm is the network, an input image, and a model of the attack, not a concrete disturbed image. If you provide a concrete disturbed image and a ground truth, we just need to execute the network on the disturbed image and compare the result with the ground truth, as in a normal evaluation of accuracy. This execution-based robustness evaluation is fundamentally different from our verification purpose.

Tran

pat676 commented 2 years ago

Hi all,

We have a benchmark which we think would be interesting for the competition; however, we are still waiting to hear back from some stakeholders about whether it is okay for us to publish it. We are aware that the deadline for proposing benchmarks is today, but would it be possible to get a few days of extension?

I am also not sure how many benchmarks each participant is allowed to propose this year. If it is more than one, we are also happy to provide the same fully connected MNIST benchmarks as last year.

Best, Patrick

mnmueller commented 2 years ago

Hi all,

I would like to propose the following two benchmarks, both based on CIFAR10 image classification using ResNets:

~https://vnncomp.christopher-brix.de/benchmark/details/28~
~https://vnncomp.christopher-brix.de/benchmark/details/27~
https://vnncomp.christopher-brix.de/benchmark/details/74
https://vnncomp.christopher-brix.de/benchmark/details/75

Edit: There was a typo in the epsilon for these, leading to way too hard properties. I updated them accordingly.

I think we should also try to have the benchmarks ready a bit closer to the agreed-upon deadline than last year, so that all participants have sufficient (and the same) time to make the required adaptations in their tools/parsing code.

Regarding the suggestions from @pomodoromjy, I think it would be important that the benchmark conforms to the vnnlib standard and has just one input and one output. We would probably also want to make sure that it is not completely out of reach for all tools, to avoid having every tool run the full 6h on that benchmark.

What do the other participants think?

Cheers, Mark

huanzhang12 commented 2 years ago

Our team aims to propose a benchmark in collaboration with researchers from the non-ML community. We would also greatly appreciate it if we could have a few days' extension. The rules document says the benchmarks need to be prepared by May 31. Is that the final hard deadline for benchmark submission (or maybe a few days before that)? The rules document also says that each team can provide up to 2 benchmarks (one benchmark may be selected from a non-participant's proposal and should not duplicate the other benchmark). I hope there will be more benchmarks from non-participants in the next few days for selection.

And yes, I agree with @mnmueller that benchmarks need to conform to the vnnlib format. The format is indeed a bit restrictive and may be difficult to apply to some application scenarios (we had to work around this as well), but I guess for this year's competition we should keep this format.

naizhengtan commented 2 years ago

Hi all,

We want to submit a Neural Network for Systems (NN4Sys) benchmark (teaming up with Huan/Kaidi's team). We're almost there---we're tuning our specification difficulties for a more balanced benchmark.

This benchmark will include four NN models: two from the learned database index (see this paper) and two from cardinality estimation (see this paper). We will append network architectures shortly.

We design the specifications to reflect the semantic correctness of these system components. In particular, for the learned index, the spec requires the predicted data positions to be close to the actual data positions. For cardinality estimation, the spec requires the predicted cardinality to be accurate, namely within an error bound of the actual number of rows for the SQL statement.

stanleybak commented 2 years ago

Hi all. Yes, the benchmark deadline on the website was updated to May 31, 2022, to match the rules document. Please try to stick to this date, as participants need time to ensure their tools will run on all the benchmarks.

pomodoromjy commented 2 years ago

Hi all,

Regarding @mnmueller's and @huanzhang12's suggestions, we will try some new approaches for our scenario.

A simple specification file would check that a single pixel of the output does not change class. Can a more complicated specification file check that a given percentage of the output pixels does not change class? @trhoangdung, do you know the answer? Is it possible to compute the robustness value, robustness sensitivity, and robustness IoU using vnnlib? How would this look?
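
For the single-pixel case, a hypothetical generator snippet might look like the sketch below; the pixel-major flattening of the segmentation logits is our own assumption, not a fixed convention. The percentage-based criterion would additionally need some way of counting pixels over the whole output, which does not seem directly expressible with simple conjunctions and disjunctions over the output variables.

```python
def write_pixel_spec(f, pixel, predicted_class, num_classes):
    """Hypothetical helper: appends the violation condition for one output pixel,
    satisfiable iff some other class can outscore the predicted class there.
    Assumes the logits are flattened pixel-major as Y_{pixel*num_classes + c}."""
    base = pixel * num_classes
    f.write("(assert (or\n")
    for c in range(num_classes):
        if c != predicted_class:
            f.write(f"  (and (>= Y_{base + c} Y_{base + predicted_class}))\n")
    f.write("))\n")
```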

Best, Jinyan

ChristopherBrix commented 2 years ago

Please note that the website for the benchmark tests is currently experiencing some issues. I will investigate this tomorrow and post an update once it's fixed. Sorry for the inconvenience!

ChristopherBrix commented 2 years ago

The bug has been fixed; submitting benchmarks should work fine now. If you notice that your code is stuck (no change in the log files after several minutes), you now have the option to abort the execution and submit it again.

pat676 commented 2 years ago

Hi,

We want to propose the following benchmarks:

mnist_fc: https://vnncomp.christopher-brix.de/benchmark/details/53
cifar_biasfield: https://vnncomp.christopher-brix.de/benchmark/details/55

The first benchmark is the same as proposed last year: three mnist fully connected networks with 2, 4 and 6 layers.

The second benchmark verifies a CIFAR fully convolutional network against bias field perturbations. Bias field verification is explained in detail here. Quickly summarised: a bias field models smooth noise and is, in this case, represented by a spatially varying 3rd order polynomial.

The bias field transformation is encoded by prepending a fully connected layer to the cifar network; the FC layer takes the 16 coefficients of the bias field polynomial as input. The verification problem considers perturbations of these coefficients. We do the encoding in generate_properties.py, so any toolkit supporting standard l-infinity specifications should also support this problem "out of the box".

We note that the encoding depends on the input image, so the random draws in this benchmark affect the onnx models generated by generate_properties.py, not the vnnlib files.
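
For reference, a rough sketch of how such a prepended layer could be built is below. It assumes a multiplicative bias field with a 16-term monomial basis x^i y^j (i, j ≤ 3) and normalized pixel coordinates; the exact basis and normalization used in the benchmark may differ, so treat generate_properties.py and the linked paper as authoritative.

```python
import numpy as np


def bias_field_layer_weights(image):
    """Sketch: weight matrix of an FC layer mapping the 16 bias-field coefficients
    to a perturbed image, assuming a multiplicative bias field
    B(x, y) = sum_{i,j<=3} c_{ij} x^i y^j over normalized coordinates.
    `image` has shape (C, H, W); returns W with shape (C*H*W, 16)."""
    c, h, w = image.shape
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    basis = np.stack([xs**i * ys**j for i in range(4) for j in range(4)], axis=-1)  # (H, W, 16)
    # Perturbed pixel = I(x, y) * B(x, y), which is linear in the coefficients.
    weights = image[..., None] * basis[None, ...]                                   # (C, H, W, 16)
    return weights.reshape(c * h * w, 16)
```

With the image folded into the weights like this, perturbing the 16 coefficients becomes an ordinary box-constrained input problem, which is why standard l-infinity tooling applies.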

regkirov commented 2 years ago

Can you please clarify the following:

  1. Should the timeouts for the properties of a single benchmark sum up to 6 hours? Or less?
  2. Is the *_instances.csv file also required where one specifies the mapping between properties and NNs, as well as the timeout for each entry?

Thanks!

stanleybak commented 2 years ago

@regkirov 1. Yes, the total sum over all instances should not be more than 6 hours. 2. Yes, please have an instances.csv file. Note that this can be created with some randomness based on a seed passed to the generation script you provide (all benchmarks need such a script this year).

regkirov commented 2 years ago

Hello! I have the following error in the logs during benchmark submission:

AssertionError: failed parsing line: (declare-const X_9 Bool)

Some inputs to our benchmark NN are Boolean. Does VNNLib only support Real type?

stanleybak commented 2 years ago

@regkirov I think it's more of a restriction of the tools. I don't think any tools from last year would support such specs, since with booleans you could perhaps encode SAT problems. I'll note this down as a possible extension we can discuss. For now, would it make sense to encode them as reals between 0 and 1?

regkirov commented 2 years ago

Thanks @stanleybak. Yes, for now it would not be a problem to encode all inputs as reals - I will do that and resubmit.

In general, this could be a good extension, for example, to avoid counterexamples (CEX) that are not realistic. Currently, for NNs with boolean/binary inputs, one would need to check every CEX against integrality constraints on such inputs. In other words, if a tool returns a CEX with a value of, say, 0.4 for some input that can only be 0 or 1, this would be valid for the property spec where every input is Real, but not meaningful for the model, because values strictly between 0 and 1 can never occur.
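
A post-hoc check of this kind is trivial to script, for example along the lines below; the boolean input indices and the tolerance are of course benchmark-specific assumptions.

```python
def cex_respects_booleans(cex, boolean_indices, tol=1e-6):
    """Return True iff every boolean/binary input of the counterexample is
    numerically 0 or 1 (up to `tol`); indices and tolerance are placeholders."""
    return all(min(abs(cex[i]), abs(cex[i] - 1.0)) <= tol for i in boolean_indices)
```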

regkirov commented 2 years ago

@ChristopherBrix @stanleybak Our benchmark includes several NNs with different input sizes. Due to this difference, the properties defined for one NN cannot be run on another NN. As discussed above, we provide an instances.csv file that maps vnnlib specs to the corresponding NNs. However, in the logs I see a mismatch. Does the automated procedure consider the instances.csv file? Is there a naming convention to be followed? If the csv is not considered, any suggestion on how to pass the check while avoiding the mismatch of properties and NNs? Thanks!

ChristopherBrix commented 2 years ago

You're right, instances.csv is currently not used; all properties are checked against every NN. I'll work on that over the weekend and post an update here once it's fixed.

jferlez commented 2 years ago

Hi everyone,

I would like to propose a TLL NN benchmark based on the networks we used in a couple of our recent papers (e.g. FastBATLLNN).

I have a repository with this benchmark, but the automatic validation currently fails with the error OSError: [Errno 28] No space left on device. I suspect the problem is the size of the ONNX networks in the benchmark, since in uncompressed form they occupy ~2.5 GB.

For some additional context: this error occurs while trying to unzip an ONNX network using Python's zipfile package. In an offline conversation, @stanleybak suggested that I use generate_properties.py to unzip each ONNX file as needed during property generation. Since they compress very well, this allows them to fit easily within the standard GitHub file-size limits. (The largest ONNX file is ~300MB uncompressed vs. ~1.5MB zipped.)
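
The on-demand decompression itself is only a few lines with the standard library; roughly something like the sketch below, where the paths and layout are placeholders rather than the exact code in the repository.

```python
import zipfile
from pathlib import Path


def ensure_unzipped(zip_path, out_dir="onnx"):
    """Extract a zipped ONNX model only if it is not already on disk, so only the
    networks actually referenced by the generated instances get unpacked."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            target = out / member
            if not target.exists():
                zf.extract(member, out)
            extracted.append(target)
    return extracted
```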

Assuming there isn't a bug in my code: is the disk space on the instance a hard constraint?

If so -- and assuming there's sufficient interest in this benchmark -- then there are a few ways I can adapt our benchmark to require less disk space. For example, the benchmark is currently configured to generate one property per network, with the number of networks determined solely by the timeout -- I could easily generate multiple properties per network instead, and simply decompress fewer networks. Considering only the smaller network sizes is another possibility: the repository contains many more networks of each size than are matched to properties in the current benchmark.

Of course I'm open to other ideas or modifications to help include this benchmark if there's interest.

Thanks in advance, James

ChristopherBrix commented 2 years ago

@regkirov The instances.csv file is now used to determine which combinations of models and properties to check. At the moment this file is expected to exist in the scripts directory; I'm working on making this configurable. Once that's done, I'll check all benchmarks that have been submitted so far to make sure that their instances.csv file is correct.

I've also fixed a bug with protobuf; a new version caused everything to fail. If you tried to submit in the last few days and got an error message complaining about the protobuf version, please try again.

regkirov commented 2 years ago

@stanleybak Our benchmark (Collins Aerospace) passed the automated check. Following the instructions, I am posting the link here: https://vnncomp.christopher-brix.de/benchmark/details/72

Do you plan to consolidate all benchmarks in a single repo? Is there a hard deadline after which the benchmark materials can no longer be updated (e.g., timeouts, description, models, etc.)?

stanleybak commented 2 years ago

Hi all,

The benchmark submission deadline has been extended by one week to June 7th. Please try to make this deadline, as no further extensions will be possible. The tool submission deadline will also be moved back one week so that participants have the expected amount of time to run their tools on the benchmarks.

vkrish1 commented 2 years ago

Hi everyone, I'm about to propose a few benchmarks (involving collision avoidance and some reinforcement learning benchmark problems) -- what does instances.csv need to contain? I saw some benchmarks posted here that had a third column (in addition to the vnnlib and onnx filenames) -- is that needed? Thanks!

stanleybak commented 2 years ago

@vkrish1 The instances.csv file contains the list of benchmark instances: onnx network, vnnlib spec, and timeout in seconds. The total timeout of all instances should be less than six hours. The generation python script can create the instances.csv file based on a random seed given as a command line argument.
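
As a purely hypothetical illustration of that format (three comma-separated columns, paths relative to the benchmark root, and a seeded choice of which instances to include):

```python
import csv
import random
import sys

if __name__ == "__main__":
    random.seed(int(sys.argv[1]))                  # seed passed on the command line
    with open("instances.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for i in random.sample(range(100), k=30):  # which generated specs to include
            # columns: onnx model, vnnlib spec, per-instance timeout in seconds
            writer.writerow([f"onnx/net_{i % 3}.onnx", f"vnnlib/prop_{i}.vnnlib", 120])
        # 30 instances x 120 s = 3600 s total, well under the six-hour budget
```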

jferlez commented 2 years ago

FYI: the AWS validation instance is suffering from the recent incompatibility bug between tensorflow and protobuf. Basically, protobuf >= 3.20 introduces changes that break TF compatibility, so attempting to import TF will fail. See, e.g., https://github.com/tensorflow/tensorflow/issues/56077 and the release notes for TF 2.9.1: https://github.com/tensorflow/tensorflow/releases.

This appears to prevent import onnx from succeeding as well.

The solution is to use TF==2.9.1/2.8.2/2.6.5 etc., which should install protobuf < 3.20 as a dependency, or manually install protobuf==3.19.3 first, since it won't be upgraded by pip.

I'm also still curious about the disk space available on the instance. Have there been any deliberations about that?

vkrish1 commented 2 years ago

Hi all, we'd like to propose a few benchmarks: a collision-avoidance task and two from OpenAI's Gym (edited to include the two other benchmarks)

  1. Dubins Rejoin - based on the SafeRL Rejoin Task benchmark (paper reference below). The system represents a flight formation problem: a network is trained via reinforcement learning to guide a wingman aircraft to a certain radius around a lead aircraft. The input and output spaces are both 8-dimensional. The 8 network outputs are translated into tuples of actuators for controlling the wingman (rudder, throttle); each discrete-valued actuator has 4 options. The control tuple is determined from the argmax of the first 4 network outputs and the argmax of the second 4, yielding 16 possible combinations. The benchmarks are designed to validate a network output with respect to one or more of the 16 combinations; each combination is a conjunction of 6 terms (i.e., comparing the first actuator's logit to the remaining 3 in the first group and comparing the second actuator's logit to the remaining 3 in the second group). Each benchmark is a disjunction of one or more of these 6-term conjunctions (see the sketch after this list).

Dubins Rejoin environment details: U. Ravaioli, J. Cunningham, J. McCarroll, V. Gangal, K. Dunlap, and K. Hobbs, “Safe reinforcement learning benchmark environments for aerospace control systems,” in 2022 IEEE Aerospace Conference, IEEE, 2022.

  2. CartPole - based on the OpenAI CartPole-v1 benchmark. The control network was trained using the StableBaselines3 library; the input space is 4-dimensional and the output space is binary.

  3. LunarLander - based on the OpenAI LunarLander-v4 benchmark. The control network was trained using the StableBaselines3 library; the input space is 8-dimensional and the output space is 4-dimensional.
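
For clarity, the decoding described in item 1 amounts to the following check on a concrete 8-dimensional output vector; the >= tie-breaking is our own assumption here, and the actual specs encode this disjunction of 6-term conjunctions as set-based verification queries rather than point checks.

```python
import numpy as np


def combo_holds(logits, a, b):
    """6-term conjunction for actuator combination (a, b): output a wins the first
    group of four logits and output b wins the second group of four."""
    first, second = np.asarray(logits[:4]), np.asarray(logits[4:])
    return all(first[a] >= first[k] for k in range(4) if k != a) and \
           all(second[b] >= second[k] for k in range(4) if k != b)


def property_holds(logits, allowed_combos):
    """A property is a disjunction over a subset of the 16 (a, b) combinations."""
    return any(combo_holds(logits, a, b) for a, b in allowed_combos)
```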

stanleybak commented 2 years ago

I'm also still curious about the disk space available on the instance. Have there been any deliberations about that?

@ChristopherBrix Any idea for this? Hopefully it's easy to get a bigger hard drive allocated.

Also: could you help add a pip package for the benchmark generation? I'm trying to finalize a benchmark based on vggnet16, but I need to pip3 install mxnet to generate the benchmarks. The benchmark repo I have is here. Another potential issue is that vggnet is about 500MB, so I'm not including it in the repo, but instead using wget as part of generate_benchmarks.py.

stanleybak commented 2 years ago

Hi all, here's the current list of benchmarks I see from the website:

  1. mnist_fc
  2. Carvana-unet
  3. cifar_biasfield
  4. collins-rul-cnn
  5. SRI_ResNetA / SRI_ResNetB
  6. TLLVerifyBench
  7. dubins-rejoin / cartpole / lunarlander
  8. VGGNET16
  9. NN4SYs

If there are any others, please post a comment here describing your benchmark and try to run it on the submission site. Given the issues with the benchmark submission site (see the previous 3-5 comments), I think we can extend submissions until Friday. If things aren't fixed by then, I'll manually go through the benchmarks and add them.

huanzhang12 commented 2 years ago

@stanleybak We will propose two benchmarks, NN4SYS and a ResNet image classification benchmark.

It seems the disk space of the submission site is full, and we cannot do any testing at this time (even git clone fails). But we will post links to both benchmarks later today.

pomodoromjy commented 2 years ago

@stanleybak The latest version of our benchmark is Carvana-unet-2022 (43) instead of Carvana-unet.

huanzhang12 commented 2 years ago

We propose two benchmarks, one focusing on scaling up models, and the other focusing on applications. The benchmarks have mixed difficulty and include both small and large models.

https://vnncomp.christopher-brix.de/benchmark/details/120 (updated) https://vnncomp.christopher-brix.de/benchmark/details/143 (updated)

The NN4Sys2022 benchmark is an extension of last year's nn4sys benchmark. Besides verifying the learned index task, similar to last year's benchmark, we also added a learned cardinality benchmark with NN-based cardinality estimation. Cardinality estimation aims at estimating the size of the sub-plans of each query to a database, guiding the optimizer in a database system to select the optimal join operations. Detailed descriptions can be found in the benchmark repository: https://github.com/Cli212/VNNComp22_NN4Sys

In the CIFAR100_TinyImageNet_ResNet benchmark, we train a few models of different sizes (small, medium, large, super) on the CIFAR-100 and TinyImageNet datasets. These models are trained with CROWN-IBP and adversarial training. They are a few times larger than last year's benchmarks, but are hopefully still reasonable to handle.

ChristopherBrix commented 2 years ago

The disk size has been increased to 100GB; the benchmark can now be cloned: https://vnncomp.christopher-brix.de/benchmark/details/105

The ONNX bug has been fixed (again, I was aware of that but apparently failed to actually fix it) and mxnet is installed: https://vnncomp.christopher-brix.de/benchmark/details/108

jferlez commented 2 years ago

Thanks @ChristopherBrix! Everything seems to be working for me, and our benchmark passed the validation.

Here's the link for the successful run:

https://vnncomp.christopher-brix.de/benchmark/details/112

piyush-J commented 2 years ago

@stanleybak Hi Stanley, will we be provided with a consolidated benchmarks folder like last year, or will there be a different process to access them?

stanleybak commented 2 years ago

Yes, we'd like to prepare something like last year for the benchmarks, although some of the networks are fairly large, so we'll need to think about how to handle that.

stanleybak commented 2 years ago

As mentioned before, I'm proposing two benchmarks:

  1. The first is based on vggnet16, an ImageNet classification benchmark. The easy instances are single-pixel perturbations with tiny epsilons; the harder ones allow more and more pixels to change (see the sketch after this list). The passing validation for this is here.

  2. The second one is based on networks that compute reachable set probability densities, based on this paper: Meng, Yue, Dawei Sun, Zeng Qiu, Md Tawhid Bin Waez, and Chuchu Fan. "Learning Density Distribution of Reachable States for Autonomous Systems." In Conference on Robot Learning, pp. 124-136. PMLR, 2022. The benchmark run is here.
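
For the first benchmark, the instance generation essentially just picks which pixels are allowed to move. A hypothetical helper is sketched below; the [0, 1] clipping assumes unnormalized pixel values, which may not match the actual preprocessing.

```python
import numpy as np


def pixel_subset_bounds(image, pixel_indices, eps):
    """Input box where only the chosen pixels may vary by +/- eps and every other
    input stays fixed; clipping to [0, 1] assumes unnormalized pixel values."""
    flat = image.flatten().astype(float)
    lower, upper = flat.copy(), flat.copy()
    lower[pixel_indices] = np.clip(flat[pixel_indices] - eps, 0.0, 1.0)
    upper[pixel_indices] = np.clip(flat[pixel_indices] + eps, 0.0, 1.0)
    return lower, upper
```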

alessandrodepalma commented 2 years ago

Hi everyone,

We would like to propose the OVAL21 benchmark from last year's competition. It is composed of medium-sized convolutional networks, tested for adversarial robustness against image-dependent perturbation radii.

The passing tests can be found here: https://vnncomp.christopher-brix.de/benchmark/details/136

Thanks for organizing!

ChristopherBrix commented 2 years ago

I've cleaned up / rewritten large parts of the code I use to test the benchmarks (and soon the toolkits). This was not easily backwards compatible, so their logs/status are no longer shown.

@mnmueller: SRI_ResNet_A (https://vnncomp.christopher-brix.de/benchmark/details/149) and SRI_ResNet_B (https://vnncomp.christopher-brix.de/benchmark/details/164) failed because the referenced .vnnlib files do not exist. Could you please fix that and resubmit the benchmarks? Note that because the paths in instances.csv weren't consistent across benchmarks, I had to add two more form fields. For your current folder structure, you need to enter .. for both (or use the "Use data to populate submission form" link at the bottom of the failed submissions).

All other submitted benchmarks have been tested again and work fine. They've been added to one big repository: https://github.com/ChristopherBrix/vnncomp2022_benchmarks.

@all, please make sure your benchmark is listed here. "Carvana-unet" and "VGGNET16" were submitted multiple times, but never posted here - was that a test, or should they be added?

pomodoromjy commented 2 years ago

Hi @ChristopherBrix, Carvana-unet-2022 was not a test.

We would like to propose benchmark Carvana-unet-2022. https://vnncomp.christopher-brix.de/benchmark/details/43

Currently, most networks evaluated in the literature focus on image classification. However, in many practical scenarios, e.g., autonomous driving, people pay more attention to object detection or semantic segmentation. Considering the complexity of object detection, we propose a new set of simplified Unets (model one: four Conv2d layers, each followed by BN and ReLU; model two: model one plus one AveragePool layer and one TransposedConv upsampling layer). We advocate that tools should handle more practical architectures, and the simplified Unet is a first step towards this goal.

Thank you for organizing.

ChristopherBrix commented 2 years ago

Great, thanks! Could you please update https://github.com/pomodoromjy/vnn-comp-2022-Carvana-unet/blob/main/Carvana-unet_instances.csv to include the .onnx filetype in addition to the model name? Then resubmit based on this form: https://vnncomp.christopher-brix.de/benchmark/resubmit/176

pomodoromjy commented 2 years ago

Hi @ChristopherBrix, we already resubmitted our benchmark on https://vnncomp.christopher-brix.de/benchmark/details/178.

ChristopherBrix commented 2 years ago

Perfect, it was added to the repository.

mnmueller commented 2 years ago

@ChristopherBrix sorry for the delay. I did not notice that the test does not clear all created files before running with a second seed: https://vnncomp.christopher-brix.de/benchmark/details/197 https://vnncomp.christopher-brix.de/benchmark/details/196

Given that some of the benchmarks submitted towards the very end of the submission period contain relatively non-standard architectures, I believe we should extend the tool submission deadline so that all teams have one month to implement any potentially required features to support these architectures. (@stanleybak)

pat676 commented 2 years ago

Hi @huanzhang12

Would it be possible to remove the part of the model related to reshaping at the start of lindex.onnx and lindex_deep.onnx? It does not seem to serve any purpose when the input is already in the correct shape.

piyush-J commented 2 years ago

@stanleybak I agree with @mnmueller. Extending the tool deadline will be helpful.

ChristopherBrix commented 2 years ago

I noticed the timeouts in some benchmarks don't seem right. Last year, we strove for total timeouts per benchmark of 3 to 6 hours. @stanleybak, did we say something about that this year?

@vkrish1 all three of your benchmarks have very small timeouts. Many are just one second. Is this intended?

@pomodoromjy carvana_unet_2022 (https://github.com/ChristopherBrix/vnncomp2022_benchmarks/blob/main/benchmarks/carvana_unet_2022/instances.csv) takes a very long time: the total timeout is 48000 seconds.

pomodoromjy commented 2 years ago

Hi @ChristopherBrix, our benchmark took a very long time because we had too many instances. We have updated the repository https://github.com/pomodoromjy/vnn-comp-2022-Carvana-unet, which now has 64 instances and a total timeout of 19200 seconds. Could you add it to the consolidated repository?

stanleybak commented 2 years ago

@ChristopherBrix VGGNET16 and reach_prob_density should be added as well. The comment where these two are mentioned is here. The vggnet16 network is large (~500MB), so I'm not sure if you'll be able to add it directly to the github repo. Right now the benchmark does a wget to retrieve the network from the onnx model zoo.

Also: you may want to include the generation scripts in these, as we'll select a specific random seed after the tool submission deadline and regenerate all instances based on that.

@vkrish1 I believe the intent was to include cartpole, lunarlander and dubins-rejoin as a single benchmark rather than three different ones, as is done now. Can you refactor this? Also, as mentioned in one of the comments, the total timeout should be around 6 hours, so if you have small timeouts like 1 second, please increase them to make the total around 21600 seconds (6 hours × 60 min/hour × 60 sec/min).

@all In terms of the tool submission deadline: this will be set to one month after the benchmarks were (mostly) finalized. The remaining changes being made to the benchmarks are mostly things like timeouts, and no completely new benchmarks should be proposed at this point. I'll update the rules/website to say July 14th.

Note that benchmarks probably need more than one tool that supports them if you want them to be scored (I think we had this implicit rule last year). It's not much of a competition if only a single tool (likely the proposer's) can analyze the network. Therefore, if there's a simplification you can do, please do it (@huanzhang12, see this comment about a potential simplification).