verivital / vnncomp2024

Verification of Neural Networks Competition Main Repository (VNN-COMP'24)

Benchmark Proposals and Discussion #2

Open ttj opened 3 months ago

ttj commented 3 months ago

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.

The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example, suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code.

The repository must be structured as follows:
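The structure itself is not reproduced in this thread, but based on the files referenced throughout the discussion (generate_properties.py, instances.csv, and the onnx/ and vnnlib/ directories, with instances.csv rows of the form `onnx-file,vnnlib-file,timeout`), a minimal generate_properties.py might look like the sketch below. All file names, instance counts, and timeouts here are illustrative placeholders, not part of the official requirements.

```python
# Sketch of a benchmark generate_properties.py, assuming the layout referenced
# elsewhere in this thread (onnx/, vnnlib/, instances.csv); counts, names, and
# timeouts are illustrative, not official.
import csv
import random
import sys

def main(seed: int) -> None:
    random.seed(seed)  # required so organizers can reproduce the instance selection

    # Hypothetical pool of (network, specification, timeout-in-seconds) triples.
    all_instances = [
        (f"onnx/net_{i}.onnx", f"vnnlib/prop_{i}.vnnlib", 120) for i in range(100)
    ]

    # Randomly select a subset of instances for this year's evaluation.
    selected = random.sample(all_instances, k=20)

    with open("instances.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for onnx_path, vnnlib_path, timeout in selected:
            writer.writerow([onnx_path, vnnlib_path, timeout])

if __name__ == "__main__":
    main(int(sys.argv[1]))
```

It would be invoked as `python generate_properties.py <seed>`, and re-running it with the same seed must reproduce the same instances.csv so the selection can be reproduced at evaluation time.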

ttj commented 3 months ago

As one mechanism to address benchmarks not being available until far too close to the evaluation timeframe, we have solicited benchmark registrations via the Google form registration process. The following are those submitted as of today (4/3) and discussed in the rules/kickoff meeting:

https://github.com/AI4OPT/ml4acopf_benchmark

https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/ -> aggregated into a formal benchmark here: https://github.com/dynaroars/vnncomp-benchmark-generation

https://github.com/loonwerks/vnncomp2023

https://github.com/ANTONIONLP/safeNLP

For any other benchmark proposals, please comment on this issue with a link to the benchmark details

naizhengtan commented 3 months ago

We want to propose NN4SysBench (paper). Here is the repo from last year: https://github.com/Khoury-srg/VNNComp23_NN4Sys

(We will likely add a new application this year. Will update this post when done.)

kollerlukas commented 2 months ago

We would like to propose our benchmark: Verify1s (https://github.com/kollerlukas/cora-vnncomp2024-benchmark).

bwbellmath commented 2 months ago

We updated last year's ml4acopf_benchmark to offer two alternative ReLU-only options for the same models. These are provided as 6 new ONNX files designed to make this benchmark more accessible to teams who have not implemented the sigmoid, sin, or cos activation functions. The new ONNX files are attached and will be included in the previous GitHub repo pending a pull request.

The main change is that each of the sigmoid, sin, and cos nodes has been replaced with a ReLU-based model with error bounded by 10E-3. There is a residual configuration, which takes advantage of residual connections to make periodic approximations of sine and cosine, and a non-residual configuration, which uses minimally complex structures to emulate sine and cosine only within the region from -pi/2 to pi/2.
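For intuition, a piecewise-linear (hence ReLU-representable) approximation of sine with an error check can be sketched as below; the breakpoint count and interval are illustrative and are not taken from the actual ml4acopf models.

```python
import numpy as np

# Piecewise-linear approximation of sin(x) on [-pi/2, pi/2].
# Any continuous piecewise-linear function on an interval can be written as a
# sum of ReLU terms, so this is ReLU-representable; the breakpoint count here
# is illustrative, not the one used in the actual benchmark models.
breakpoints = np.linspace(-np.pi / 2, np.pi / 2, 65)
values = np.sin(breakpoints)

x = np.linspace(-np.pi / 2, np.pi / 2, 100_000)
approx = np.interp(x, breakpoints, values)   # linear interpolation between breakpoints
max_err = np.max(np.abs(approx - np.sin(x)))

print(f"max |pwl - sin| on [-pi/2, pi/2]: {max_err:.2e}")  # on the order of 1e-4 with 65 points
```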

Edit: We did some cleanup and have further eliminated ConstantOfShape, Where, Equal, Expand, Pow, and Neg node types from the linearized models provided within https://github.com/AI4OPT/ml4acopf_benchmark.

Our updated benchmarks offer a contrast between a higher-complexity nonlinear problem, which pushes the boundaries of nonlinear verification, and two linearized models of the same problems, which may produce new constraint violations.

aliabigdeli commented 2 months ago

Hello! We are considering introducing a new benchmark.

Assume we approximate a neural network controller with a piecewise linear model, given as a set of linear models with added noise to account for the local linearization error. The objective of this benchmark is to check whether the neural network output falls within the range obtained from our linear model output plus some uncertainty. Given input $x \in X$ and output $Y = f_{NN}(x)$, the query is of the form: $A_{mat} \times x + b + U_{lb} \leq Y \leq A_{mat} \times x + b + U_{ub}$ for each linear model in its abstraction region.

The neural network controller we used in this benchmark is an image-based controller for an Autonomous Aircraft Taxiing System whose goal is to control an aircraft’s taxiing at a steady speed on a taxiway. This network was introduced in the paper Verification of Image-based Neural Network Controllers Using Generative Models. The neural network integrates a concatenation of the cGAN (conditional GAN) and controller, resulting in a unified neural network controller with low-dimensional state inputs. In this problem, the inputs to the neural network consist of two state variables and two latent variables. The aircraft’s state is determined by its crosstrack position (p) and heading angle error (θ) with respect to the taxiway center line. Two latent variables with a range of -0.8 to 0.8 are introduced to account for environmental changes.

Because in this case the output specification depends on both the input and the output, and considering the VNN-LIB limitation, we added a skip-connection layer to the neural network so that the input values are present in the output space. We also added one linear layer after that to create a linear equation for each local model.
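One way such a construction could look is sketched below in PyTorch; the controller architecture, dimensions, and names are placeholders, not taken from the benchmark's ONNX files. With the input carried through a skip connection and combined by one fixed linear layer, the input-dependent bound $A_{mat} \times x + b + U_{lb} \leq Y \leq A_{mat} \times x + b + U_{ub}$ becomes a plain box constraint on the new output $Y - (A_{mat} \times x + b)$, which VNN-LIB can express directly.

```python
import torch
import torch.nn as nn

class AugmentedController(nn.Module):
    """Wraps a controller f_NN so the new output is y - (A_mat @ x + b).

    The input-dependent specification A_mat*x + b + U_lb <= y <= A_mat*x + b + U_ub
    then becomes a plain box constraint U_lb <= output <= U_ub. Shapes and the
    inner network below are placeholders for illustration only.
    """

    def __init__(self, controller: nn.Module, A_mat: torch.Tensor, b: torch.Tensor):
        super().__init__()
        self.controller = controller
        self.register_buffer("A_mat", A_mat)   # fixed local linear model, not trained
        self.register_buffer("b", b)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.controller(x)                  # original network output
        return y - (x @ self.A_mat.T + self.b)  # skip connection + fixed linear layer

# Toy usage with placeholder dimensions (4 inputs: 2 states + 2 latents, 2 outputs).
controller = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
model = AugmentedController(controller, A_mat=torch.randn(2, 4), b=torch.zeros(2))
print(model(torch.zeros(1, 4)).shape)  # torch.Size([1, 2])
```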

The benchmark repo: https://github.com/aliabigdeli/LinearizeNN_benchmark2024

The idea for this benchmark came from one of our projects, in which we approximated the NN controller with a piecewise linear model and wanted to check whether the neural network output falls within the range obtained from our linear model output plus some uncertainty. However, in most areas of the state space, verification by state-of-the-art tools fails: they could neither verify nor falsify and returned a timeout, even after I increased the timeout to 12 hours.

We would greatly appreciate any feedback, opinions, and suggestions from both competition organizers and participants. Thank you!

hocdot commented 1 month ago

> As one mechanism to address benchmarks not being available until far too close to the evaluation timeframe, we have solicited benchmark registrations via the Google form registration process. The following are those submitted as of today (4/3) and discussed in the rules/kickoff meeting:
>
> https://github.com/AI4OPT/ml4acopf_benchmark
>
> https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/
>
> https://github.com/loonwerks/vnncomp2023
>
> https://github.com/ANTONIONLP/safeNLP
>
> For any other benchmark proposals, please comment on this issue with a link to the benchmark details

@ttj We submitted benchmark generation tools (https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/) but the proposed benchmark is available here (https://github.com/dynaroars/vnncomp-benchmark-generation). Could you please update it?

KaidiXu commented 1 month ago

Thanks @kollerlukas for proposing the Verify1s benchmark. I have a concern regarding the 1-second timeout currently set for the benchmark. The same discussion happened in previous years; please refer to here. In general, based on last year's report (see Table 7), it is evident that most verifiers require more than this 1-second threshold simply for the initial loading of the verifier. This overhead is unavoidable under the current rule that mandates each instance be run in an independent script, starting anew with loading the verifier each time. While batch-mode operation could potentially alleviate this overhead, it appears the competition rules remain unchanged this year, disallowing this approach. To ensure fairness and the practical applicability of the benchmark across different verification methods, I recommend increasing the timeout to at least 30 seconds, as suggested in previous years, which is longer than the overhead for all verifiers. In addition, since the official report subtracts this overhead for all verifiers, the cactus plots show that many instances can be solved in less than 1 second, reflecting each verifier's capabilities.

kollerlukas commented 1 month ago

@KaidiXu Thank you for your feedback. We were unaware that there had already been a discussion about 1s timeouts. If our understanding is correct, the timeout currently does not account for the overhead. We would like the timeout to cover only the verification time, without preparation time and overhead. Is there a possibility of doing this? We could increase the timeout in the vnnlib files, e.g., to 30s, and when scoring, only count instances with a run time within (1s + overhead).

KaidiXu commented 1 month ago

@kollerlukas Yes, you are right. In the end, the official report will subtract the overhead for all verifiers; as you can see in the cactus plots, there are indeed a lot of instances that can be solved in less than 1 second by all verifiers. However, if the vnnlib files specify only a 1-second timeout, then most verifiers will fail. I'm not sure our scoring system can support this post-processing of subtracting the overhead and only counting instances that can be verified in <= 1s.
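For concreteness, the post-processing under discussion could look roughly like the sketch below: measure each verifier's startup overhead, subtract it from the measured runtimes, and count an instance only if the remaining time is within the intended 1-second budget. This is only an illustration of the idea, not the competition's actual scoring code; all numbers are made up.

```python
# Illustration only: not the actual VNN-COMP scoring pipeline.
from typing import Dict, List, Tuple

def count_within_budget(
    results: Dict[str, List[Tuple[str, float, bool]]],  # verifier -> [(instance, runtime_s, solved)]
    overhead_s: Dict[str, float],                        # verifier -> measured startup overhead
    budget_s: float = 1.0,
) -> Dict[str, int]:
    """Count instances solved within `budget_s` after subtracting per-verifier overhead."""
    counts = {}
    for verifier, runs in results.items():
        oh = overhead_s[verifier]
        counts[verifier] = sum(
            1 for _, runtime, solved in runs if solved and (runtime - oh) <= budget_s
        )
    return counts

# Hypothetical numbers, purely for illustration.
results = {"toolA": [("i1", 5.2, True), ("i2", 7.9, True)],
           "toolB": [("i1", 1.4, True), ("i2", 9.0, False)]}
overhead = {"toolA": 4.5, "toolB": 0.6}
print(count_within_budget(results, overhead))  # {'toolA': 1, 'toolB': 1}
```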

ttj commented 1 month ago

Last call for any new benchmarks: please post by tomorrow, 5/17. We plan to create the voting form early next week, solicit feedback on it, and then send it out to tool participants for voting, with a voting response deadline of May 31. The form will also include the benchmarks considered last year. A couple of last year's benchmarks have been re-proposed above in modified form, and a few that were not scored last year have been re-proposed as well, so if proposers want only one version considered, please comment. For continuity and year-to-year progress tracking, we should probably include at least all the scored benchmarks from last year. For reference, last year's benchmarks are here:

https://github.com/ChristopherBrix/vnncomp2023_benchmarks

and their overview/descriptions:

https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1595747102

kollerlukas commented 1 month ago

@KaidiXu We increased the timeout in the instances.csv to 30s. @ttj Is a post-processing step to only count the instances that were solved in (1s + overhead) possible?

wu-haoze commented 1 month ago

We want to re-propose the dist-shift benchmark coming from this paper. Here is the repo: https://github.com/wu-haoze/dist-shift-vnn-comp

ttj commented 1 month ago

> @KaidiXu We increased the timeout in the instances.csv to 30s. @ttj Is a post-processing step to only count the instances that were solved in (1s + overhead) possible?

@ChristopherBrix do you think that is possible?

ttj commented 1 month ago

We have created an initial draft of the Google form for collecting votes (linked below) to solicit feedback on any changes or anything missing. Please do not submit it yet, as more benchmarks may come in today, and we will also finish adding the previous VNN-COMP'23 benchmarks to the form. If you have any feedback on the form or the voting process, please let us know by posting here or in the rules issue; we will post again once the form is close to final, for any last feedback, before soliciting the official votes next week from the tool participants:

https://forms.gle/PQBW7tJeYUMJ9Dm3A

shizhouxing commented 1 month ago

We would like to propose a benchmark for Lyapunov-stable neural control.

There are two models for state feedback and output feedback, respectively. Each model consists of a controller which is a shallow ReLU network, a Lyapunov function which is a quadratic function, and nonlinear operators modelling the dynamics of a 2D quadrotor. The model for output feedback further contains a shallow LeakyReLU network as the observer. The goal is to certify the Lyapunov stability, which has been encoded in the ONNX graphs and VNNLIB specifications.

Repo: https://github.com/shizhouxing/LSNC_VNNCOMP2024
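For readers unfamiliar with the property, what is being certified is a Lyapunov decrease condition along the closed-loop dynamics. The sketch below is a generic numerical illustration with placeholder dynamics and a placeholder quadratic Lyapunov function; it is not the benchmark's actual encoding, which lives in the ONNX graphs and VNNLIB specifications.

```python
import numpy as np

# Generic discrete-time Lyapunov decrease check (illustration only).
# V(x) = x^T P x is a quadratic Lyapunov function; step() is a placeholder for
# the closed-loop dynamics (controller + 2D quadrotor model in the benchmark).
P = np.diag([1.0, 2.0])

def V(x: np.ndarray) -> float:
    return float(x @ P @ x)

def step(x: np.ndarray) -> np.ndarray:
    # Placeholder stable linear dynamics; the real benchmark uses a ReLU
    # controller plus nonlinear quadrotor dynamics.
    A = np.array([[0.9, 0.1], [0.0, 0.8]])
    return A @ x

# Sample points in a region of interest and check that V decreases along trajectories.
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(1000, 2))
violations = sum(V(step(x)) >= V(x) for x in xs if V(x) > 1e-9)
print("Lyapunov decrease violations:", violations)  # 0 for these placeholder dynamics
```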

xiangruzh commented 1 month ago

Hi all,

I would like to repropose the yolo benchmark from last year. The benchmark was included in VNN-COMP 2023 but unscored.

The network of the original YOLOv2 model is replaced with a much smaller network containing 5 residual blocks with conv and ReLU layers, for simplicity of verification. It takes in an image (3 x 52 x 52) and outputs a tensor of shape 125 x 13 x 13 containing the confidence scores, the classification result, and the positions of 5 bounding boxes for each of the 13 x 13 grid cells. Only 5 of the outputs are actually used in verification.

For each instance, the property is verified if the confidence score of at least one bounding box is greater than a preset threshold given any bounded perturbations on the input image (similar to other image robustness benchmarks). Note that here we only use 5 outputs related to confidence scores, even though the output tensor seems huge. An example of the specification to check is shown here:

(image: example YOLO benchmark output specification)
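A rough Python sketch of the disjunctive check described above (the confidence-score indices, threshold, and output layout are assumptions for illustration; the authoritative specification is in the benchmark's .vnnlib files):

```python
import numpy as np

def property_holds(output: np.ndarray, conf_indices, threshold: float) -> bool:
    """Property: at least one of the 5 confidence-score outputs exceeds the threshold.

    `output` is the flattened 125 x 13 x 13 network output; `conf_indices` are the
    five positions of the confidence scores used by the specification (placeholder
    values here), and `threshold` is the preset confidence threshold.
    """
    return bool(np.any(output[list(conf_indices)] > threshold))

# Toy usage with made-up numbers.
out = np.zeros(125 * 13 * 13)
conf_indices = (0, 1, 2, 3, 4)   # placeholder indices, not the real ones
out[2] = 0.9
print(property_holds(out, conf_indices, threshold=0.5))  # True
```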

Here’s the repo: https://github.com/xiangruzh/Yolo-Benchmark

huanzhang12 commented 1 month ago

Glad to know that we will reconsider all 2023 benchmarks! I looked into benchmarks in 2022 and 2021, and there were also many great ones. Here I re-propose the cifar100_tinyimagenet benchmark in VNN-COMP 2022.

Since we have the division of tracks this year, to make the benchmark more accessible and friendly for new tools in this field, I reduced the complexity of this benchmark. I kept only two of the four CIFAR100 models in medium sizes (this will reduce the workload to support this benchmark). The TinyImageNet models/data may be too large for new tools in this field, so I split it into a separate benchmark. I hope at least the cifar100 benchmark (with only 2 onnx files) can be voted into the main track. Cifar100 has the same image size (32x32) as the commonly used Cifar10, and the architecture contains FC, conv, and ReLU layers only. I also removed the largest “super” model, which may pose a challenge for some tools due to too many layers. Hopefully, many tools can support the cifar100 benchmark without too much additional effort.

In 2022, the benchmark had a reproducibility problem during generation, where the JPEG decoded image could be slightly different depending on installed library versions, causing unexpected issues (e.g., the same image index can be sat on one machine but unsat on another machine). So I’ve saved all images into a checkpoint and loaded them directly to avoid this issue.
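The fix described amounts to decoding the JPEGs once, storing the decoded tensors, and loading them at generation time; a minimal sketch of that pattern, with placeholder data and file names, could be:

```python
import torch

# One-time preprocessing: decode the JPEG images once and store the raw tensors,
# so instance generation never depends on the installed JPEG decoder version.
# `decoded_images` stands in for whatever loader was used originally.
decoded_images = torch.rand(100, 3, 32, 32)    # placeholder for decoded CIFAR-100 images
labels = torch.randint(0, 100, (100,))         # placeholder labels
torch.save({"images": decoded_images, "labels": labels}, "images_checkpoint.pt")

# At benchmark-generation time: load the checkpoint directly, bit-for-bit identical
# on every machine, instead of re-decoding JPEGs.
ckpt = torch.load("images_checkpoint.pt")
images, labels = ckpt["images"], ckpt["labels"]
```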

Here are the re-proposed benchmarks (I split the Cifar100 and TinyImageNet models into two benchmarks as mentioned above):

https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark
https://github.com/huanzhang12/vnncomp2024_tinyimagenet_benchmark

ttj commented 1 month ago

Thanks very much, everyone. We have updated the voting Google form (linked below) with all the proposed benchmarks, along with those from last year. Some of these are duplicated by new/revised proposals above; we have put everything in the form, so if benchmark proposers would like only one version considered (e.g., a couple were proposed last year but not scored), please post below which version to keep in the form:

https://forms.gle/PQBW7tJeYUMJ9Dm3A

Please take a look and let us know of any suggested edits before we open voting. Please do not submit the form yet until we post again saying voting is open, in case of any final revisions.

Votes will be due from tool participants by May 31, so please send any feedback on the form/voting process ASAP, and at the latest by this Friday, 5/24, so that tool participants will have at least a week to submit votes.

We will also email the mailing list with this update in case anyone is not actively tracking the GitHub issue.

bwbellmath commented 1 month ago

Ml4acopf only needs the github link, the zip file is redundant now, thanks!

ttj commented 1 month ago

Thanks @bwbellmath, I have updated the form. Final reminder for any feedback on the form or if we missed any benchmarks, etc.

KaidiXu commented 1 month ago

Hi @ttj, I found that https://github.com/ANTONIONLP/safeNLP contains 10812 instances with a 120s timeout for each of them, which is roughly a 360h timeout in total.

huanzhang12 commented 1 month ago

@ttj It looks like some benchmarks have duplicated entries in the current voting form:

yolo by @xiangruzh: https://github.com/xiangruzh/Yolo-Benchmark
dist-shift by @wu-haoze: https://github.com/wu-haoze/dist-shift-vnn-comp
nn4sys by @naizhengtan: https://github.com/Khoury-srg/VNNComp23_NN4Sys

These are re-proposed benchmarks with no change compared to last year, and their GitHub repos have not changed since last year. Shall we list them only once in the voting form, as part of the VNN-COMP 2023 benchmarks? It might cause confusion if one benchmark gets votes under two duplicated entries (e.g., one entry is voted into the core track and the other entry is not).

Hongji1001 commented 1 month ago

Hi @ttj, I found that the generate_properties.py file is missing in https://github.com/ANTONIONLP/safeNLP.

ttj commented 1 month ago

Thanks @KaidiXu and @Hongji1001. @KatyaKom, @ANTONIONLP, and @Tgl70: can you please see the discussion above regarding your benchmark? It will need a little modification. Note that the maximum benchmark time is described in the rules (6 hours), and you need the generate_properties file. I presume the way yours is set up, you can modify it to select just a subset of the instances and have generate_properties pick random test samples/instances. Please ensure you follow the guidelines regarding setting a seed so we can reproduce the selection at evaluation time.

Here is a representative generate_properties (thanks @huanzhang12), but you can also look at other benchmarks for examples; I presume you already have something like this somewhere, given what is in your repository with the instances:

https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark/blob/main/generate_properties.py

There may have been some confusion, as what is stored in the benchmarks repository from the last iteration is the output of some of the benchmarks (e.g., just the instances produced by generate_properties for what was actually evaluated); we will try to do a better job tracking these in the future to keep things together.

@huanzhang12 thanks: yes, I will merge some of these (I had asked above, but none of the benchmark proposers responded, so it was not clear whether they are exactly the same or not).

I will post later today about opening voting on the Google form with the modifications. We will extend voting to Monday, 6/3, given the few updates, but will not extend past that, so please let us know of any final updates ASAP.

ttj commented 1 month ago

I have merged nn4sys and dist-shift duplicates in the form

There were 3 versions of YOLO last year; I cannot figure out which one is the same. Can someone please clarify, @xiangruzh?

Which one is https://github.com/xiangruzh/Yolo-Benchmark equivalent to?

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cctsdb_yolo

or

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo

The other one was by Collins, so I could identify that one. From the files, I think it is the one below, but it would be good to have confirmation; I will update the form to use this one:

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo

xiangruzh commented 1 month ago

@ttj right, it's https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo

ttj commented 1 month ago

great, thanks @xiangruzh , we have updated the voting form

Given no other comments at this time, let's go ahead and open voting on the Google form. Please coordinate with your team and submit your votes by Monday, June 3, AOE at the latest; we will post the results and the subdivision into the 2 tracks once they are in.

https://forms.gle/PQBW7tJeYUMJ9Dm3A

One team had already submitted votes before these final changes to the form; I saved a copy and can send it in case it is useful, but given the changes merging a few benchmarks, please do resubmit. Our apologies for the inconvenience.

Tgl70 commented 1 month ago

Hi @ttj, sorry I've just seen the discussion.

> I presume the way yours is set up, you can modify it to select just a subset of the instances and have generate_properties pick random test samples/instances. Please ensure you follow the guidelines regarding setting a seed so we can reproduce the selection at evaluation time.

Yes, you are correct. I will modify it to fit within the 6-hour timeout (and follow the guidelines). I will also add the generate_properties.py file.

shizhouxing commented 1 month ago

Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance used for benchmark generation in this competition (see https://github.com/verivital/vnncomp2024/issues/2#issue-2221794616).

Additionally, we have tried generating the benchmark locally on our GPU server. However, for the CIFAR-10 and CIFAR-100 parts of the benchmark, it was taking an extremely long time and was still not able to find any valid instance during the filtering process after several days. Could you please check the benchmark to see whether you can successfully generate the instances in a new environment, or whether there is anything that should be fixed?

Thanks.

mldiego commented 1 month ago

Adding to @shizhouxing's comment: is the generate_properties.py file allowed to depend on any of the tools used in the competition? It seems the properties are filtered by some "simple heuristic" determined by verification tools (NeuralSAT, CROWN). This could lead to unfair property generation...

Tgl70 commented 1 month ago

Hi @ttj, I have uploaded the generate_properties.py file. Also, it now generates the instances.csv file with a 20s timeout per query and 1080 random queries (instead of the full 10812 queries), using the seed required for reproducibility. Let me know if I need to modify it further or if it's ok now.

ttj commented 1 month ago

Thanks everyone, a reminder to finish voting by today, June 3; some teams have not submitted votes yet, and they are due today AOE (we will aim to post results tomorrow).

Regarding the discussion above about a couple of benchmarks and edits: if there are any final edits/revisions to benchmarks now that people have looked at them in detail for voting, please finalize them ASAP.

ttj commented 1 month ago

We have compiled the benchmark votes in this spreadsheet:

https://docs.google.com/spreadsheets/d/1SYSnFhi9UzjBAAsxQp6k3Ox0YJn4vG8YjL6-0M7BvUI/edit?usp=sharing

The results are compiled in that spreadsheet and are as follows; please let us know if there are any mistakes:

Regular Track Benchmarks

https://github.com/ANTONIONLP/safeNLP
https://github.com/Khoury-srg/VNNComp23_NN4Sys and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/nn4sys
https://github.com/kollerlukas/cora-vnncomp2024-benchmark
https://github.com/aliabigdeli/LinearizeNN_benchmark2024
https://github.com/wu-haoze/dist-shift-vnn-comp and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/dist_shift
https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark
https://github.com/huanzhang12/vnncomp2024_tinyimagenet_benchmark
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/acasxu
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/collins_rul_cnn
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/metaroom
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/tllverifybench

Extended Track Benchmarks

https://github.com/AI4OPT/ml4acopf_benchmark
https://github.com/dynaroars/vnncomp-benchmark-generation
https://github.com/loonwerks/vnncomp2023
https://github.com/shizhouxing/LSNC_VNNCOMP2024
https://github.com/xiangruzh/Yolo-Benchmark and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cctsdb_yolo
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/collins_yolo_robustness
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vggnet16
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vit

Unscored Benchmarks

All benchmarks received at least one vote, so there are no unscored benchmarks; however, note that several benchmarks received only one vote in total.

Jubengo commented 2 weeks ago

Hi all, we were wondering what the exact timeline is for the next steps:

  • availability of the repository compiling all 2024 benchmarks
  • availability of the submission tool
  • final submission

shizhouxing commented 2 weeks ago

> Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance used for benchmark generation in this competition (see #2 (comment)).
>
> Additionally, we have tried generating the benchmark locally on our GPU server. However, for the CIFAR-10 and CIFAR-100 parts of the benchmark, it was taking an extremely long time and was still not able to find any valid instance during the filtering process after several days. Could you please check the benchmark to see whether you can successfully generate the instances in a new environment, or whether there is anything that should be fixed?
>
> Thanks.

Hi @hocdot @ttj @ChristopherBrix, is there any update regarding the benchmark generation issue? As @Jubengo said, the repository compiling all the benchmarks and the submission tool are still not available for this year, so the validity of the benchmarks has not been fully confirmed. Can we have these updated soon? Thanks!

ttj commented 2 weeks ago

> Hi all, we were wondering what the exact timeline is for the next steps:
>
> • availability of the repository compiling all 2024 benchmarks
> • availability of the submission tool
> • final submission

We are finishing up the submission system setup and will update soon (some AWS things needed to be finished); an issue with preliminary details has been created here: #3

Re the deadlines: we will push the deadline back one week, to July 5, from what is posted on the website https://sites.google.com/view/vnn2024, and will finish the submission system readiness ASAP. It is possible we will have a little more time to extend, but we need time to run everything in advance of SAIV (and given the number of benchmarks, quite some compute time may be necessary).

shizhouxing commented 5 days ago

> Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance used for benchmark generation in this competition (see #2 (comment)). Additionally, we have tried generating the benchmark locally on our GPU server. However, for the CIFAR-10 and CIFAR-100 parts of the benchmark, it was taking an extremely long time and was still not able to find any valid instance during the filtering process after several days. Could you please check the benchmark to see whether you can successfully generate the instances in a new environment, or whether there is anything that should be fixed? Thanks.
>
> Hi @hocdot @ttj @ChristopherBrix, is there any update regarding the benchmark generation issue? As @Jubengo said, the repository compiling all the benchmarks and the submission tool are still not available for this year, so the validity of the benchmarks has not been fully confirmed. Can we have these updated soon? Thanks!

Hi @hocdot @ttj @ChristopherBrix, it has been more than a month and there is still no update regarding the validity of this benchmark. Is it still going to be included in, or excluded from, the competition?

Hi @kollerlukas @sdemarch, since you voted for this benchmark, were you able to generate it, or did you make any modifications on your own?

Thanks!

ChristopherBrix commented 4 days ago

Sorry, I've finally got the website up and running again. I'm currently working my way through all benchmarks. I'll post an update today.

You'll have to create new accounts at https://vnncomp.christopher-brix.de/. I'll enable everyone this evening as soon as tool submission can start.

ChristopherBrix commented 4 days ago

I've uploaded almost all benchmarks at https://github.com/ChristopherBrix/vnncomp2024_benchmarks

Some benchmarks could not be generated correctly:

cora-vnncomp2024-benchmark: The MNIST dataset cannot be downloaded from yann.lecun.com. This seems to be a common issue (https://stackoverflow.com/questions/66577151/http-error-when-trying-to-download-mnist-data) with several alternatives. Could you please update your benchmark accordingly?
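One possible workaround, assuming torchvision is acceptable as a dependency for the benchmark's setup script, is to let torchvision fetch MNIST, since recent versions download from mirror URLs rather than from yann.lecun.com directly:

```python
# Possible workaround (assumes torchvision is acceptable as a dependency):
# recent torchvision versions fetch MNIST from mirror URLs instead of
# relying on yann.lecun.com, which frequently rejects direct downloads.
from torchvision import datasets

train_set = datasets.MNIST(root="./data", train=True, download=True)
test_set = datasets.MNIST(root="./data", train=False, download=True)
print(len(train_set), len(test_set))  # 60000 10000
```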

cgan: Apparently, large files can no longer be downloaded directly from Google Drive. Instead, one has to manually confirm the download even though the file couldn't be checked for viruses. Can you upload the files somewhere else? In the meantime, I'll just copy the exact benchmark from last year, but this way we cannot regenerate the instances with different seeds.

cctsdb_yolo: The benchmark repository is no longer online. I'll reuse last year's benchmark set, but we won't be able to generate new instances with new seeds.

vnncomp-benchmark-generation: As mentioned by @shizhouxing, setting up this benchmark requires quite some additional work. I haven't been able to get that done yet.

aliabigdeli commented 4 days ago

@ChristopherBrix For cgan, could we use the gdown library? It can solve that problem.
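For reference, a minimal gdown-based download could look like the sketch below; the file ID is a placeholder, not the actual cgan model ID:

```python
# Minimal example using gdown (pip install gdown); the file ID is a placeholder.
import gdown

url = "https://drive.google.com/uc?id=FILE_ID_PLACEHOLDER"
gdown.download(url, "cgan_model.onnx", quiet=False)
```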

ttj commented 4 days ago

thanks everyone!

For https://github.com/dynaroars/vnncomp-benchmark-generation: if the issues cannot be resolved very soon, we will likely need to exclude it from scoring. Adding other GitHub IDs that I think are related, as there has been no response on it yet: @edwardxu0 @dynaroars @hocdot

we have extended the deadline for tool submission to July 12 (see https://github.com/verivital/vnncomp2024/issues/3#issuecomment-2209660333 )

ChristopherBrix commented 3 days ago

@aliabigdeli Yes, please add a requirements.txt file to your root. It will automatically be used to install all specified pip packages.

Tool submission is open! Please try to submit your tools. If you encounter any issues, feel free to ping me here or write me an email (brix@cs.rwth-aachen.de). Currently, AWS has only granted us limited resources, so if you pick the p3 or g5 instance, you might have to wait a bit in the queue before your submission is evaluated. If you just want to make sure that your installation works as intended and don't care about runtime performance yet, please use m5 instances for now; we have more capacity there. I'm working with AWS to resolve this and will post an update once it's done. I've enabled all accounts.

shizhouxing commented 3 days ago

Thanks for uploading the benchmarks, @ChristopherBrix.

large_models.zip is missing in setup.sh (2023 version for reference), so some models are missing for vggnet16_2023, nn4sys_2023, and cgan_2023.

Also, in safenlp, there are subdirectories in onnx/ and vnnlib/, but they are not reflected in instances.csv.

Tgl70 commented 2 days ago

Hi @shizhouxing, that shouldn't happen (re safeNLP). When I run generate_properties.py (https://github.com/ANTONIONLP/safeNLP/blob/main/generate_properties.py), I get the correct instances.csv with entries of the form: onnx/ruarobot/perturbations_0.onnx,vnnlib/ruarobot/hyperrectangle_2903.vnnlib,20

Is it possible that the older version was used instead?

shizhouxing commented 2 days ago

Hi @Tgl70, I saw the safenlp benchmark uploaded to ChristopherBrix/vnncomp2024_benchmarks has incorrect entries in instances.csv: https://github.com/ChristopherBrix/vnncomp2024_benchmarks/blob/main/benchmarks/safenlp/instances.csv

Tgl70 commented 2 days ago

@shizhouxing you're right! It should just be a matter of re-running generate_properties.py and substituting instances.csv. @ChristopherBrix let me know if any problem arises!

ChristopherBrix commented 1 day ago

I've added the large models from last year. I couldn't really test it yet (will do so tomorrow), but it should work.

@Tgl70 The incorrect instances.csv file was caused by some postprocessing on my end. Unfortunately, something in my CI pipeline is currently broken, and I cannot push an update to the server right now. Once that's fixed, I'll regenerate the benchmark.

shizhouxing commented 1 day ago

@ChristopherBrix There are also duplicate benchmarks in the repository:
https://github.com/ChristopherBrix/vnncomp2024_benchmarks/tree/main/benchmarks/collins_yolo_robustness_2023
https://github.com/ChristopherBrix/vnncomp2024_benchmarks/tree/main/benchmarks/collins_aerospace_benchmark

ChristopherBrix commented 13 hours ago

The cora benchmark has been added. The large models should now be downloaded correctly. The safenlp benchmark is fixed. The duplicated collins benchmark has been removed.

mldiego commented 4 hours ago

@aliabigdeli

I checked the LinearizeNN ONNX models when they were first proposed, and there was only one, which we could easily support. However, I tested on the submission site and there appear to be several models that were not available before (and that apparently MATLAB does not support). How are these models different from the original one: https://github.com/aliabigdeli/LinearizeNN_benchmark2024/tree/main/onnx ?

Is there a way to generate the new models similarly to this one for easier support?

I am aware that we voted for this benchmark to be included in the regular track, but generating unseen new models last week, right before submission (models different from the ones previously available), does not seem fair. So, if the models cannot be converted to ONNX like the original one and participants are having issues supporting these new models, I suggest we move it to the extended track or only consider it in future competitions.