stanleybak / vnncomp2023

Fourth edition of VNN COMP (2023)

Benchmark discussion #2

Open stanleybak opened 1 year ago

stanleybak commented 1 year ago

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
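
For illustration, a minimal sketch of such a seeded generation script; the file names, CSV columns, and parameters below are assumptions, not the official template:

```python
# Hypothetical sketch of a seeded instance generator; the file names, CSV
# columns, and parameters are assumptions, not the official template.
import csv
import random
import sys

def generate(seed: int, n_instances: int = 30, timeout: int = 300) -> None:
    rng = random.Random(seed)
    rows = []
    for _ in range(n_instances):
        image_id = rng.randrange(10_000)          # e.g., which test image to perturb
        epsilon = rng.choice([1 / 255, 2 / 255])  # e.g., perturbation radius
        vnnlib = f"vnnlib/prop_{image_id}_eps_{epsilon:.4f}.vnnlib"
        # ... write the actual VNNLIB property for (image_id, epsilon) here ...
        rows.append(("onnx/classifier.onnx", vnnlib, timeout))
    with open("instances.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)

if __name__ == "__main__":
    generate(int(sys.argv[1]))
```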

The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:
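
As an illustration only (the exact file and folder names here are assumptions, based on the instances.csv, onnx, and vnnlib files referenced later in this thread), a typical layout looks roughly like:

```
<benchmark_name>/
    onnx/                    # .onnx networks (possibly gzipped)
    vnnlib/                  # .vnnlib property files
    instances.csv            # one row per instance: onnx file, vnnlib file, timeout
    generate_properties.py   # regenerates vnnlib/ and instances.csv from a random seed
```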

Update: benchmark submission deadline extended to June 2 (was May 29).

merascu commented 1 year ago

Thanks! Somehow, we were not aware of this file. Is it possible to see whether any of the instances were solved within 5 minutes?

pomodoromjy commented 1 year ago

@ChristopherBrix Thanks for the update. We have changed the timeout setting to 350s per instance, thus the total timeout would be larger than 3 hours.

Z-Haoruo commented 1 year ago

@ChristopherBrix Thank you! The total timeout is set to be above 3 hours for our benchmark ml4acopf.

ChristopherBrix commented 1 year ago

The benchmarks CCTSDB-YOLO and ml4acopf have been updated in the repository. They can now be used by submitted tools. Thank you @pomodoromjy, @Z-Haoruo

@merascu: How long each instance takes to run will depend on the tools - the total timeout (sum of all timeouts of all instances in the instances.csv file) should be at most 6 hours. That way, even if a tool has a timeout on every single instance, it still finishes reasonably fast.
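
A quick sanity check along these lines (a sketch, assuming the common three-column instances.csv layout of onnx file, vnnlib file, timeout in seconds):

```python
# Sketch: sum the per-instance timeouts in instances.csv and compare them
# against the 6-hour budget (assumes columns: onnx file, vnnlib file, timeout).
import csv

with open("instances.csv") as f:
    total = sum(float(row[2]) for row in csv.reader(f) if row)

print(f"total timeout: {total:.0f}s ({total / 3600:.2f}h)")
assert total <= 6 * 3600, "total timeout exceeds the 6-hour limit"
```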

HanjiangHu commented 1 year ago

@mldiego The ONNX and VNNLIB files without the custom op have now been uploaded here. The input of the ONNX model is the image, and the VNNLIB spec is also at the pixel level after the custom projection op, as the description shows. Let me know if there are any problems.

@ChristopherBrix Thanks for the update. We have changed the timeout to 210s per instance to make the total timeout less than 6 hours.

apostovan21 commented 1 year ago

@ChristopherBrix Thanks for the update! We have reduced the number of instances to 60, with a 300s timeout for each, so the total time should be about 5 hours.

https://github.com/apostovan21/vnncomp2023/tree/master

ttj commented 1 year ago

We aim to finalize the scored benchmarks by next Friday, 6/16 AOE, so that tool participants may test, etc. Please post here if you are nominating a benchmark for official scoring, and refer to the guidelines/rules for further details if needed, or let us know if you have any questions.

ChristopherBrix commented 1 year ago

The metaroom benchmark has been updated.

@apostovan21 Please make --new_instances the default for your script, or remove the default content of the instances.csv file; otherwise those entries become part of the generated benchmark and make the total timeout too large.

wu-haoze commented 1 year ago

@ChristopherBrix Thanks! I've updated the benchmarks as discussed offline.

apostovan21 commented 1 year ago

@ChristopherBrix I've updated the benchmarks, I hope it's fine now!

ChristopherBrix commented 1 year ago

Confirmed, it's updated and works!

Neelanjana314 commented 1 year ago

Most of this year's benchmarks use some ONNX operators that are not directly supported by MATLAB, so we would like to nominate 2 of last year's benchmarks:

  1. AcasXu
  2. Collins-RUL (https://github.com/loonwerks/vnncomp2022)

@ChristopherBrix: Do we need to add anything for the submission site?

To help us load the models in MATLAB, would it be possible to save the ONNX models in an earlier version, e.g., opset 13? We tried loading the models with onnx (Python) and saving them at an earlier opset version, but that didn't work for many of them. Could we get the original training files (PyTorch, TF) or the models at opset 13? That could help us avoid some of the issues we are having loading the models in MATLAB.
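
For reference, a minimal sketch of the kind of conversion attempted, assuming onnx's built-in version converter (file paths are placeholders):

```python
# Downgrade an ONNX model to opset 13 using onnx's built-in version converter;
# as noted above, this failed for several of this year's models.
import onnx
from onnx import version_converter

model = onnx.load("model.onnx")  # placeholder path
converted = version_converter.convert_version(model, 13)
onnx.checker.check_model(converted)
onnx.save(converted, "model_opset13.onnx")
```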

stanleybak commented 1 year ago

I'm happy to nominate the two from our group:

cGAN - https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1545961792 (@feiyang-cai)
VGGNET2023 - https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1540648827 (@stanleybak)

Neelanjana314 commented 1 year ago

@ChristopherBrix Could you please activate my account?

ChristopherBrix commented 1 year ago

@Neelanjana314 Done! I'm also working on setting up email notifications for this, to reduce how long it takes me to spot new accounts.

ttj commented 1 year ago

All: reminder to nominate benchmarks for scoring by tomorrow. Currently I count only 4 nominated by @stanleybak and @Neelanjana314 unless I missed any. Reminder that tool participants may nominate up to 2 benchmarks to be scored.

Z-Haoruo commented 1 year ago

Hi, we would like to nominate the benchmark ml4acopf from our group.

shizhouxing commented 1 year ago

@ChristopherBrix In the cgan benchmark, there is a cGAN_imgSz32_nCh_3_small_transformer.onnx model in instances.csv, but this model seems to be missing in the onnx folder?

feiyang-cai commented 1 year ago

> @ChristopherBrix In the cgan benchmark, there is a cGAN_imgSz32_nCh_3_small_transformer.onnx model in instances.csv, but this model seems to be missing in the onnx folder?

I have reviewed my own repo and can confirm the presence of this model file. I believe the file was lost while pushing it into the final repo. @ChristopherBrix, could you please verify?

wu-haoze commented 1 year ago

@ChristopherBrix These two networks from the nn4sys benchmark are git-lfs objects: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d_dual.onnx.gz

The following command has no effect: git lfs pull --include="mscn_2048d.onnx"

Could some guidance be provided about how to download them?

KaidiXu commented 1 year ago

Hi @regkirov,

Thanks for your effort in proposing the Collins-YOLO-robustness benchmark. I have been attempting to load your model and vnnlib; however, I have encountered two issues that require attention.

  1. The declaration of x ends at 409599; however, the range of x consists of 1228799 pairs (I guess this is due to the missing channel dimension in the inputs).

  2. The verification property is organized as assert (or (and ...) (and ...) ...), which means that if any property within the or() statement is satisfied, the verification result is "sat" (counterexample found); see the sketch below. This appears to contradict the description provided in the benchmark_description.pdf document found here. This issue is serious, since it makes the benchmark trivial (one can easily use a clean input as a "counterexample").
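
A minimal sketch (not any tool's actual code) of the semantics described in point 2:

```python
# With assert (or (and ...) (and ...) ...), the spec is violated ("sat",
# counterexample found) as soon as ANY single disjunct is fully satisfied.
def spec_violated(disjuncts, y):
    # disjuncts: a list of conjunctions; each conjunction is a list of
    # predicates over the network output y
    return any(all(pred(y) for pred in conj) for conj in disjuncts)

# If one disjunct encodes "the clean prediction stays unchanged", then any
# unperturbed input already satisfies it, making the instance trivially sat.
```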

Considering that the deadline is coming up, if these issues can be fixed very soon, it is still possible for participants to analyze the benchmark and make informed voting decisions. Otherwise, it will be very hard to determine whether the current version is suitable for this year's competition.

Neelanjana314 commented 1 year ago

@ChristopherBrix @stanleybak @ttj Today being the final day for the benchmark decision, should I assume the nominated benchmarks are confirmed? Or should I wait until EOD for updates from the organizers?

xiangruzh commented 1 year ago

@ChristopherBrix Could you help us update the YOLO benchmark (https://github.com/xiangruzh/Yolo-Benchmark)?

We fixed issues in the vnnlib files: there was a bug with the tensor flattening order. In addition, we also made the model smaller (still the same architecture) and reduced the number of properties to verify, so hopefully more tools can run our benchmark. Thanks!
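
For illustration (with made-up shapes; this is not the YOLO benchmark's actual code), the kind of mismatch a flattening-order bug causes: row-major and column-major flattening map the same tensor element to different vnnlib variable indices.

```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # e.g., a small (C, H, W) input
print(x.flatten(order="C")[:6])  # [0 1 2 3 4 5]       -> X_0..X_5 in row-major order
print(x.flatten(order="F")[:6])  # [ 0 12  4 16  8 20] -> a different index-to-pixel mapping
```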

regkirov commented 1 year ago

> Hi @regkirov,
>
> Thanks for your effort in proposing the Collins-YOLO-robustness benchmark. I have been attempting to load your model and vnnlib; however, I have encountered two issues that require attention.
>
> 1. The declaration of x ends at 409599; however, the range of x consists of 1228799 pairs (I guess this is due to the missing channel dimension in the inputs).
> 2. The verification property is organized as assert (or (and ...) (and ...) ...), which means that if any property within the or() statement is satisfied, the verification result is "sat" (counterexample found). This appears to contradict the description provided in the benchmark_description.pdf document found here. This issue is serious, since it makes the benchmark trivial (one can easily use a clean input as a "counterexample").
>
> Considering that the deadline is coming up, if these issues can be fixed very soon, it is still possible for participants to analyze the benchmark and make informed voting decisions. Otherwise, it will be very hard to determine whether the current version is suitable for this year's competition.

Hi @KaidiXu - thanks for pointing out these issues. We will check ASAP. I will keep you posted.

huanzhang12 commented 1 year ago

We (the alpha-beta-CROWN team) nominate two benchmarks from our team:

  1. ViT
  2. NN4sys 2023

Many other proposed benchmarks are excellent, such as the metaroom and the YOLO benchmarks. These are exciting novel applications of neural network verification, and I hope the organizers can figure out a way to include all (working) NEW benchmarks (e.g., asking other participants to nominate these benchmarks). We hope to encourage benchmark proposers to work on novel applications and new benchmarks each year.

> For it to count as a scored benchmark we'd want at least two tools that support it.

@stanleybak Is this rule new this year? I might have missed something, but I couldn't find it in the rules documentation. I think this can discourage teams from supporting new benchmarks, because if one tool is the only tool supporting some new benchmark, that team gets no reward for their efforts. We do hope to see tools become more general and support more benchmarks in different domains, and the competition should hopefully provide incentives for this goal.

stanleybak commented 1 year ago

@huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.

regkirov commented 1 year ago

@KaidiXu After some quick investigation of Collins-YOLO-robustness I am responding to your points:

  1. This is now fixed.
  2. Indeed so, good catch. I have a question. We have a huge number of output constraints, because we put a constraint on every element of every output tensor. In fact, we only need to put constraints on very few outputs (the concrete bounding box and its predicted class). However, when we tried to use some of the VNNCOMP tools, we always got errors saying that all output variables must be constrained. The issue with the verification being trivial comes from setting up the property negation on the outputs that must remain unchanged. Can we omit constraints on all such outputs? I can implement this quickly, but I need confirmation from some tool provider that this can be done (constraining only selected output Y variables, not all).

Thanks a lot for the feedback!

regkirov commented 1 year ago

@Neelanjana314 @ChristopherBrix I see that the Collins-RUL-CNN benchmark from 2022 is being nominated. Is it going to be used? If yes, where do you plan to take it from? Let me know, because I may need to implement a small fix there before you proceed. Thanks.

ttj commented 1 year ago

> @huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.

I think we need to discuss this as organizers. We have never scored a benchmark supported by only 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool.

shizhouxing commented 1 year ago

> @huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.
>
> I think we need to discuss this as organizers. We have never scored a benchmark supported by only 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool.

I think participants need to know which benchmarks will be scored ahead of the submission deadline. Otherwise some participants may unnecessarily prioritize unscored benchmarks.

ttj commented 1 year ago

> @huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.
>
> I think we need to discuss this as organizers. We have never scored a benchmark supported by only 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool.
>
> I think participants need to know which benchmarks will be scored ahead of the submission deadline. Otherwise some participants may have their efforts wasted on unscored benchmarks.

In essence, it is possible this may not be known until tools are submitted and scoring is done: if no other tool supports it, the benchmark would have been nominated but unscored. In practice this has not happened, though; in all iterations of the competition at least 2 tools have analyzed every benchmark. So it is fine for it to be nominated now, but know that this could be the outcome.

huanzhang12 commented 1 year ago

> @huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.
>
> I think we need to discuss this as organizers. We have never scored a benchmark supported by only 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool.

I also agree with @stanleybak it is best to stick with the current rules document. Next year we can potentially discuss more about whether to add a new rule about it.

@ttj Actually, I believe every benchmark was already evaluated against the baseline randgen tool, since it is part of the submission process @ChristopherBrix built. So technically, as long as one other tool supports it, there are two tools supporting this benchmark. I hope this makes sense.

Honestly, I think this rule is a bit tricky because technically, one can create a dummy tool that claims to support all benchmarks but simply produces random outcomes, or a tool like randgen (which is totally legit), making this rule ineffective. In addition, this rule may discourage teams from supporting new benchmarks, as I mentioned above.

> In essence, it is possible this may not be known until tools are submitted and scoring is done: if no other tool supports it, the benchmark would have been nominated but unscored. In practice this has not happened, though; in all iterations of the competition at least 2 tools have analyzed every benchmark. So it is fine for it to be nominated now, but know that this could be the outcome.

For the sake of fairness, I think everyone needs to know which benchmarks will be scored before scoring. Many organizers are also participating in the competition, so if they have the right to remove benchmarks after scoring, they would have an additional advantage. Of course, one exception is that it is reasonable to remove the benchmarks supported by only randgen and no other tools, since it would not affect the scores of any team.

ttj commented 1 year ago

> @huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.
>
> I think we need to discuss this as organizers. We have never scored a benchmark supported by only 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool.
>
> I also agree with @stanleybak it is best to stick with the current rules document. Next year we can potentially discuss more about whether to add a new rule about it.
>
> @ttj Actually, I believe every benchmark was already evaluated against the baseline randgen tool, since it is part of the submission process @ChristopherBrix built. So technically, as long as one other tool supports it, there are two tools supporting this benchmark. I hope this makes sense.
>
> Honestly, I think this rule is a bit tricky because technically, one can create a dummy tool that claims to support all benchmarks but simply produces random outcomes, or a tool like randgen (which is totally legit), making this rule ineffective. In addition, this rule may discourage teams from supporting new benchmarks, as I mentioned above.
>
> In essence, it is possible this may not be known until tools are submitted and scoring is done: if no other tool supports it, the benchmark would have been nominated but unscored. In practice this has not happened, though; in all iterations of the competition at least 2 tools have analyzed every benchmark. So it is fine for it to be nominated now, but know that this could be the outcome.
>
> For the sake of fairness, I think everyone needs to know which benchmarks will be scored before scoring. Many organizers are also participating in the competition, so if they have the right to remove benchmarks after scoring, they would have an additional advantage. Of course, one exception is that it is reasonable to remove the benchmarks supported by only randgen and no other tools, since it would not affect the scores of any team.

I am on my phone, so it is getting hard to reply as this gets longer. On the latter point: the organizers can participate but not win, per the rules, which are set up that way for obvious COI reasons.

We will discuss and clarify over the weekend. Of course, the way the rules are set up, supporting as many benchmarks as possible is incentivized, as it increases the maximum score, so I assume this discussion will likely be moot, as it has been in past iterations, and all benchmarks will be scored, since at least 2 tools will likely consider each of them.

mldiego commented 1 year ago

@huanzhang12 I understand the purpose of the competition is to have a comparison between the tools. Let's imagine a scenario where every participant proposes two benchmarks that are only supported by their respective tool. The competition will have some "scores" to compare tools with, but there is no real comparison, as each tool achieved its score on a different benchmark from the others. Imo, to avoid scenarios like these, having a rule where scored benchmarks should be supported by at least two tools would be beneficial for the competition.

I also disagree with @shizhouxing about efforts being wasted on unscored benchmarks, in the sense that I don't think it would be a waste of time to support some of the unscored benchmarks (those with limited tool support). That work can still be reflected in the report, which can highlight the tools that support some of these benchmarks. But for the sake of having a "fair" comparison, I believe two or more tools should support the benchmarks.

shizhouxing commented 1 year ago

> I also disagree with @shizhouxing about efforts being wasted on unscored benchmarks, in the sense that I don't think it would be a waste of time to support some of the unscored benchmarks (those with limited tool support). That work can still be reflected in the report, which can highlight the tools that support some of these benchmarks. But for the sake of having a "fair" comparison, I believe two or more tools should support the benchmarks.

I think people should know which benchmarks are going to be scored ahead of time, if some benchmarks will end up being excluded. If we had plenty of time, then it would be fine to also support some unscored benchmarks. But there are only a few weeks before the submission deadline now, and many people are very busy in the meantime (e.g., it's the season for summer internships right now for many people) and have limited time to spend on this competition. We should know what to prioritize during this short period of time.

ttj commented 1 year ago

Again, I think this discussion will be moot, as it has been in every other iteration, since supporting the largest number of benchmarks is incentivized in the scoring. Which benchmark(s) in particular are you concerned will not be scored?

If any other tools plan to support it, they can post now to provide clarity, or if there are any other opinions or concerns, please provide input. Otherwise the organizers will discuss over the weekend and clarify early next week, once we see all the benchmarks nominated by AOE today.

Neelanjana314 commented 1 year ago

> @Neelanjana314 @ChristopherBrix I see that the Collins-RUL-CNN benchmark from 2022 is being nominated. Is it going to be used? If yes, where do you plan to take it from? Let me know, because I may need to implement a small fix there before you proceed. Thanks.

@regkirov We are referring to this repository, the one you used last year.

shizhouxing commented 1 year ago

> Which benchmark(s) in particular are you concerned will not be scored?
>
> If any other tools plan to support it,

I personally don't have a particular benchmark in mind right now. On the contrary, if any participant/organizer is concerned about scoring some particular benchmark that they would like to exclude, I think they should post it for discussion, and it would be better to make it clear to everyone in advance.

ttj commented 1 year ago

> Which benchmark(s) in particular are you concerned will not be scored?
>
> If any other tools plan to support it,
>
> I personally don't have a particular benchmark in mind right now. On the contrary, if any participant/organizer is concerned about scoring some particular benchmark that they would like to exclude, I think they should post it for discussion, and it would be better to make it clear to everyone in advance.

Personally, I don't have any concerns, and as I have said repeatedly, I think this discussion will be moot in the end. I believe your team raised the issue in the first place, so I was presuming you had some concern regarding one of the benchmarks you proposed for scoring.

huanzhang12 commented 1 year ago

For our team, I have no particular benchmark of concern. I am concerned about the approach of discouraging tools from supporting more benchmarks. It may discourage novel applications of NN verification and also disappoint non-participant and/or industrial benchmark proposers (whose benchmarks will likely require more effort to support, @regkirov @pomodoromjy). This would decrease the interest in proposing novel benchmarks and finding new applications next year, and teams would be reluctant to build more general and practical tools. It doesn't sound positive to me.

As @ttj mentioned, adding this new rule would be moot, and it’s rather late to add a controversial rule not documented before. It would be better to discuss this next year instead, as @stanleybak suggested. In my opinion, adding this rule would also create the dilemma of whether a team should spend time supporting a novel benchmark. It’s a bit like gambling, and positive efforts may not be rewarded. I don’t think this dilemma helps the community.

I think the purpose of this competition is to push the boundaries of NN verification, create more publicity for this small community, and build connections with outside researchers/practitioners. I understand @mldiego's point about "comparison", which is one narrow aspect of the competition, but there are many repurposed benchmarks serving that role (e.g., ACASXu, Collins-RUL, VGGNet, NN4Sys). A better approach could be keeping the benchmarks supported by most tools from the previous year. This is certainly much less controversial. It is also fair because all benchmarks are proposed publicly, and everyone knows which benchmarks will be scored and has an equal opportunity to work on all of them.

I am actually worried that teams will tend to be reluctant to support novel applications, and many excellent new benchmarks would not be selected under the current mechanism. Last year, the organizers did an excellent job of making sure every newly proposed benchmark was scored. I will be surprised and disappointed if this year we take a step back and discourage teams from supporting more benchmarks.

ttj commented 1 year ago

There is no discouragement of considering additional benchmarks, and it is incentivized in the scoring: the highest-scoring teams over all iterations of the competition have supported the largest number of benchmarks. The scoring is set up to incentivize considering more benchmarks. This mechanism of requiring more than 1 supporting tool has, however, been in place from the beginning of the competition, and we have discussed the concerns now raised in @huanzhang12's post each year, as they have come up several times. We are sorry we forgot to add it to the rules document, but it was discussed in the meeting, posted about here, etc., and has existed in all iterations of the competition, as it is also a fairness consideration.

Any team that did not understand this mechanism previously is welcome to propose a different benchmark for scoring if they so choose. An alternative possibility, as we have done in some prior iterations when similar discrepancies have arisen, would be to do the scoring and present the results both ways: with all nominated benchmarks included, and with only the benchmarks analyzed by at least 2 tools. This is probably the least controversial path and the one we will likely take, as we have done in the past. We will now discuss internally as organizers and give a final recommendation early next week, as we realize time is running short.

In my view, no effort is wasted on considering the proposed benchmarks, as a side intention of this event is that people consider benchmarks from this competition in the papers they write, and thus supporting a broader set of benchmarks in their tools is an incentive in itself (e.g., since the organizers cannot win, this is why we also try to participate). We also plan to rerun the prior iteration's benchmarks, as discussed in the organizational meeting, but whether they are scored or not is based on participant nomination.

For future iterations, we will further mention that this is a volunteer event, which we hope is useful and beneficial for the community, and anyone is welcome to join the organizing team in the future to help set the policies and procedures; but the organizers do have an obligation to ensure fairness from multiple perspectives. While we recognize one team's opinion at this time (given that no one else is currently raising a concern about this rule), a contrasting opinion could be that certain teams may propose benchmarks they know only their tool is likely to support, which is of course also potentially unfair, as it helps maximize the number of benchmarks only they support. This exact argument has come up repeatedly in prior iterations of the competition and is why we have always considered requiring at least 2 supporting tools (which in practice has always happened because of the way scoring is defined).

In the scoring there are also a variety of mechanisms in place to attempt to mitigate and balance these counteracting considerations, but nothing is perfect. So, anyone is more than welcome to help organize the next iteration and thus take on the responsibility and obligation of helping to ensure fairness for all participants, including setting and interpreting the rules.

Finally, on the broader point about benchmarking: there are also other mechanisms for collecting benchmarks, see e.g. the event linked below, which has been mentioned a few times, which anyone is welcome to reach out to me about, and which incentivizes benchmark creation through publication. Of course it would be great if all proposed benchmarks could be considered for scoring, etc. at each iteration, and I think many teams will try based on this year's nominations, but it is not always feasible if benchmarks are based around architectures or layers for which there are no existing tools or mechanisms, or which require substantial modifications. For VNN-COMP to be that mechanism for benchmark collection and curation was at least my original hope, but the shift toward a more serious competition and scoring that became necessary after the 1st iteration unfortunately shifted that goal somewhat as well.

https://aisola.org/tracks/c1/

wu-haoze commented 1 year ago

We (the Marabou team) nominate the dist-shift benchmarks and the traffic signs recognition benchmarks for scoring.

wu-haoze commented 1 year ago

> @ChristopherBrix These two networks from the nn4sys benchmark are git-lfs objects: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d_dual.onnx.gz
>
> The following command has no effect: git lfs pull --include="mscn_2048d.onnx"
>
> Could some guidance be provided about how to download them?

@lydialin1212, could you provide some pointers on how to access your networks that are LFS objects? When I use git lfs pull, I get an error:

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

naizhengtan commented 1 year ago

@ttj @huanzhang12 @stanleybak Thanks for the discussion. This is a good conversation. As a user of NN verifiers and a researcher applying NN verification in my research, I do believe providing diverse benchmarks will significantly improve the visibility and usability of NN verifiers in other areas. Seriously benchmarking (and scoring) today's verifiers on relevant, real-world benchmarks will let people understand the state of the art and lower the burden on whoever is trying to use these verifiers. So, I definitely vote for scoring the benchmarks that are real and are backed by applications/systems/papers/real-world usage.

huanzhang12 commented 1 year ago

Thank you @ttj for this detailed discussion. I wanted to discuss this undocumented rule because this year's situation is changing. Unlike previous years, this year there are a few fairly complex benchmarks, such as those proposed by @regkirov, @pomodoromjy, @apostovan21. They require a significant amount of effort to support, especially since there are only about two weeks before the submission deadline. It is not unlikely that one team puts in a significant effort to support one of these benchmarks while other teams do not get a chance to work on that particular benchmark. There could also be two or more teams each working on a different hard benchmark, who then end up getting no reward at all. This does not sound right to me.

Why does adding the rule sound bad?

  1. I predict more interest in this competition and more complex and challenging benchmarks being proposed in future years (which is also what we all hope for), especially as people from industry become more aware of this field and willing to try their applications. Next year we may see more benchmarks requiring significant effort to support, each of which can only realistically be supported by 1 - 2 teams. Adding this rule provides increasingly negative feedback to both benchmark proposers and participating teams for supporting novel applications in future years.

  2. There are a few mentions that the rule is for fairness. There might be an adversarial setting, where each team proposes very obscure benchmarks that only work for their own tool and there is no way, or it makes no sense, for other tools to support them at all. However, at this stage, all the proposed benchmarks are realistic and they are all great, so this is clearly not the case. The rule does not serve its original purpose at all. Instead, I see this rule as being unfair to the teams who spend a lot of time supporting challenging benchmarks.

  3. Under this rule, we don't even know which benchmarks will be scored. The outcome of the competition becomes random. The difference between one benchmark being scored or not can have a big impact on the final ranking. If there are 2 hard benchmarks, let's say team A picks up benchmark 1 and teams B and C happen to both support benchmark 2; then it is very likely that team A will not win the competition, even if they work as hard as teams B and C. If team B happens to pick up benchmark 1 instead, then team C will lose. The competition works better if it is a fair evaluation of each team's effort rather than a gambling game.

> In my view, no effort is wasted on considering the proposed benchmarks, as a side intention of this event is that people consider benchmarks from this competition in the papers they write, and thus supporting a broader set of benchmarks in their tools is an incentive in itself

  1. Given the current situation in academia, it is very hard to convince a student to work on coding and engineering without external incentives. I can’t convince a student to work on something that cannot be published and cannot earn the student any honor. I can’t tell the student to work on this just to “support broader benchmarks”; they would simply walk away. I hope this competition provides this incentive to support tool development in academia, but this particular rule change will discourage students from participating in future competitions because they feel their efforts might be wasted. This is exactly what @shizhouxing (a student in my team) said.

> a contrasting opinion could be that certain teams may propose benchmarks they know only their tool is likely to support, which is of course also potentially unfair, as it helps maximize the number of benchmarks only they support. This exact argument has come up repeatedly in prior iterations of the competition and is why we have always considered requiring at least 2 supporting tools (which in practice has always happened because of the way scoring is defined).

I completely agree organizers need to maintain the fairness of the competition, and that’s why this rule was discussed. However, what the rule was designed to prevent is not happening at all, but its side effects of making the competition unfair and random, hindering novel applications, and discouraging student participation are obvious. I see no clear benefits to implementing this rule at all, and I hope my reasons make sense to you.

lydialin1212 commented 1 year ago

@anwu1219 Hi, most parts of our benchmark are the same as last year. We are very willing to provide you with our models.

For the mscn_2048d.onnx model, you can download it from last year's VNN-COMP repo: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz We have also updated our repo to include the model; you can find it here: https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx

For the mscn_2048d_dual.onnx model, as it is larger than 100 MB, we have to store it with LFS. You can find the LFS link in this repo: https://github.com/Khoury-srg/VNNComp22_NN4Sys/blob/master/model/mscn_2048d_dual.onnx or https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx I think the first one is not over its data quota currently.

Or Google Drive links: mscn_2048d.onnx, mscn_2048d_dual.onnx

Hope it helps!

merascu commented 1 year ago

Hi all!

We (@apostovan21 and I) submitted the traffic signs benchmark to see whether some of the tools could handle the layers (binarized convolutions, batch normalization, max pooling, fully connected) and the robustness properties to be verified, and to see if it is of interest to the community. We didn't even consider that our benchmark might not be included in the evaluation. :(

I thought this competition was about advancing the state of the art. Why aren't all proposed benchmarks included in the evaluation (and scoring) to see what the competing tools can and cannot do?

ttj commented 1 year ago

> We (@apostovan21 and I) submitted the traffic signs benchmark to see whether some of the tools could handle the layers (binarized convolutions, batch normalization, max pooling, fully connected) and the robustness properties to be verified, and to see if it is of interest to the community. We didn't even consider that our benchmark might not be included in the evaluation. :(
>
> I thought this competition was about advancing the state of the art. Why aren't all proposed benchmarks included in the evaluation (and scoring) to see what the competing tools can and cannot do?

The benchmark will certainly be considered in the evaluation and presented in the report, presentation, etc.; the scoring, though, is based on nomination by the tool participants. Including everything and pushing boundaries is certainly the goal, but as this is a competition, there are procedures for the scoring of the competition part, in particular to attempt to prevent bad behavior on the part of participants, some of whom are highly motivated to win in spite of the broader considerations of the community. As alluded to in some earlier posts, the organizers' goal was for this to be a friendly competition, but there has unfortunately been some bad behavior in the past that necessitated trying to make things fair through the way the rules and scoring are set up. From the rules document, this is the procedure regarding scoring:

https://docs.google.com/document/d/1oF2pJd0S2mMIj5zf3EVbpCsHDNs8Nb4C1EcqQyyHA40/

"Each tool’s group is allowed to nominate two benchmarks to be used for scoring. The suggestion is to propose one internally, and nominate one from an outside group, although since we don’t know how many external benchmarks there will be, it’s allowed to simply propose two benchmarks. Please propose benchmarks that are different in some ways, though, rather than just duplicating a benchmark your tool works well on in order to maximize score.

Non-tool participants (such as industry groups interested in using verification tools) can also propose benchmarks. To count for scoring a participating tool must select the benchmark, however."

Now, it is quite likely that benchmarks not scored right now may be considered for scoring in future iterations, and all will be cited in the report regardless, overviewed in the presentation, etc. Hopefully some participants will have time to consider all the benchmarks, but we cannot make anyone do anything; it is up to participants, based on their own time, constraints, etc. We are of course happy to discuss further.

As it currently stands, it seems only 4 tools have nominated benchmarks for scoring, which is quite a bit fewer than in prior years (there are something like 20 tools registered). We will send out an email, as perhaps some have not been monitoring the GitHub issues, which may alleviate some of the concerns discussed here.

ttj commented 1 year ago

I have compiled the currently nominated benchmarks to be scored; they are as follows. If I have missed anything, please let us know. I have also just emailed the listserv, as there are ~20 tools registered but only ~~5~~ 7 have nominated for scoring at this time, and to solicit further feedback on the scoring discussion. We are extending the deadline for nomination of scored benchmarks to tomorrow, AOE end of day, based on the current status, as some may not have been following the GitHub issues.

AlphaBetaCrown
- ViT: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vit
- NN4sys 2023: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/nn4sys

Marabou
- dist-shift benchmarks: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/dist_shift
- traffic signs recognition: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition

NNV (@ChristopherBrix please pull in to the 2023 repository based on the discussions above)
- AcasXu: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/acasxu
- Collins-RUL: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/collins_rul_cnn

nnenum
- cGAN: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan
- VGGNET2023: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vggnet16

GravityNN
- ml4acopf: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf

FastBATLLNN (@ChristopherBrix I did not see this in the repository at your link; please update / confirm whether it is identical to the 2022 version or not)
- tllverifybench: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/tllverifybench

DPNeurifyFV
- AcasXu (previously nominated by NNV): https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/acasxu
- cGAN (previously nominated by nnenum): https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan


ttj update 6/20/2023 12:15pm eastern US: added FastBATLLNN and DPNeurifyFV


ttj update 6/17/2023 4:25pm eastern US: added gravityNN nominating ml4acopf

@Z-Haoruo I could not find which tool you are participating with; can you please say on behalf of which tool you are nominating the ml4acopf benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf )?

phK3 commented 1 year ago

Is there an overview of the benchmarks available for nomination? Are benchmarks from previous editions of VNN-COMP automatically available?