stanleybak opened this issue 1 year ago
Thanks! Somehow, we were not aware of this file. Is it possible to see if any of the instances was solved within 5 minutes?
@ChristopherBrix Thanks for the update. We have changed the timeout setting to 350s per instance, thus the total timeout would be larger than 3 hours.
@ChristopherBrix Thank you! The total timeout is set to be above 3 hours for our benchmark ml4acopf.
The benchmarks CCTSDB-YOLO and ml4acopf have been updated in the repository. They can now be used by submitted tools. Thank you @pomodoromjy, @Z-Haoruo
@merascu: How long each instance takes to run will depend on the tools - the total timeout (the sum of all timeouts of all instances in the instances.csv file) should be at most 6 hours. That way, even if a tool times out on every single instance, it still finishes reasonably fast.
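Several posts in this thread adjust per-instance timeouts to meet this budget. As a purely illustrative sketch (assuming the usual three-column onnx,vnnlib,timeout layout of instances.csv with no header row; the helper name is hypothetical), the check could look like:

```python
import csv
import io

def total_timeout_seconds(csv_text: str) -> float:
    """Sum the per-instance timeouts (third column) of an instances.csv."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(float(row[2]) for row in reader if row)

# Example: 60 instances at 300s each -> 18000s (5 hours), within the budget.
example = "\n".join(
    "onnx/net.onnx,vnnlib/prop_%d.vnnlib,300" % i for i in range(60)
)
assert total_timeout_seconds(example) == 18000
assert total_timeout_seconds(example) <= 6 * 3600  # 6-hour limit
```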
@mldiego The ONNX and VNNLIB files without the custom op have now been uploaded here. The input of the ONNX model is the image, and the VNNLIB spec is also at the pixel level, after the custom projection op, as the description shows. Let me know if there are any problems.
@ChristopherBrix Thanks for the update. We have changed the timeout to 210s per instance to make the total timeout less than 6 hours.
@ChristopherBrix Thanks for the update! We have reduced the number of instances to 60, with a 300s timeout for each, so the total time should be about 5 hours.
We aim to finalize the scored benchmarks by next Friday, 6/16 AOE, so that tool participants may test, etc. Please post here if you are nominating a benchmark for official scoring; refer to the guidelines/rules for further details if needed, or let us know if you have any questions.
metaroom has been updated.
@apostovan21 Please make --new_instances the default for your script, or remove the default content of the instances.csv file; otherwise, those are part of the generated benchmark and cause the total timeout to be too large.
@ChristopherBrix Thanks! I've updated the benchmarks as discussed offline.
@ChristopherBrix I've updated the benchmarks; I hope it's fine now!
Confirmed, it's updated and works!
Most of this year's benchmarks use some ONNX operators that are not directly supported by MATLAB, so we would like to nominate 2 of last year's benchmarks:
@ChristopherBrix: Do we need to add anything for the submission site?
To help us load the models in MATLAB, would it be possible to save the ONNX models in an earlier opset version, e.g., opset 13? We tried to load the models with onnx (Python) and to save them in an earlier opset version, but that didn't work for many of them. Could we get the original training files (PyTorch, TF) or the models with opset 13? That could help us avoid some of the issues we are having loading the models in MATLAB.
I'm happy to nominate the two from our group:
cGAN - https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1545961792 (@feiyang-cai)
VGGNET2023 - https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1540648827 (@stanleybak)
@ChristopherBrix Could you please activate my account?
@Neelanjana314 Done! I'm also working on setting up email notifications for this, to reduce how long it takes me to spot new accounts.
All: reminder to nominate benchmarks for scoring by tomorrow. Currently I count only 4 nominated by @stanleybak and @Neelanjana314 unless I missed any. Reminder that tool participants may nominate up to 2 benchmarks to be scored.
@ChristopherBrix In the cgan benchmark, there is a cGAN_imgSz32_nCh_3_small_transformer.onnx model in instances.csv, but this model seems to be missing in the onnx folder?
I have reviewed my own repo and can confirm the presence of this model file. I believe the loss occurred while pushing it into the final repo. @ChristopherBrix , could you please verify it?
@ChristopherBrix These two networks from the nn4sys benchmark are git-lfs objects: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d_dual.onnx.gz
The following command has no effect:
git lfs pull --include="mscn_2048d.onnx"
Could some guidance be provided about how to download them?
Hi @regkirov,
Thanks for your effort in proposing the Collins-YOLO-robustness benchmark. I have been attempting to load your model and vnnlib; however, I have encountered two issues that require attention.
The declaration of x ends at index 409599; however, the range of x consists of 1228799 pairs (I guess this is due to the missing channel dimension in the inputs).
The verification property is organized as assert (or (and) (and) ...), which means that if any property within the "or()" statement is satisfied, the verification result will be "sat" (counterexample found). This appears to contradict the description provided in the benchmark_description.pdf document found: here. This issue is serious, since it makes the benchmark trivial (one can easily find a clean input as a "counterexample").
Considering the deadline is coming soon, if these issues can be fixed very soon, it is still possible for participants to analyze and make informed voting decisions. Otherwise, it will be very hard to determine whether the current version is suitable for this year's competition.
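To make the second point concrete, here is a purely illustrative VNNLIB fragment (not taken from the benchmark files): with a single top-level disjunction, the query is "sat" as soon as any one branch can be satisfied by some input in the declared box, whereas separate top-level asserts are implicitly conjoined.

```
; Hypothetical fragment for illustration only.
; Top-level OR: a counterexample needs to satisfy just ONE branch.
(assert (or
    (and (>= Y_0 Y_1) (>= Y_0 Y_2))    ; branch 1
    (and (>= Y_1 Y_0) (>= Y_1 Y_2))))  ; branch 2

; By contrast, separate top-level asserts form an implicit AND:
; (assert (>= Y_0 Y_1))
; (assert (>= Y_0 Y_2))
```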
@ChristopherBrix @stanleybak @ttj Today being the final day for benchmark decision, should I assume the nominated benchmarks are confirmed? Or, should I wait till the EOD for updates from the organizers?
@ChristopherBrix Could you help us update the YOLO benchmark (https://github.com/xiangruzh/Yolo-Benchmark)?
We fixed issues in the vnnlib files; there was a bug with the tensor flattening order. In addition, we also made the model smaller (still the same architecture) and reduced the number of properties to verify, so hopefully more tools can run our benchmark. Thanks!
Hi @KaidiXu - thanks for pointing out these issues. We will check ASAP. I will keep you posted.
We (the alpha-beta-CROWN team) nominate two benchmarks from our team:
Many other proposed benchmarks are excellent, such as the metaroom and the YOLO benchmarks. These are exciting novel applications of neural network verification, and I hope the organizers can figure out a way to include all (working) NEW benchmarks (e.g., asking other participants to nominate these benchmarks). We hope to encourage benchmark proposers to work on novel applications and new benchmarks each year.
For it to count as a scored benchmark we'd want at least two tools that support it.
@stanleybak Is this rule new this year? I might have missed something, but I couldn't find it in the rules document. I think this can discourage teams from supporting new benchmarks, because if one tool is the only tool supporting some new benchmark, the team gets no reward for their effort. We do hope to see tools become more general and support more benchmarks in different domains, and the competition should hopefully provide incentives for this goal.
@huanzhang12 You're right, I don't see it in the rules document this year. I had thought we had this last year as well. It's probably best to stick with the rules document in that case and address this next time.
@KaidiXu After some quick investigation of Collins-YOLO-robustness I am responding to your points:
Thanks a lot for the feedback!
@Neelanjana314 @ChristopherBrix I see that Collins-RUL-CNN benchmark from 2022 is getting nominated. Is it going to be used? If yes, where do you plan to take it? Let me know because I may need to implement a small fix there before you proceed. Thanks.
I think we need to discuss this as organizers. We have never scored a benchmark only supported by 1 tool, and we did discuss this in the organizational meeting and in the benchmark discussion. So I suggest it must be supported by more than 1 tool
I think participants need to know what benchmarks will be scored ahead of the submission deadline. Otherwise some participants may unnecessarily prioritize on unscored benchmarks.
I think, in essence, it is possible this may not be known until tools are submitted and scoring is done: if no other tool supports it, it would have been nominated, but unscored. In practice this has not happened, though, as in all iterations of the competition at least 2 tools have analyzed every benchmark. So, it is fine for it to be nominated now, but know this could be the outcome.
I also agree with @stanleybak it is best to stick with the current rules document. Next year we can potentially discuss more about whether to add a new rule about it.
@ttj Actually, I believe every benchmark was already evaluated against the baseline randgen tool, since it is part of the submission process @ChristopherBrix built. So technically, as long as one other tool supports it, there are two tools supporting this benchmark. I hope this makes sense.
Honestly, I think this rule is a bit tricky because, technically, one can create a dummy tool that claims to support all benchmarks but simply produces random outcomes, or a tool like randgen (which is totally legit), making this rule ineffective. In addition, this rule may discourage teams from supporting new benchmarks, as I mentioned above.
For the sake of fairness, I think everyone needs to know which benchmarks will be scored before scoring. Many organizers are also participating in the competition, so if they have the right to remove benchmarks after scoring, they would have an additional advantage. Of course, one exception is that it is reasonable to remove the benchmarks supported by only randgen and no other tools, since it would not affect the scores of any team.
I am on phone, so getting hard to reply as this gets longer. For the latter point, the organizers can participate, but not win, per the rules and set up for COI reasons obviously.
We will discuss and clarify over the weekend. Of course, the way the rules were set up, supporting as many benchmarks as possible is incentivized, as it increases the max score, so I assume this discussion will likely be moot, as it has been in past iterations, and all benchmarks will be scored, as at least 2 tools will likely consider all of them.
@huanzhang12 I understand the purpose of the competition is to have a comparison between the tools. Let's imagine a scenario where every participant proposes two benchmarks that are only supported by their respective tool. The competition will have some "scores" to compare tools with, but there is no real comparison as each tool achieved their score in a different benchmark from one another. Imo, to avoid scenarios like these, having a rule where scored benchmarks should be supported by at least two tools would be beneficial for the competition.
I also disagree with @shizhouxing (waste efforts on unscored benchmarks) in the sense that it would be a waste of time to support some of the unscored benchmarks (due to limited tool support). These can still be reflected in the report and highlight the tools that support some of these benchmarks. But for the sake of having a "fair" comparison, I believe two or more tools should support the benchmarks.
I think people should know ahead of time which benchmarks are going to be scored, if some benchmarks will end up being excluded. If we had plentiful time, then it would be fine to also support some unscored benchmarks. But there are only a few weeks before the submission deadline now, and many people are very busy (e.g., it's summer internship season right now for many people) and have limited time to spend on this competition. We are supposed to know what we should prioritize during this short period of time.
Again, I think this discussion will be moot as it has been in every other iteration, as supporting the largest number of benchmarks is incentivized in the scoring. Which benchmark(s) in particular are you concerned will not be scored?
If any other tools plan to support it, they can post now to provide clarity, or if there are any other opinions or concerns please provide input, but otherwise the organizers will discuss over the weekend and clarify early next week once we see all the nominated benchmarks by AOE today
@regkirov We are referring to this repository, the one you used last year.
Which benchmark(s) in particular are you concerned will not be scored?
If any other tools plan to support it,
I personally don't have a particular benchmark in mind right now. On the contrary, if any participant/organizer is concerned about the scoring of some particular benchmark that they would like to exclude, I think they should post for discussion, and it would be better to make it clear to everyone in advance.
Personally, I don't have any concerns, and as I have said repeatedly, I think this discussion will be moot in the end. I believe your team raised the issue in the first place, so I was presuming you have some concern regarding one of the benchmarks you proposed for scoring.
For our team, I have no particular benchmark of concern. I am concerned about the approach of discouraging tools from supporting more benchmarks. It may discourage novel applications of NN verification and also disappoint non-participant and/or industrial benchmark proposers (whose benchmarks will likely require more effort to support; @regkirov @pomodoromjy). By doing this, interest in proposing novel benchmarks and finding new applications would decrease next year, and teams would be reluctant to build more general and practical tools. It doesn't sound positive to me.
As @ttj mentioned, adding this new rule would be moot, and it’s rather late to add a controversial rule not documented before. It would be better to discuss this next year instead, as @stanleybak suggested. In my opinion, adding this rule would also create the dilemma of whether a team should spend time supporting a novel benchmark. It’s a bit like gambling, and positive efforts may not be rewarded. I don’t think this dilemma helps the community.
I think the purpose of this competition is to push the boundaries of NN verification, create more publicity for this small community, and build connections with outside researchers/practitioners. I understand @mldiego's point about "comparison", which is one narrow aspect of the competition, but there are many benchmarks repurposed to serve that role (e.g., ACASXu, Collins-RUL, VGGNet, NN4Sys). A better approach could be keeping the benchmarks supported by most tools from the previous year. This is certainly much less controversial. It is also fair, because all benchmarks are proposed publicly, and everyone knows which benchmarks will be scored and has an equal opportunity to work on all of them.
I am actually worried that teams will tend to be reluctant to support novel applications, and many excellent new benchmarks would not be selected under the current mechanism. Last year, the organizers did an excellent job of making sure every newly proposed benchmark was scored. I will be surprised and disappointed if this year we take a step back and discourage teams from supporting more benchmarks.
There is no discouragement of considering additional benchmarks, and it is incentivized in the scoring: the highest-scoring teams over all iterations of the competition have supported the largest number of benchmarks. The scoring is set up to incentivize considering more benchmarks. This mechanism of requiring more than 1 tool to support a benchmark has, however, been in place from the beginning of the competition, and we have discussed each year the concerns raised in @huanzhang12's post, as they have come up several times. We are sorry we forgot to add it to the rules document, but it was discussed in the meeting, posted about here, etc., and has existed in all iterations of the competition, as it is also a fairness consideration.
Any team is welcome to propose a different benchmark for scoring if they so choose, if they did not understand this mechanism previously. An alternative possibility, as we have done in some prior iterations when similar discrepancies have arisen, would be to do the scoring both ways and present the results both ways: with all nominated benchmarks included, and with only benchmarks analyzed by at least 2 tools. This is probably the least controversial path, and the one we will take, as we have done in the past. We will now discuss internally as organizers and make a final recommendation early next week, as we realize time is running short.
In my view, no effort is wasted on considering the proposed scored benchmarks, as a side intention of this event is so that people in papers they write consider benchmarks from this competition, and thus supporting broader benchmarks in their tools is incentive itself (e.g., as the organizers cannot win, this is why we also try to participate). We also do plan to rerun the prior iteration benchmarks as discussed also in the organizational meeting, but whether they are scored or not is based on participant nomination.
For future iterations, we will further mention, this is a volunteer event, which we hope is useful and beneficial for the community, and anyone is welcome to join the organization team in the future to help set the policy and procedures, but the organizers do have an obligation to ensure fairness from multiple perspectives. While we recognize the opinion of one team's perspective at this time (given no one else is raising a concern about this rule at this time at least), a contrasting opinion could be that certain teams may propose benchmarks they know it is likely only their tool will support, which is of course also potentially unfair, as it will help to maximize the # of benchmarks only they support. This exact argument has come up repeatedly in prior iterations of the competition and is why we have always considered at least 2 tools supporting (and in practice always happened because of the way scoring is defined).
In the scoring there are also a variety of mechanisms in place to attempt to mitigate and balance these counteracting considerations, but there is nothing perfect. So, one is more than welcome to help organize the next iteration and thus have the responsibility and obligation to help ensure fairness for all participants, including setting and interpreting rules.
Finally, on the broader point of benchmarking, there are also other mechanisms for collecting benchmarks; see, e.g., this event that has been mentioned a few times, which anyone is welcome to reach out to me about (and which incentivizes benchmark creation through publication). Of course, it would be great if all proposed benchmarks could be considered for scoring, etc. at each iteration, and I think many will try based on this year's nominations, but it is not always feasible if they are based on architectures or layers for which there are no existing tools or mechanisms, or which require substantial modifications. For VNN-COMP to be that mechanism for benchmark collection and curation was at least my original hope, but the shift toward more serious competition and scoring that became necessary after the 1st iteration unfortunately shifted that goal somewhat as well.
We (the Marabou team) nominate the dist-shift benchmarks and the traffic signs recognition benchmarks for scoring.
@lydialin1212, could you provide some pointers regarding how to access your networks that are LFS objects? When I use git lfs pull I get an error:
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
@ttj @huanzhang12 @stanleybak Thanks for the discussion; this is a good conversation. As a user of NN verifiers and a researcher applying NN verification in my research, I do believe providing diverse benchmarks will significantly improve the visibility and usability of NN verifiers in other areas. Seriously benchmarking (and scoring) today's verifiers on relevant, real-world benchmarks will let people understand the state of the art and lower the burden for whoever is trying to use these verifiers. So I definitely vote for scoring the benchmarks that are real and are backed by applications/systems/papers/real-world usage.
Thank you @ttj for this detailed discussion. I wanted to discuss this undocumented rule because this year's situation is changing. Unlike previous years, this year there are a few fairly complex benchmarks, such as those proposed by @regkirov, @pomodoromjy, @apostovan21. They require a significant amount of effort to support, especially since there are only about two weeks before the submission deadline. It is not unlikely that one team puts in significant effort to support one of these benchmarks while other teams do not get a chance to work on that particular benchmark. There could also be two or more teams each working on a different hard benchmark, and they end up getting no reward at all. This does not sound right to me.
Why does adding the rule sound bad?
I predict more interest in this competition, and more complex and challenging benchmarks being proposed in future years (which is also what we all hope for), especially as people from industry become more aware of this field and willing to try their applications. Next year we may see more benchmarks requiring significant effort to support, each of which can realistically be supported by only 1 - 2 teams. Adding this rule creates increasingly negative feedback for both benchmark proposers and participating teams who support novel applications in future years.
There have been a few mentions that the rule is for fairness. There might be an adversarial setting, where each team proposes very obscure benchmarks that only work for their own tool and there is no way, or it does not make sense, for other tools to support them at all. However, at this stage, all benchmarks proposed are realistic and they are all great, so that is clearly not the case. The rule does not serve its original purpose at all. Instead, I see this rule as being unfair to the teams who spend a lot of time supporting challenging benchmarks.
Under this rule, we don't even know which benchmarks will be scored; the outcome of the competition becomes random. The difference between one benchmark being scored or not can make a big impact on the final ranking. If there are 2 hard benchmarks, say team A picks up benchmark 1 while teams B and C happen to both support benchmark 2, then it is very likely that team A will not win the competition, even if they work as hard as teams B and C. If team B happens to pick up benchmark 1 instead, then team C will lose. The competition works better as a fair evaluation of each team's effort rather than a gambling game.
In my view, no effort is wasted on considering the proposed scored benchmarks, as a side intention of this event is so that people in papers they write consider benchmarks from this competition, and thus supporting broader benchmarks in their tools is incentive itself
a contrasting opinion could be that certain teams may propose benchmarks they know it is likely only their tool will support, which is of course also potentially unfair, as it will help to maximize the # of benchmarks only they support. This exact argument has come up repeatedly in prior iterations of the competition and is why we have always considered at least 2 tools supporting (and in practice always happened because of the way scoring is defined).
I completely agree that organizers need to maintain the fairness of the competition, and that's why this rule was discussed. However, what the rule was designed to prevent is not happening at all, while its side effects of making the competition unfair and random, hindering novel applications, and discouraging student participation are obvious. I see no clear benefit to implementing this rule, and I hope my reasons make sense to you.
@anwu1219 Hi, most parts of our benchmark are the same as last year's. We are very willing to provide you with our models.
For the mscn_2048d.onnx model, you can download it from last year's VNN repo: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz We have also updated our repo to include the model; you can find it here: https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx
For the mscn_2048d_dual.onnx model, as it's larger than 100MB, we have to store it with LFS. You can find the LFS link in this repo: https://github.com/Khoury-srg/VNNComp22_NN4Sys/blob/master/model/mscn_2048d_dual.onnx or https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx I think the first one is not over its data quota currently.
Or google drives links: mscn_2048d.onnx , mscn_2048d_dual.onnx
Hope it helps!
Hi all!
We (@apostovan21 and I) submitted the traffic signs benchmark to see whether some of the tools could handle the layers (binarized convolutions, batch normalization, max pooling, fully connected) and the robustness properties to be verified, and to see if it's of interest to the community. We didn't expect that our benchmark wouldn't even be considered in the evaluation. :(
I thought this competition was about advancing the state of the art. Why aren't all proposed benchmarks included in the evaluation (and scoring?) to see what the competing tools can or cannot do?
The benchmark will certainly be considered in the evaluation and presented in the report, presentation, etc.; scoring, however, is based on nomination by the tool participants. Including everything and pushing boundaries is certainly the goal, but as this is a competition, there are procedures for the scoring part, in particular to attempt to prevent bad behavior by participants, some of whom are highly motivated to win in spite of the broader considerations of the community. As alluded to in some earlier posts, the organizers intended this to be a friendly competition, but there has unfortunately been some bad behavior in the past that necessitated making things fair through the way the rules and scoring are set up. From the rules document, this is the procedure regarding scoring:
https://docs.google.com/document/d/1oF2pJd0S2mMIj5zf3EVbpCsHDNs8Nb4C1EcqQyyHA40/
"Each tool’s group is allowed to nominate two benchmarks to be used for scoring. The suggestion is to propose one internally, and nominate one from an outside group, although since we don’t know how many external benchmarks there will be, it’s allowed to simply propose two benchmarks. Please propose benchmarks that are different in some ways, though, rather than just duplicating a benchmark your tool works well on in order to maximize score.
Non-tool participants (such as industry groups interested in using verification tools) can also propose benchmarks. To count for scoring a participating tool must select the benchmark, however."
Now, it is quite likely that benchmarks not scored right now will be considered for scoring in future iterations, and all will be cited in the report and overviewed in the presentation regardless. Hopefully some participants will have time to consider all the benchmarks, but we cannot compel anyone; it is up to each participant's own time and constraints. We are of course happy to discuss further.
As it currently stands, it seems only 4 tools have nominated benchmarks for scoring, which is quite a bit fewer than in prior years (roughly 20 tools are registered). We will send an email out, as perhaps some have not been monitoring the git issues; that may alleviate some of the currently discussed concerns.
I have compiled the currently nominated benchmarks to be scored; they are as follows. If I have missed anything, please let us know. I have just emailed the listserv as well, as ~20 tools are registered but only 7 have nominated for scoring at this time, and to solicit further feedback on the scoring discussion. Based on the current status, we are extending the deadline for nomination of scored benchmarks to tomorrow AOE end of day, as some may not have been following the github issues.
- AlphaBetaCrown
  - ViT: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vit
  - NN4sys 2023: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/nn4sys
- Marabou
  - dist-shift benchmarks: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/dist_shift
  - traffic signs recognition: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition
- NNV (@ChristopherBrix please pull in to the 2023 repository based on the discussions above)
  - AcasXu: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/acasxu
  - Collins-RUL: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/collins_rul_cnn
- nnenum
  - cGAN: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan
  - VGGNET2023: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vggnet16
- GravityNN
  - ml4acopf: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf
- FastBATLLNN (@ChristopherBrix I did not see it in the repository at your link; please update / confirm whether it is identical to the 2022 version)
  - tllverifybench: https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/tllverifybench
- DPNeurifyFV
  - AcasXu (previously nominated by NNV): https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/acasxu
  - cGAN (previously nominated by nnenum): https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan
ttj update 6/20/2023 12:15pm eastern US: added FastBATLLNN and DPNeurifyFV
ttj update 6/17/2023 4:25pm eastern US: added gravityNN nominating ml4acopf
@Z-Haoruo I could not find which tool you are participating with; can you please say on behalf of which tool you are nominating the ml4acopf benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf )?
Is there an overview of the benchmarks available for nomination? Are benchmarks from previous editions of VNN-COMP automatically available?
Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.
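The required per-benchmark generation script could be sketched as follows. This is a hypothetical illustration, not the official template: the file names, instance count, and timeout are placeholders, and the only fixed requirement from the description above is that a random seed deterministically produces the list of instances.

```python
# Hypothetical sketch of a benchmark generation script: given a random
# seed, emit an instances.csv of (onnx, vnnlib, timeout) triples.
# All file names and parameters here are illustrative assumptions.
import csv
import random
import sys

def generate_instances(seed: int, n: int = 10, timeout: int = 300):
    """Deterministically pick instances from a seed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        image_id = rng.randrange(1000)   # e.g. select which image to perturb
        rows.append(("onnx/net.onnx", f"vnnlib/prop_{image_id}.vnnlib", timeout))
    return rows

if __name__ == "__main__":
    seed = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    with open("instances.csv", "w", newline="") as f:
        csv.writer(f).writerows(generate_instances(seed))
```

For an image-classification benchmark the seed would select the images; for other benchmarks it could instead perturb the size of the input set or the specification, as noted above. Keeping the sum of all timeouts within the allowed total is the submitter's responsibility.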
To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:
Update: benchmark submission deadline extended to June 2 (was May 29).