verivital / vnncomp2024

Verification of Neural Networks Competition Main Repository (VNN-COMP'24)

Rules Discussion #1

Open ttj opened 8 months ago

ttj commented 8 months ago

Please discuss rules here, the draft Google docs rules document is available here:

https://docs.google.com/document/d/1bgkx5lnugrwlNzQ2MPRSd47MAkZGJfR9v2jo7oRskd0/edit?usp=sharing

In particular, we need to settle on the benchmark proposal, selection, and scoring processes for this year

ttj commented 7 months ago

Thank you to those who were able to join the meeting; we will make the recording available soon.

After the introduction and overview, there were two primary discussion items.

  1. Selection of benchmarks and tracks/categories for scoring

Anyone may propose benchmarks (tool participants and anyone else); so far, some benchmarks have been submitted in the registration form and are listed in the benchmark proposal issue #2

There was a proposal (thanks @KaidiXu !) to create 2 tracks for scoring: a regular track and an extended track.

For these tracks, we will set up a voting mechanism (probably a Google form, similar to registration) to select benchmarks for scoring in the different tracks, with the tracks determined by the number of votes from the tool participants. Any tool participant may vote for any number of the proposed benchmarks. This is a change from past iterations, where each tool nominated 2 benchmarks and at least 2 tools had to submit results for a benchmark to be scored.

For the regular track: any benchmark with at least X votes from participating tools will be scored in the regular track. X is to be decided; we discussed several alternatives (e.g., at least 3 votes, 1/3 of all participating tools, 50%, etc.). I would suggest 50% of tools, so that a benchmark is in the regular track only when a majority of participants basically plan to support it.

For the extended track: any benchmark with at least 2 votes from participating tools and not included in the regular track will be scored in the extended track.

For all other benchmarks (so 0-1 votes): these will be unscored, but can be included in the report.
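To make the proposed thresholds concrete, here is a minimal sketch (in Python, not an official competition script) of how the track assignment could be computed once votes are collected; the function name, the default 50% fraction, and the example benchmark names and counts are illustrative only.

```python
# Hypothetical sketch of the proposed track assignment, assuming vote counts have
# already been collected as {benchmark: number of participating tools voting for it}.
def assign_tracks(votes: dict[str, int], num_tools: int,
                  regular_fraction: float = 0.5, extended_min_votes: int = 2) -> dict[str, str]:
    """Map each benchmark to 'regular', 'extended', or 'unscored'."""
    regular_threshold = regular_fraction * num_tools  # the X votes needed for the regular track
    tracks = {}
    for benchmark, count in votes.items():
        if count >= regular_threshold:
            tracks[benchmark] = "regular"
        elif count >= extended_min_votes:
            tracks[benchmark] = "extended"
        else:
            tracks[benchmark] = "unscored"  # not ranked, but still described in the report
    return tracks

# Illustrative example with made-up numbers: 10 participating tools.
print(assign_tracks({"acasxu": 8, "new_benchmark_a": 3, "new_benchmark_b": 1}, num_tools=10))
# -> {'acasxu': 'regular', 'new_benchmark_a': 'extended', 'new_benchmark_b': 'unscored'}
```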

There was also discussion of whose votes count for the benchmarks: probably only votes from participating tools (those who submit results for at least some benchmark during the evaluation phase) should count, since otherwise non-participants would influence the selection, which could be unfair. This means the track assignments could change after the benchmark proposal phase and only be completely known at the evaluation phase. This likely would not significantly impact the regular track, but could change the extended track.

We would then have rankings/winners in both of these tracks, per the typical scoring we have used over the past couple of iterations.

Otherwise, there were no proposals to substantially change the scoring or the rules around benchmark selection, other than to ensure things are available as early as possible so everyone has enough time to look at them and add support if needed.

If there are further comments/concerns/discussions, please comment below.

  2. Meta-solver and % shared code aspects among participating tools:

This was discussed and has come up in most prior VNN-COMP rules discussions. Some tools may use portions of other tools or libraries developed by other teams; e.g., auto_LiRPA and CORA both came up as examples of libraries that some participating tools use (NNV uses CORA in some cases, and some other tools may be building upon auto_LiRPA, etc.).

It was proposed to mostly eliminate the meta-solver and % duplicate code aspects of the rules, as these have not come up in practice in any prior VNN-COMP in any meaningful way, and instead to ensure any libraries/tools used by any participating tools are properly acknowledged (e.g., via citations to the dependencies in the report). If any strange edge case arises (e.g., submission of multiple tools with just different heuristics, a team just directly reusing another team's tool, etc.), the organizers will handle these cases as they arise and in communication with the participants.

Reasoning: it is too hard to define what counts as substantially similar code bases, and using % lines of code is complicated (and probably infeasible). Further, many tools have many dependencies, and where would we draw the line (e.g., probably everyone uses ONNX libraries, ML frameworks, etc., and these dependencies may dominate the % of the code base that is shared)? It is also common (at least in formal methods research) for subsequent tools to build upon prior tools. Another example would be an SMT-based tool that uses Z3: suppose there are 2 such tools, should they be excluded from the competition (as Z3 would likely dominate the code bases of both)? Probably not, as the higher-level aspects may be important for doing things efficiently, and there is likely significant effort in developing on top of these existing libraries.

So, the general rule of thumb for dependencies and substantially similar tools/meta-solvers: acknowledge all dependencies appropriately, and if anything strange or seemingly unfair arises, we will deal with it when it does (this has not mattered significantly in any prior iteration, so it is somewhat a moot point aside from ensuring appropriate credit/citation/attribution).

If there are further comments or concerns on this, please also raise them below.

  3. We ran out of time for discussion, but @ChristopherBrix and I discussed some of the numerical precision aspects. One point we will try to enforce is ensuring that counterexample witness inputs do in fact lie inside the input set/constraints as defined in the vnnlib spec file, within a tolerance, as this came up a few times during validation last year (a minimal sketch of such a check follows this list).

  4. I think that covered most of the discussion; please comment on any other rule aspects if I missed anything, so we can finalize by the end of April at the latest.
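Regarding point 3 above, here is a minimal sketch of the kind of counterexample witness check we have in mind, assuming the common case where the vnnlib spec gives per-dimension box bounds on the inputs; general vnnlib constraints, parsing of the spec file, and the exact tolerance value are out of scope here, and the names are illustrative.

```python
import numpy as np

def witness_in_input_box(x, lower, upper, tol=1e-6):
    """Check that a counterexample input x satisfies lower[i] <= x[i] <= upper[i]
    for every input dimension, up to an absolute tolerance tol."""
    x, lower, upper = np.asarray(x), np.asarray(lower), np.asarray(upper)
    return bool(np.all(x >= lower - tol) and np.all(x <= upper + tol))

# Illustrative usage with made-up bounds and tolerance:
lb, ub = [0.0, -1.0], [1.0, 1.0]
print(witness_in_input_box([1.0 + 5e-7, 0.5], lb, ub))  # True: within tolerance of the box
print(witness_in_input_box([1.1, 0.5], lb, ub))         # False: clearly outside the input set
```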

ttj commented 7 months ago

The April 3, 2024 rules telecon recording is available here:

https://www.dropbox.com/scl/fi/24a16mp6lblyfr54mhkcn/GMT20240403-135359_Recording_gallery_2400x1600.mp4?rlkey=f3z86wqitxybvuzj09oii4bai&dl=0

ttj commented 6 months ago

We are a little behind the proposed timeline for finalizing the rules, but will do so by the end of this week (May 10; sorry, I have been swamped), and we will also extend the new benchmark submission deadline a bit (probably to May 17). We want to give as much time as possible for benchmarks to be considered, so we need to finalize soon. There has not been any further discussion on the rules, so we will finalize as recommended above unless there are further concerns, suggestions, or comments. For the 2 tracks: as suggested above, let's plan to go with 50% of tools voting for the regular track, and >= 2 tools voting for the extended track, with anything receiving 0-1 votes unscored for rankings but included in the report.

We will set up a voting/proposal mechanism for the benchmarks to be scored, so we all know which ones are under consideration for the competition. As in recent VNN-COMP iterations, any prior VNN-COMP benchmark can be nominated again, so we will work on adding these to the form as well, since the benchmark discussion in issue #2 so far is mostly proposals of new benchmarks. We want to finalize all benchmarks under consideration by late May at the very latest, so everyone has time to consider them.

huanzhang12 commented 6 months ago

@ttj Thank you for leading the discussions about the new rules! The points you made are great. Here are my two cents:

  1. I hope the set of benchmarks (in either the regular or extended track) can be fixed before teams start working on them. There could be cases where a few benchmarks just passed the threshold to be in the regular track; if some participants did not submit their tool and we only count the votes from teams that eventually submit tools, such a benchmark may end up in the extended track. This may considerably change the outcome: if the number of votes per benchmark follows a binomial distribution, a considerable number of benchmarks might be close to the 50% decision boundary, so it is not a rare event (a rough numerical sketch of this boundary effect follows this list). I hope the outcome of this competition will be as deterministic as possible (it is not a sport where people like uncertainty or surprise). So, I propose that after a voting deadline, the set of benchmarks in each track is finalized. For benchmarks in the regular track, since we expect many votes, even if one or two tools do not eventually submit, the benchmark still provides an effective comparison of many tools.

  2. For the extended track, getting two votes to enter may be challenging if the benchmark poses new challenges that most tools cannot handle. Proposing a new benchmark (especially novel ones) requires considerable effort; if the benchmark ends up being unscored, it's likely no teams will work on it. The benchmark proposers will be discouraged from proposing novel and challenging benchmarks. I believe the purpose of this track is to encourage novel and challenging benchmarks that set up goals for researchers in this field to work on, so lowering the threshold will make it more welcoming and serve its purpose better.

  3. It might be good to add a minimum requirement for the number of votes per team (e.g., each team should vote for at least 30% of all proposed benchmarks to qualify, and is encouraged to vote for as many as possible). Otherwise, I am worried the 50% threshold might be too high, since some teams may not vote at all or may only vote for very simple models.

  4. I totally agree with you that the "meta-solver" rule needs to be refined with good reasons. In addition to what you proposed, I propose that in the competition rules, we add a requirement that each tool should clearly disclose whether they used source code derived from existing verification tools or libraries (e.g., we can require a statement included as a separate paragraph after introducing each tool in the report). The purpose is two-fold: first, it helps researchers understand the relationships between the tools (e.g., some new tools might be based on existing ones and share some similar algorithms with improvements in some aspects); disclosing this relationship allows people to have a clearer view of the performance of each tool, rather than treating each tool as entirely new. Second, this also acknowledges the tremendous efforts that existing tool authors made. CORA and auto_LiRPA (part of α,β-CROWN) are good examples here - building an efficient and scalable verification library is not easy. I expect that, in the near future, many new tools will be based on existing ones, so it will be nice to set up a respectful and collaborative culture in this community.
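Regarding the boundary effect mentioned in point 1 above, a rough back-of-the-envelope sketch (under purely illustrative assumptions: each of n tools votes for a given benchmark independently with probability 0.5) of how often a benchmark's vote count would land within one vote of a 50%-of-tools cutoff:

```python
from math import comb

def prob_near_cutoff(n_tools: int, p: float = 0.5, margin: int = 1) -> float:
    """Probability that a benchmark's vote count lands within `margin` votes of a
    50%-of-tools cutoff, assuming each tool votes independently with probability p."""
    cutoff = n_tools / 2
    pmf = lambda k: comb(n_tools, k) * p**k * (1 - p) ** (n_tools - k)
    return sum(pmf(k) for k in range(n_tools + 1) if abs(k - cutoff) <= margin)

# With, say, 12 voting tools, roughly 61% of such benchmarks would sit within
# one vote of the cutoff, so small changes in who participates could flip tracks.
print(round(prob_near_cutoff(12), 3))  # ~0.612
```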

ttj commented 6 months ago

Thanks @huanzhang12 for the suggestions and comments! It would be great if others could share opinions on these suggestions ASAP; we need to finalize the rules shortly (ideally tomorrow, so we do not have to extend again) and have agreement among participants on the rules they are agreeing to.

ttj commented 6 months ago

In the interest of time, as we need to finalize, I am going to mention some people whose GitHub IDs I could find, to gather any final feedback on the rules in case they have not seen this discussion. Please comment if you have any opinion on the proposals:

@hocdot @haiduong0 @toladnertum @wu-haoze @guykatzz @phK3 @sdemarch @stanleybak @kollerlukas
@aliabigdeli

ttj commented 6 months ago

It seems there are no further opinions, so let's finalize as follows and I will try to get the rules doc updated shortly.

1: After voting, the benchmarks and tracks will be fixed for scoring (so by about the end of May), which gives about a month, probably a bit more, to consider everything, so that we have time to assemble everything for presentation at SAIV/CAV on July 22/23. We will not modify the tracks based on which tools submit results. However, we will ensure the process is fair (e.g., if some strange edge case arises, such as tool participants voting and then not participating, hence influencing the competition without active involvement, we will handle it in discussion with the tool participants).

2: For the regular track, let's go with 50% of the votes of the tool participants (handling rounding, etc. in a reasonable way; we will see how it looks when the votes come in). There are some pros and cons, but one viewpoint is that the competition should help us understand performance on a reasonable set of benchmarks supported by most tools, even if the models use standard architectures on ML tasks that have been considered in neural network verification up to now.

There are already significant barriers to entry in the competition (supporting VNN-LIB/ONNX, etc.), and most new tools need to start with relatively standard architectures (ReLUs, etc.). The two tracks mitigate this issue and address the AI/ML community versus FM community differences (e.g., challenge vs. competition), and with point 1 above, the benchmarks under consideration will be set a priori. Broadening architectures is of course a great goal for the community, but this is more likely to happen gradually over time as new ML tasks and architectures are considered (e.g., in papers rather than in this competition).

If any benchmark proposer is disappointed by the outcome of the track assignment, we can still describe the benchmark in the competition report, and I will again advertise AISOLA as being very receptive to papers describing benchmarks (see the VNSAI track: https://2024-isola.isola-conference.org/aisola-tracks/ ); several were presented last year in the SVDNN track: https://link.springer.com/book/10.1007/978-3-031-46002-9#toc

The deadline is flexible enough that if any proposed benchmark does not end up being scored in either the regular or extended track, we could consider a paper describing it for AISOLA.

3: To be included in the extended track for scoring, let's go with any tool voting for the benchmark (so 1 vote suffices to be scored). This also mitigates the potential issues raised in point 2 above and incentivizes supporting new benchmarks.

4: Tool participants should follow standard academic/scholarly courtesy and cite the relevant libraries they build upon in the report. We will add a field in the voting form to indicate what significant dependencies tools have, beyond standard things like ML libraries, ONNX, etc. We can cross-check this against the installation scripts, etc., when results are submitted.

5: A final aspect that arose in discussion, similar to point 4: some tool participants actively collaborate or have previously collaborated with some benchmark proposers, which can give those tool participants a potential advantage (e.g., they already support the benchmark and do not need to add support for new layer types/tasks, etc.). We will require disclosure of collaborations between tool participants and benchmark proposers, as is standard in handling conflicts of interest (COI) in scholarly publishing/activity, and will collect these in the voting process as well.

As this one is new and had not come up in the discussion above, if there are any concerns, please speak up quickly.

ttj commented 6 months ago

We have updated the rules doc with these changes, available at the link above and here again; if you have any final comments or suggestions, please post soon:

https://docs.google.com/document/d/1bgkx5lnugrwlNzQ2MPRSd47MAkZGJfR9v2jo7oRskd0/edit?usp=sharing

ttj commented 1 month ago

We have shared a draft Overleaf report and are also fixing some execution issues identified by some participants. Please let us know if you did not receive it and we will share it; we have emailed the listservs so far.