stanleybak / vnncomp2023

Fourth edition of VNN COMP (2023)

Benchmark discussion #2

Open stanleybak opened 1 year ago

stanleybak commented 1 year ago

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
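
To make the interface concrete, here is a minimal sketch of such a generator script (the image pool, epsilon range, and function names below are illustrative assumptions, not a required template):

    import random
    import sys

    def generate_instances(seed, num_instances=10):
        # Seed a local RNG so the same seed always reproduces the same instances.
        rng = random.Random(seed)
        image_pool = list(range(10000))              # e.g., indices of test images
        chosen = rng.sample(image_pool, num_instances)
        # Optionally perturb the size of the input set (epsilon) per instance.
        return [(idx, round(rng.uniform(0.01, 0.05), 4)) for idx in chosen]

    if __name__ == "__main__":
        seed = int(sys.argv[1])
        for idx, eps in generate_instances(seed):
            # A real script would emit one .vnnlib file per instance and append
            # a row (onnx file, vnnlib file, timeout) to instances.csv.
            print(idx, eps)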

The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example, suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code. The repository must be structured as follows:
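
Roughly, and as a sketch only (last year's benchmark repositories are concrete examples of this layout):

  • generate_properties.py: a script at the repository root that takes a random seed and regenerates the benchmark instances
  • instances.csv: one row per instance, listing the onnx file, the vnnlib file, and the per-instance timeout
  • onnx/: the networks in .onnx format (compressed as .onnx.gz if large)
  • vnnlib/: the generated .vnnlib property files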

Update: benchmark submission deadline extended to June 2 (was May 29).

ttj commented 1 year ago

Is there an overview of the benchmarks available for nomination? Are benchmarks from previous editions of VNN-COMP automatically available?

Newly proposed benchmarks for this year are available here; you will need to go back through the discussion on this issue to see further details on them:

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks

Last year's benchmarks are here and are detailed in the prior reports:

https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks

We are planning to rerun at least all the benchmarks from VNN-COMP'22 (last year), given the common infrastructure set up last year and so that a comparison across years can be done; however, the scoring mechanism and the nomination process by the tool participants remain the same. Before last year, things were somewhat less standardized and may be harder to incorporate, but the 2021 benchmarks are here:

https://github.com/stanleybak/vnncomp2021/tree/main/benchmarks

Z-Haoruo commented 1 year ago

@Z-Haoruo I could not find which tool you are participating with, can you please say which tool you are nominating on behalf of for the ml4acopf benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf )?

@ttj Of course! The tool I am participating with is called GravityNN. Sorry that I didn't mention this in my benchmark nomination reply.

ttj commented 1 year ago

I have gone over the procedure from last year to see what differences there are and how the confusion arose regarding the requirement that at least 2 tools execute a benchmark for it to be scored (this was required last year and detailed in a variety of posts; I am happy to provide pointers to github posts, emails, etc., as I went through all the materials to confirm; somehow it simply had not made it into the rules document).

The main difference from last year is that some participants this year requested to know the scored benchmarks at this time, rather than after the submission of tools. Last year, the participants selected the scored benchmarks after tools were submitted, presumably after they had time to determine which benchmarks their tool performs well on. This mechanism resulted in all benchmarks being scored last year: because participants did not know a priori which benchmarks would be scored until final submission, they were incentivized to support all the benchmarks to maximize their score.

This is another option we can consider: to allow all the proposed benchmarks to be considered for scoring at this time, with the final selection based on submitted results and nomination at that time, as was done last year. There are of course some differences between the benchmarks and constraints, as raised in a variety of other posts.

It would be great if some other participants could share their perspective on this issue.

ttj commented 1 year ago

@Z-Haoruo I could not find which tool you are participating with, can you please say which tool you are nominating on behalf of for the ml4acopf benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf )?

@ttj Of course! The tool I am participating with is called GravityNN. Sorry that I didn't mention this in my benchmark nomination reply.

Thanks, I have updated the other post with the currently nominated benchmarks for scoring

regkirov commented 1 year ago

Based on some discussions above, there are critical updates to the Collins benchmarks. Please see below.

1) Collins-RUL-CNN @Neelanjana314 @ttj @ChristopherBrix Since the benchmark has also been nominated for this year, we have re-checked it and implemented a fix to the property generator script (some properties had not been negated last year). Please consider fetching this updated version. I have run it through the submission tool and all checks passed. The new version can be taken from here: https://github.com/loonwerks/vnncomp2022 (latest commit)

2) Collins-YOLO-Robustness @KaidiXu @ttj @ChristopherBrix Based on the comments of @KaidiXu, several fixes have been implemented, such as (1) adding declarations of input variables (some were missing) and (2) reducing the output constraints to only specific Y variables (previously we constrained all outputs, most of which had to be kept unchanged; this formulation could have led to trivial solutions). The updated benchmark has been passed through the submission tool. Please fetch the new version here: https://github.com/loonwerks/vnncomp2023 (latest commit)

Thank you for taking the time to consider these updates!

wu-haoze commented 1 year ago

@anwu1219 Hi, most parts of our benchmark are the same as last year. We are very willing to provide you with our models.

For the mscn_2048d.onnx model, you can download it from last year's VNN repo https://github.com/ChristopherBrix/vnncomp2022_benchmarks/blob/main/benchmarks/nn4sys/onnx/mscn_2048d.onnx.gz and we have also updated our repo to include the model; you can find it here: https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx

For the mscn_2048d_dual.onnx model, as it's larger than 100MB, we have to store it with Git LFS; you can find the LFS link in this repo: https://github.com/Khoury-srg/VNNComp22_NN4Sys/blob/master/model/mscn_2048d_dual.onnx or https://github.com/Khoury-srg/VNNComp23_NN4Sys/blob/main/onnx/mscn_2048d.onnx I think the first one is not over its data quota currently.

Or the Google Drive links: mscn_2048d.onnx, mscn_2048d_dual.onnx

Hope it helps!

Thanks a lot! I'm able to download the networks from the Google drive links.

huanzhang12 commented 1 year ago

@ttj Thank you for looking into the discussions about this rule (at least 2 tools executing a benchmark for it to be scored) in previous years. It did not really come into play previously, because it was not hard for many teams to support all benchmarks. It was not documented at all, largely because it did not really matter.

The main difference from last year, is that some participants this year requested to know the scored benchmarks at this time, rather than after submission of tools.

The true difference this year is that there are more hard benchmarks, especially those proposed by people outside of the verification field and from quite diverse application domains. It is almost impractical to solve all these hard benchmarks within such a short timeframe, so it is critical to know which benchmarks will be evaluated. If we don’t know which benchmarks will be included for scoring, choosing which benchmark to work on becomes a gamble. Luck will play a large role in the final outcome, and efforts may not be rewarded. That does not sound right.

they were incentivized to support all the benchmarks to maximize their score.

This undocumented rule might have worked as an incentive previously, but going forward it does more harm than good and does not work as intended. Last year, the benchmark requiring the most effort was the Carvana Unet benchmark, with only three teams completing it. This year (and potentially in future years), quite a few benchmarks require a similar amount of work, and given the very short time remaining, it is possible that some benchmarks will only be completed by one team. Teams would then have no incentive to nominate or work on these hard benchmarks, because they might end up being the lone team working on a given hard benchmark. This will also disappoint benchmark proposers, who put a lot of effort into creating their benchmarks.

This is another option we can consider: to allow all the proposed benchmarks to be considered for scoring at this time, with the final selection based on submitted results and nomination at that time, as what was done last year.

I hope to see that the competition is fair, in the sense that the outcome is not stochastic and is proportional to the effort a team makes, and that it encourages people to tackle challenging novel applications and benchmarks. Thanks for the discussion, and I hope other participants can share their perspectives on this issue as well.

pomodoromjy commented 1 year ago

Hi @ttj , thank you for organizing this valuable competition and providing the opportunity for industry benchmark-only participants.

We proposed one benchmark last year and one benchmark this year. The passion we have for the competition originates from our hope that through our efforts, we can establish a bridge between the industrial sector and the academic world, propel the development of this field in industrial applications, and ultimately empower the industrial sector to create safer and more reliable products.

However, we are somewhat concerned about the current situation. As a non-tool participant, we can't support our own benchmark. If our benchmark doesn't receive sufficient support, competitors will lose the motivation to spend time solving the benchmark we propose, and ultimately our efforts may be in vain (even next year, the benchmark we propose could be ignored by competitors for the same reasons). I believe we need to find a mechanism that ensures fairness and guarantees that each valid benchmark can be thoroughly discussed and studied. Only in this way can a virtuous cycle be produced in future competitions.

ttj commented 1 year ago

Thanks @huanzhang12 @pomodoromjy and all

Thank you for the continued discussion; it would be great if some other participants would also comment, particularly other tool participants.

I understand the differences from last year in terms of the benchmarks and @huanzhang12's perspective, and as no one else seems to have a particularly strong opinion yet, we may end up going with a plan to score all benchmarks, or all benchmarks nominated for scoring; but it is still the weekend, and others may not have had a chance to respond. Another possible option is scoring all proposed benchmarks, not only the nominated ones, given the currently small number of tools planning to submit results (again, it appears to be 5 unless others jump in and post soon). This would alleviate the benchmark proposers' concerns about their benchmarks not being considered, and the tool developers' concerns about wasting time on benchmarks that will not be scored. In the past, we have not had such a disparity between the number of tools (at least as it currently seems) and the number of proposed benchmarks; on the one hand it is great that we have new and exciting benchmarks, but on the competition side it is leading in part to this discussion.

More broadly, there are considerations beyond one team and the benchmark proposers that are arising and that we need to take into account. Currently, there appear to be only 5 tools that plan to submit results, including 2 from organizers (NNV and nnenum); again, about 20 have registered. Why? This can be for a variety of reasons (time, change of plans, etc.), but one reason perhaps is that many of the proposed benchmarks are new and have architectures, layer types, etc., that many tools may not support, instead of more typical architectures and layer types that more tools may already support without requiring tool modification, beyond just getting the benchmarks into the tools. But we will see whether any others nominate benchmarks to be scored if we continue with the current process of nominating benchmarks now, instead of after tool submission as was done last year (see screenshot attached).

This question of new vs. reused benchmarks is perhaps partly a difference between the verification and ML/AI cultures, if one looks at other competitions in the domain: in verification it is more common for benchmarks to be reused each year so that performance changes among tools can be evaluated, whereas in AI/ML, I suppose, the criterion is typically new architectures that perform best (accuracy, etc.). While I understand the possible disappointment of benchmark proposers if their benchmark does not end up being scored this year, it was clearly defined in the rules that benchmarks must be nominated for scoring, and again, I do believe they may get used in future iterations as people have time to look at the benchmarks. (And again, if one needs an incentive in spite of that possible disappointment at this time, please submit a description of the benchmark to AISoLA: https://aisola.org/tracks/c1/ and please email me; the deadlines are flexible and there are different publication outlets [conference paper in LNCS, journal paper in STTT, talk only, etc.].)

As has also arisen in some of the comments, groups could propose architectures for which there are few or no existing tools, and which, as such, would have one or no participants, for instance verification of neural ODEs/PDEs, RNNs, LSTMs, etc. This was in fact the basis of this rule, as the issue arose even in the first iteration of VNN-COMP, when there was discussion about RNNs. In some of these cases, there are other existing events, particularly in domains where various things are still less well defined, such as the usage of neural networks in control systems (e.g., at ARCH-COMP AINNCS).

I imagine the current status of only 5 tools nominating benchmarks for scoring, unless others jump in shortly, is partly a result of the currently nominated benchmarks, and perhaps also of how prior VNN-COMP iterations played out. So it really would be great if other tool participants, even those not planning to submit results now, would comment on why they are not participating (time, cannot support the benchmarks, possible negative publicity, whatever the case). Of course, this small number of tools is also an argument to eliminate the 2-tool results rule, as due to the numbers it currently in essence necessitates that some benchmarks will not be scored (a maximum of 10 benchmarks could be scored currently). So, I think we need to get opinions from other tool developers who plan to submit results, as another possible outcome in the current situation is simply that few or no tools submit any results (again perhaps an argument against the 2-tool results rule, but then is this a competition? Perhaps in the AI/ML sense, but not what is typically considered one in verification, and at least to me, it is more of a challenge).

My original hope was for this to be a challenge and to foster the community, particularly with mechanisms like others submitting great benchmarks, but as has been mentioned, things became more competitive in various ways, and some of this may also be contributing to the current status of 5 planned participating tools, perhaps partly due to the proposed benchmarks that many tools likely cannot support in a few weeks' time.

So, we really need others, particularly other tool participants, to comment on this issue, as I think we reasonably understand the perspective of benchmark proposers and 1 tool group, but of course everyone is more than welcome to continue sharing their opinions and perspectives, especially so that we come to a satisfactory resolution.

Finally, @huanzhang12: you continually refer to this as an undocumented rule, a new rule, adding a rule, etc.; please see the attached screenshot from last year's nomination process email, showing the benchmark nomination process for scoring after tool submission. I believe you participated last year and selected benchmarks through this process, and I think this is documentation (and again, I am happy to provide links to the various github posts about it from over the years). So, please stop referring to things in this way, as the connotation of such remarks, and of others that have been made of a similar nature, is frankly becoming a bit insulting: it was simply an oversight that it was not in the rules document, as has been mentioned several times now; clearly the organizers were under the impression it was, as it has always been a rule from the 1st VNN-COMP. Given the way some comments in some of the posts are being phrased, I do think it is important the community understand the current situation and prior procedure, and that this is not a change.

While I understand the frustration of the situation, let us all please continue to maintain respect so that we can come to a satisfactory resolution. However, as stated, we need more input from more participants, and hopefully we will see what others have to say.

huanzhang12 commented 1 year ago

Thank you @ttj for the insightful discussion, and I greatly appreciate your understanding of the changes in situations this year.

I can understand your point on the difference between traditional verification competitions and this competition. ML/AI is a fast-moving field, and things like the RNNs/LSTMs you mentioned have important applications, so it would be awkward to exclude them from the competition. For example, this year both our team and @feiyang-cai (nnenum) proposed models with Transformer architectures (the ViT and cgan benchmarks), given their recent popularity. I think it is a positive push to strengthen the community and an important move to enable more practical applications of formal methods. I agree with you that this introduces challenges for participants, and that's why we are discussing the rule of at least 2 tools executing a benchmark, since it would lead to undesired consequences in the current situation.

but one reason perhaps is also that many of the proposed benchmarks are mostly new benchmarks that have different architectures, layer types, etc., than many tools may not support, instead of benchmarks with more typical architectures, layer types, etc.

We can potentially consider a “core” set of benchmarks with simple networks (e.g., feedforward networks, simple ResNets etc that have been previously supported by many tools) in the same style as other traditional verification competitions, and a few “advanced” benchmarks with the latest architecture and applications. Scores will be computed on both sets with one winner per track. This will also lower the barrier for tools; at least people can compete on one track, and we can see more comparisons of tool performance in basic settings there.

please stop referring to things in this way, as the connotation of such remarks and others that have been made similarly in nature, is frankly becoming a bit insulting

Apologies for the word choice. To clarify my point, in previous years, this rule did not have any real impact, so it was an oversight that it was not formally added to the documentation. I probably should say “enforcing this informal rule” rather than “adding this undocumented rule”. Sorry if that makes you uncomfortable. (PS I couldn’t find info regarding this rule in the screenshot, but that’s totally fine, I completely understood the rule had been mentioned somewhere in previous years).

Thank you again for the constructive and informative discussion, and I also greatly appreciate the feedback from benchmark proposers @pomodoromjy @merascu @naizhengtan. I hope other people can chime in, too.

ttj commented 1 year ago

Thanks for further discussion @huanzhang12

We can potentially consider a “core” set of benchmarks with simple networks (e.g., feedforward networks, simple ResNets etc that have been previously supported by many tools) in the same style as other traditional verification competitions, and a few “advanced” benchmarks with the latest architecture and applications. Scores will be computed on both sets with one winner per track. This will also lower the barrier for tools; at least people can compete on one track, and we can see more comparisons of tool performance in basic settings there.

This is a good idea, and in the 1st VNN-COMP we did have categories, based in part on architecture type (particularly activation function type). For this year, this may be too drastic a change on the timeframe, but it is a good topic to revisit in the future if unattainable this year. Of course, we could probably re-score all the prior benchmarks from VNN-COMP'22 relatively easily, and this may also be a good compromise. So if anyone else is in favor of this idea, particularly if it would boost tool participation this year, please let us know.

Apologies for the word choice. To clarify my point, in previous years, this rule did not have any real impact, so it was an oversight that it was not formally added to the documentation. I probably should say “enforcing this informal rule” rather than “adding this undocumented rule”. Sorry if that makes you uncomfortable. (PS I couldn’t find info regarding this rule in the screenshot, but that’s totally fine, I completely understood the rule had been mentioned somewhere in previous years).

Thank you, and no problem at all of course; it is hard to tell tone, etc., on the internet when corresponding only via text. I just do not want it to seem like the organizers are trying to unilaterally change the rules or anything, as one could perhaps perceive from some of the comments, and I do realize the situation is frustrating and that the impact of the rule is different this year. And you are correct: having looked again, I grabbed the wrong screenshot, which just covered the nomination process difference from last year (I posted before my morning coffee ;-) ), but at any rate...

Really, we just need further input from others to decide how to go forward and to ensure everyone is agreeable to the resolution. Personally, I am fine with scoring all the nominated benchmarks and eliminating the 2-tool results rule, and I imagine that is the likely outcome, but I want everyone to have had time to comment in case there are other strong opinions, and to ensure everyone has the same understanding. Per the rules, we can also still consider calling a vote if further contention arises, but based on the opinions shared so far, I think there is reasonable clarity that all nominated benchmarks will likely be scored.

Of course, I think there are still at least a couple of un-nominated benchmarks, which differs from prior years, so hopefully others will nominate some soon.

wu-haoze commented 1 year ago

@ttj Thanks for the discussion and organizing the event. I might have missed some of the arguments since this thread is getting very long.

I agree that if we include the 2-tool results rule, it disincentivizes new benchmark submissions; but if we eliminate that rule, it disincentivizes tool submissions (the 2-tool rule is already quite discouraging). Moreover, consistency between different solvers is currently the only way (which is already not ideal) to establish any confidence in the correctness of unsat results.

For these reasons, I like @huanzhang12's suggestion of considering a "core" set of benchmarks and an "advanced" set of benchmarks. Could the "core" set be defined as the benchmarks that are supported by more than half of the submitted tools, and the "advanced" set as the ones that are nominated (or even all proposed benchmarks)?

regkirov commented 1 year ago

Just to share our perspective as an industrial benchmark provider (not a tool participant): I went through the discussion thread and I understand and appreciate the concern raised by @huanzhang12, while at the same time I understand the concerns raised by @ttj. It is very challenging to set up a framework and rules that fit everyone and maximize fairness.

Our objective as a benchmark provider is to inform the VNN community about realistic verification problems and needs that we are facing - to contribute, but also to stimulate the tools to further improve and become able to support such problems. At the same time we are, of course, curious to monitor the results and see which tools, if any, are capable of solving our benchmarks, to facilitate further collaboration. If no one is willing or able to exercise the benchmark this year, I think it would be fair for us to conclude that the tools, at least those participating this year, are not yet ready to address the challenge (they still need significant effort to adapt, are not yet scalable, or something else).

We acknowledge that the benchmark we submitted this year, Collins-YOLO-Robustness, is very challenging - even though there were some tool authors (not participating this year! ;) ) who claimed their tool would be able to handle it. We also acknowledge that such benchmarks are new to the competition, as @huanzhang12 also pointed out. On our end, I do not expect any disappointment or demotivation if the benchmark is not scored this year. We would also be happy to resubmit it to the AISOLA event as a backlog element for future competitions, if this helps.

But of course, if possible, we would be very interested to see any results, or even attempts to run the benchmark within this competition, even if unscored. I am also a bit surprised by the fact that only 5 tools have decided to compete so far, if I understood the discussion above correctly. But it is what it is. We strongly consider this activity useful and important, and we are glad to contribute more benchmarks in the future.

phK3 commented 1 year ago

On behalf of DPNeurifyFV, I nominate

  • ACAS Xu and
  • cGAN

ttj commented 1 year ago

On behalf of DPNeurifyFV, I nominate

  • ACAS Xu and
  • cGAN

Thanks! These 2 are already nominated as per the current list of nominated benchmarks to be scored (ACAS Xu by NNV and cGAN by nnenum):

https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1595747102

I realize there may be some confusion given the long ongoing discussion about the rules and other possible ways to handle the scoring/nomination, but given the current status, perhaps you may want to nominate 2 different benchmarks that have not yet been selected for scoring? If not, though, that is fine.

Neelanjana314 commented 1 year ago

We (the Marabou team) nominate the dist-shift benchmarks and the traffic signs recognition benchmarks for scoring.

Hi @anwu1219, could the onnx models be saved with an earlier opset version, e.g., opset 13? We tried to load the models with onnx (Python) and to save them with an earlier opset version, but that didn't work for many of them. Could we get the original training files (PyTorch, TF) or the models with opset 13? That would help us avoid some of the issues we are having loading the models in MATLAB.
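
For concreteness, the kind of downgrade we attempted looks roughly like the sketch below (file names are placeholders; this conversion step is what fails for many of the models):

    import onnx
    from onnx import version_converter

    # Load a model and try to rewrite its graph against an older opset (13 here).
    model = onnx.load("gtsrb_model.onnx")                    # placeholder file name
    converted = version_converter.convert_version(model, 13)
    onnx.save(converted, "gtsrb_model_opset13.onnx")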

wu-haoze commented 1 year ago

We (the Marabou team) nominate the dist-shift benchmarks and the traffic signs recognition benchmarks for scoring.

Hi @anwu1219, could the onnx models be saved with an earlier opset version, e.g., opset 13? We tried to load the models with onnx (Python) and to save them with an earlier opset version, but that didn't work for many of them. Could we get the original training files (PyTorch, TF) or the models with opset 13? That would help us avoid some of the issues we are having loading the models in MATLAB.

Hi @Neelanjana314 , the traffic sign recognition benchmarks were provided by another group @merascu. Please do let me know if you're having trouble with the dist-shift benchmarks, which are concerned with only one network: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/dist_shift/onnx/mnist_concat.onnx.gz

jferlez commented 1 year ago

Sorry for the late contribution here: I proposed a benchmark but mistakenly failed to nominate it for scoring.

If there's still time, I would like to nominate tllverifybench on behalf of FastBATLLNN.

Neelanjana314 commented 1 year ago

We (the Marabou team) nominate the dist-shift benchmarks and the traffic signs recognition benchmarks for scoring.

Hi @anwu1219, could the onnx models be saved with an earlier opset version, e.g., opset 13? We tried to load the models with onnx (Python) and to save them with an earlier opset version, but that didn't work for many of them. Could we get the original training files (PyTorch, TF) or the models with opset 13? That would help us avoid some of the issues we are having loading the models in MATLAB.

Hi @Neelanjana314 , the traffic sign recognition benchmarks were provided by another group @merascu. Please do let me know if you're having trouble with the dist-shift benchmarks, which are concerned with only one network: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/blob/main/benchmarks/dist_shift/onnx/mnist_concat.onnx.gz

@anwu1219 thanks for your response. We actually needed the lower opset version for the traffic sign benchmark.

@merascu could the onnx models be saved with an earlier opset version, e.g., opset 13?

ttj commented 1 year ago

Sorry for the late contribution here: I proposed a benchmark but mistakenly failed to nominate it for scoring.

If there's still time, I would like to nominate tllverifybench on behalf of FastBATLLNN.

Thanks, I believe this is fine; if anyone has an issue, please raise it. I have updated the post with the currently nominated benchmarks. I assume this is the same as in VNN-COMP'22; otherwise, please state the differences and coordinate with @ChristopherBrix on getting it into the repository for this year, as I don't see it there.

https://github.com/ChristopherBrix/vnncomp2022_benchmarks/tree/main/benchmarks/tllverifybench

https://github.com/stanleybak/vnncomp2023/issues/2#issuecomment-1595747102

jferlez commented 1 year ago

Thanks, @ttj: I really appreciate it. (And also apologies @ChristopherBrix, if this is causing you extra inconvenience.)

The only difference this year is a small change to the way properties are generated: they are now more balanced between SAFE/UNSAFE instances (see commit dc1629a). Otherwise, I only had to move the benchmark files into the root directory of the repository to successfully pass testing (submission 698).

xiangruzh commented 1 year ago

Thanks, @ttj: I really appreciate it. (And also apologies @ChristopherBrix, if this is causing you extra inconvenience.)

The only difference this year is a small change to the way properties are generated: they are now more balanced between SAFE/UNSAFE instances (see commit dc1629a). Otherwise, I only had to move the benchmark files into the root directory of the repository to successfully pass testing (submission 698).

Hi @jferlez , I have a question about the latest updated generate_properties script. When I tried to verify the properties generated by it, it turned out that all the instances are UNSAFE, so I wonder if there's something wrong with it.

Since the output constraints written in the vnnlib files are properties of the counterexamples, according to my understanding, when propDirection is ">=", in order to let SAFE and UNSAFE instances be generated with equal probability, we should push the domain of propThresh smaller instead of larger, and similarly in the opposite situation.

So basically, I'm wondering whether the intended fix is to change "- 0.5 outputWidth" to "+ 0.5 outputWidth" on line 60, and do the opposite on line 62? Thanks.

jferlez commented 1 year ago

Thanks, @xiangruzh: you are correct. As you note, the cases are switched. I have fixed it, and it will be submission 699 as soon as it finishes running. (I have since verified a mixture of SAFE/UNSAFE cases myself.)

ChristopherBrix commented 1 year ago

@shizhouxing The model was too big for GitHub, but the upload to Sciebo (a different platform for file storage) failed. I fixed that; if you pull the latest benchmark repository and run the new setup script, you should get that file.

@anwu1219 Those files aren't supposed to be lfs objects and if I clone the repository they're downloaded just fine. Could you give me some more details to help me reproduce this issue? How do you clone the repository?

@xiangruzh The Yolo benchmark has been updated

@regkirov I've added the Collins-RUL-CNN benchmark (currently running) and updated the Collins-YOLO-Robustness benchmark

@jferlez I've added the updated version of tllverifybench

@ttj I added AcasXu, Collins-RUL-CNN and tllverifybench

I hope I didn't miss anyone - please check if the benchmark repository (https://github.com/ChristopherBrix/vnncomp2023_benchmarks) reflects the correct state of your benchmarks and let me know if something is wrong.

ttj commented 1 year ago

@ttj I added AcasXu, Collins-RUL-CNN and tllverifybench

Thanks! I think ACAS is still missing; maybe the folder was not added?

pomodoromjy commented 1 year ago

Hi @ChristopherBrix , there is a minor issue with the CCTSDB-YOLO benchmark. We have updated our benchmark: https://github.com/pomodoromjy/vnncomp-2023-CCTSDB-YOLO

merascu commented 1 year ago

@Neelanjana314 I added here [1] the directory h5 with the corresponding models, which are now in the onnx directory. I hope this helps.

[1] https://github.com/apostovan21/vnncomp2023

ChristopherBrix commented 1 year ago

@pomodoromjy I've updated the repository

ttj commented 1 year ago

@ttj I added AcasXu, Collins-RUL-CNN and tllverifybench

Thanks! I think ACAS is still missing; maybe the folder was not added?

Sorry, I forgot to mention you @ChristopherBrix; can you please check ACAS when you have a chance?

ChristopherBrix commented 1 year ago

Turns out one actually has to push changes to GitHub, not just commit them locally... Sorry, should be fixed now!

ttj commented 1 year ago

The senior organizers (Stanley Bak, Changliu Liu, and myself) have corresponded regarding the rule that 2 tools must submit results for a benchmark to be scored. Having looked over all the opinions, feedback, and suggestions, and of course in light of the fact that this rule was not listed in the official rules, we recommend eliminating the rule for this year; thus all benchmarks that have been nominated by tool authors will be used for scoring. This is likely the best way to incentivize tool authors to support as many benchmarks as possible and also to enable reasonable fairness given the various opinions. If anyone has a strong opinion otherwise, we are happy to organize a vote as per the rules, but consensus generally seems close to this result. If you would like us to organize a vote to consider other options, either post here indicating so, or, if you do not want to post publicly, email me and we can maintain anonymity.

Some of the other ideas discussed (having categories like core/advanced benchmarks, or benchmarks that are at least some duration old from prior VNN-COMPs, etc.) are certainly great, but we are perhaps too short on time to implement them this year, and they would be rather drastic changes to the rules. We can however highlight things in the report and are open to other suggestions. As we have done in the past with some other controversial issues (overhead timing, for example), we plan to consider scoring in some different ways for presentation/reporting only, but not in any of the official scoring.

As some proposed benchmarks were not nominated for scoring, as I have mentioned, please contact me if you would like to consider some archival publication options (e.g., AISOLA https://aisola.org/tracks/c1/ , but there are other options too). We will still include all benchmarks in the report (if desired) as we start to write it following the presentation at CAV/FOMLAS, and we will also give an overview of them in the presentations, so that the broader community is aware of all of these; we will be in touch about this as we get closer.

For future iterations of VNN-COMP, we will plan to discuss this and the general rules, in part at the workshop and other upcoming events, and as always are welcome to any and all feedback.

xiangruzh commented 1 year ago

Hi @merascu , @apostovan21 , @ChristopherBrix , I noticed that in the gtsrb benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition), the list in instances.csv doesn't match the files in the vnnlib folder. Is it because the seeds used to generate them are different?

KaidiXu commented 1 year ago

Hi @regkirov, thanks for your fix! There is still a minor issue in the vnnlib files: the semicolon ";" is missing for the comments, such as the lines "Input variables", "Output variables", "Input constraints", and "Output constraints".

llylly commented 1 year ago

@HanjiangHu From the models that I looked over, the custom Projection OP is the first "layer" in the networks.

I don't know if it is too late for this, but could you update the files (both onnx and vnnlib) so that the neural networks start after the custom Projection OP? Meaning the input layer is directly connected to the first Conv layer.

And the same for the vnnlib files: instead of defining the input as a 1-dimensional input, create the vnnlib files after the Projection OP, so the input dimensions X (X_0, X_1, ..., X_N) correspond to the input variables of the first Conv layer.

That will avoid having participants add support for a custom function that will not necessarily be used in any future benchmarks and help more participants analyze them.

It also seems that the custom projection operation requires loading a dataset and doing some type of lookup operation to generate the output of that layer? https://github.com/HanjiangHu/metaroom_vnn_comp2023/blob/main/randgen/custom_projection.py#L25

I can't speak for everyone, but when I load those in MATLAB (what we use), I cannot get any information about that layer / operation, just that it is a custom operator, so it will be very very difficult for us to support this benchmark at the moment.

Stanley Bak: I agree that if the spec is done after the custom OP it may be supported by more tools. For it to count as a scored benchmark we'd want at least two tools that support it.

Hi @stanleybak @ChristopherBrix, for Metaroom benchmark, will the version without projection OP or with projection OP be used? I saw in the repo it is the version with projection OP. Thanks!
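
As a quick check, listing the node op types with the onnx Python package shows whether a given model file still contains the custom OP (a minimal sketch; the file path is a placeholder):

    import onnx

    # Collect the operator types used in the ONNX graph.
    model = onnx.load("metaroom_example.onnx")   # placeholder path to a benchmark model
    op_types = sorted({node.op_type for node in model.graph.node})
    print(op_types)
    # Any op_type outside the standard ONNX operator set (e.g., a custom "Projection")
    # means the model still depends on the custom OP.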

regkirov commented 1 year ago

@KaidiXu Thanks! My bad. Fixed, resubmitted. @ChristopherBrix Please consider the new submission.

merascu commented 1 year ago

Hi @merascu , @apostovan21 , @ChristopherBrix , I noticed that in the gtsrb benchmark ( https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition), the list in instances.csv doesn't match the files in the vnnlib folder. Is it because the seeds used to generate them are different?

@xiangruzh We (cc @apostovan21) updated the repo. @ChristopherBrix Please consider the new submission. Thanks all!

[1] https://github.com/apostovan21/vnncomp2023

ChristopherBrix commented 1 year ago

@regkirov I've updated the Collins-YOLO-robustness benchmark.

@merascu I've updated the Traffic signs recognition benchmark.

llylly commented 1 year ago

Hi @stanleybak @ChristopherBrix, for Metaroom benchmark, will the version without projection OP or with projection OP be used? I saw in the repo it is the version with projection OP. Thanks!

Trying to bump this thread: the Metaroom version in the vnncomp2023_benchmarks repo has a custom OP in the onnx model which is hard to parse and use. Instead, the author provided a version without this custom OP: https://github.com/HanjiangHu/metaroom_vnn_comp2023/tree/main/no_custom_OP. Will the official benchmark repo be updated?

merascu commented 1 year ago

@regkirov I've updated the Collins-YOLO-robustness benchmark.

@merascu I've updated the Traffic signs recognition benchmark.

@ChristopherBrix Hi! I just observed that the file instances.csv is still 3 weeks old, but it should also have been updated with the new one from https://github.com/apostovan21/vnncomp2023, together with the other files updated in the last few days.

ChristopherBrix commented 1 year ago

@HanjiangHu Please have a look at the PR I created for your Metaroom benchmark - the generate_properties.py file at the root dir should be the one used for the competition (i.e., the one without the custom OP). Please let me know if the change is ok. In the meantime, I've updated the benchmark collection to use that new version. @llylly Sorry for the delay, the repository is updated

The VGGNet16 benchmark was missing an onnx file, that one is now being downloaded when setup.sh is run.

@merascu When I run python3 generate_properties.py 676744409 I get exactly the instances.csv file from the benchmark collection repository. Your last commit only changed vnnlib files, so what the entries in instances.csv refer to might have changed, but the pure content of that csv file stayed the same. If this is not what you expected, please let me know and update your repository accordingly. I'll then update the benchmark collection.

HanjiangHu commented 1 year ago

@HanjiangHu Please have a look at the PR I created for your Metaroom benchmark - the generate_properties.py file at the root dir should be the one used for the competition (i.e., the one without the custom OP). Please let me know if the change is ok. In the meantime, I've updated the benchmark collection to use that new version. @llylly Sorry for the delay, the repository is updated

@ChristopherBrix Thanks for the PR and the update! I have merged it and resubmitted it on https://vnncomp.christopher-brix.de/benchmark, as well as updated the readme. @llylly Thank you very much for the reminder!

merascu commented 1 year ago

@HanjiangHu Please have a look at the PR I created for your Metaroom benchmark - the generate_properties.py file at the root dir should be the one used for the competition (i.e., the one without the custom OP). Please let me know if the change is ok. In the meantime, I've updated the benchmark collection to use that new version. @llylly Sorry for the delay, the repository is updated

The VGGNet16 benchmark was missing an onnx file, that one is now being downloaded when setup.sh is run.

@merascu When I run python3 generate_properties.py 676744409 I get exactly the instances.csv file from the benchmark collection repository. Your last commit only changed vnnlib files, so what the entries in instances.csv refer to might have changed, but the pure content of that csv file stayed the same. If this is not what you expected, please let me know and update your repository accordingly. I'll then update the benchmark collection.

@ChristopherBrix But in the repo [1] the instances.csv is as old as the vnnlib files. I was careful generating it. Also, I don't understand why the argument 676744409? The readme file says it should be run with argument 42, 0, or 1. But if you take the instances.csv from [1], it should be the correct one.

[1] https://github.com/apostovan21/vnncomp2023

ChristopherBrix commented 1 year ago

The benchmarks were supposed to contain some randomness (by varying the input bounds, picking a subset of instances out of a larger pool, etc.). That's why generate_properties.py is expected to take a seed as the input. This is done so teams don't overfit their tool on the benchmark. As they don't know which specific instances will be scored, their configurations should be able to generalize to all of them. In the most extreme case, this prevents teams from just pre-computing all solutions offline, and then returning them for the scored run.

676744409 was the final seed last year; I just didn't change it for this initial phase. For the final evaluation, all benchmarks will be updated using a yet-unknown seed (picked based on the current Ethereum(?) hash, so unpredictable).

Can you update your benchmark to reflect this? If no randomization is possible, let me know.

apostovan21 commented 1 year ago

Hello Christopher,

I have made some changes; hopefully the script will now run as expected (every time you run it, it should generate 45 new vnnlib files and delete the previous ones if there are any).

Sorry for the late reply!

ChristopherBrix commented 1 year ago

@apostovan21 Thanks, I've updated the benchmark. Please make sure the current version is what you expect. Let me know if it differs!

ChristopherBrix commented 1 year ago

I've added the test benchmark set, fixed two models in the nn4sys benchmark, and fixed the traffic sign benchmark (I previously uploaded it using an incorrect name).

@regkirov Please see the comment in issue #3 regarding the vnnlib file in the collins_rul_cnn benchmark.

regkirov commented 1 year ago

@ChristopherBrix Thanks for the heads up. Fixed, resubmitted.

ChristopherBrix commented 1 year ago

Great, I've updated the repository.

merascu commented 11 months ago

Hi all! @ChristopherBrix, @ttj, @stanleybak Is it possible to check the results somewhere? Maybe by uploading the Zoom recording, if it exists? Thanks! ME

ChristopherBrix commented 11 months ago

The results have been uploaded to https://github.com/ChristopherBrix/vnncomp2023_results