stanleybak / vnncomp2022


Rules Discussion #1

Open stanleybak opened 2 years ago

stanleybak commented 2 years ago

The 2022 competition rules are available here.

Please discuss points about the rules for the competition here. Last year's rules are here.

Discussion points off the top of my head:

ChristopherBrix commented 2 years ago

Instead of specifying a concrete list of AWS instances, we could define a maximum cost per hour and let everyone pick their instance as they see fit. One could argue that the results would still be comparable, as each tool would use the best hardware for its implementation.

I would propose reusing at least some of last year's benchmarks, to demonstrate how the tools have improved.

In the first year, you were able to compile some cactus plots that clearly showed which tools performed best. Last year, that wasn't included. Was there a specific reason for that? If it was caused by the benchmark designs, we may want to reconsider those details if we want to simplify the evaluation of the results.

stanleybak commented 2 years ago

In terms of cactus plots, I think this was mostly a time issue. The first year, authors submitted their own measurements, so I had time to polish the results analysis, whereas last year we needed to (manually) run everything and so had less time for analysis. I think it's a good idea to bring back the cactus plots.
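For reference, here is a minimal sketch of one common cactus-plot variant (instances solved vs. cumulative runtime); the data and tool names are made up, and the plotting details are just one possible choice:

```python
# Minimal cactus-plot sketch: one curve per tool, x = number of instances
# solved, y = cumulative runtime over the solved instances (sorted).
# The data and tool names below are made up.
import numpy as np
import matplotlib.pyplot as plt

def cactus_plot(results):
    # results: dict mapping tool name -> list of runtimes (s) of solved instances
    for tool, times in results.items():
        times = np.sort(np.asarray(times, dtype=float))
        plt.plot(np.arange(1, len(times) + 1), np.cumsum(times), label=tool)
    plt.xlabel("Number of instances verified")
    plt.ylabel("Cumulative runtime (s)")
    plt.legend()
    plt.tight_layout()
    plt.savefig("cactus.png")

cactus_plot({"tool_a": [1.2, 3.4, 10.0, 40.0], "tool_b": [0.5, 8.0, 25.0]})
```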

Raya3 commented 2 years ago

Last year the possible results were: Holds, Violated, Timeout, Error and Unknown. I want to suggest that, in the case of Violated, tools also have the option to submit a counterexample if one is found.

Thanks, Raya

ChristopherBrix commented 2 years ago

I second that request. I also remember that there were instances where tools disagreed, and it was difficult to judge which result was correct when no counterexamples were generated.

stanleybak commented 2 years ago

The organizers had some discussion on the rules, ahead of the first participant meeting on April 12 at 11am Eastern US time. Here were the ideas discussed:

We want to encourage people from industry to propose benchmarks, so we will allow non-participant benchmarks. For fairness, however, we have to have some mechanism that limits the number of benchmarks from one source. An idea we came up with was to allow two scored benchmarks per participant, and if a specific participant does not have two benchmarks they can adopt one of the non-participant benchmarks as their own and make it a scored benchmark.

The randomness of images for image classification benchmarks was important to prevent overfitting, so we'd want to keep that this year. We weren't sure if there was a good way to do this for other benchmarks, though (for example, ACAS Xu was not randomized last year). If we have more small networks, such as those from RL, we were unsure whether there was an easy way to randomize them, or whether randomization would even be needed.

In terms of conflicts when tools disagree, we would have some standard output format for counterexamples. Tools don't need to implement this, but in the case of conflicting answers, a result would be assumed incorrect if the tool does not provide a counterexample. As some tools convert the networks from .onnx to other libraries or run on a GPU, the reported counterexample may be slightly off, for example within floating-point error. We do not plan to penalize such instances by calling them unsound (say, if the answer is within 10^-5 of a real execution). For the ground truth, we would use the output of the onnx network using onnxruntime on a CPU. This shouldn't be an issue unless the counterexample is right on the safety boundary, so maybe it's not something that will come up.
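For illustration only, here is a minimal sketch of how a reported counterexample could be replayed through the original .onnx network with onnxruntime on a CPU and compared within a small tolerance; the function name, the single-input handling, and the 1e-5 tolerance are assumptions, not part of the rules:

```python
# Sketch: replay a claimed counterexample through the onnx network on CPU
# with onnxruntime and compare against the tool's reported outputs.
# The 1e-5 tolerance and the single-input assumption are illustrative only.
import numpy as np
import onnxruntime as ort

def check_counterexample(onnx_path, x, claimed_y, tol=1e-5):
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    y = sess.run(None, {input_name: x.astype(np.float32)})[0].flatten()
    # The reported outputs should match the CPU reference execution
    # up to floating-point error.
    return np.allclose(y, np.asarray(claimed_y).flatten(), atol=tol)
```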

We would use prepare_instance.sh scripts like last year, where tools can do conversion but shouldn't do any analysis. This would be limited to 60 seconds of runtime. There was some discussion of automated hyperparameter tuning, but we couldn't think of a good way to support it, considering that some tools wouldn't implement it.
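As an illustration, here is a minimal sketch of how that 60-second limit could be enforced around prepare_instance.sh; the argument list shown is an assumption loosely modeled on last year's scripts, not the official interface:

```python
# Sketch: enforce a 60-second wall-clock limit on prepare_instance.sh.
# The argument list ("v1", category, onnx file, vnnlib file) is an assumption
# loosely modeled on last year's scripts, not the official interface.
import subprocess

def run_prepare_instance(script, category, onnx_file, vnnlib_file, limit=60):
    try:
        result = subprocess.run(
            ["bash", script, "v1", category, onnx_file, vnnlib_file],
            timeout=limit,       # conversion only; no analysis allowed here
            capture_output=True,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False             # an over-long conversion counts as a failure
```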

mnmueller commented 2 years ago

Adding to what @stanleybak wrote, we want to propose a standardized format for counterexamples in the style of the VNN-LIB property specification. It includes the assignment to the input variables, the corresponding obtained outputs, and the constraint clause (for disjunctions) that was violated. An example is below:

Property:

; Property with label: 2.

(declare-const X_0 Real)
(declare-const X_1 Real)

(declare-const Y_0 Real)
(declare-const Y_1 Real)
(declare-const Y_2 Real)

; Input constraints:
(assert (<= X_0  0.05000000074505806))
(assert (>= X_0  0.0))

(assert (<= X_1  1.00))
(assert (>= X_1  0.95))

; Output constraints:
(assert (or
    (and (>= Y_0 Y_2))
    (and (>= Y_1 Y_2))
))

And a corresponding counterexample:

; Counterexample with prediction: 1

(declare-const X_0 Real)
(declare-const X_1 Real)

(declare-const Y_0 Real)
(declare-const Y_1 Real)
(declare-const Y_2 Real)

; Input assignment:
(assign (= X_0  0.02500000074505806))

(assign (= X_1  0.97500000000000000))

; Output obtained:
(obtained (= Y_0 -0.03500000023705806))
(obtained (= Y_1  0.32500000072225301))
(obtained (= Y_2  0.02500000094505020))

; Violated output constraints:
(assert (or
    (and (>= Y_1 Y_2))
))

As @stanleybak wrote, we do not expect floating-point errors to become an issue. We therefore think that returning the exact floating-point representation as a rational number is not necessary, and that 16 decimals (as above, and as used by some benchmarks last year) should be more than enough.
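To make the proposal concrete, below is a minimal sketch of how such a counterexample file could be parsed back into input and output vectors. The `assign`/`obtained` keywords follow the example above; the regular expression and the `X_i`/`Y_i` naming are assumptions about what the final format would look like.

```python
# Sketch: parse the proposed counterexample format into numpy arrays.
# The (assign ...) / (obtained ...) keywords follow the example above;
# whitespace handling and the X_i / Y_i naming are assumptions.
import re
import numpy as np

LINE = re.compile(r"\((assign|obtained)\s+\(=\s+([XY])_(\d+)\s+([-+0-9.eE]+)\)\)")

def parse_counterexample(path):
    xs, ys = {}, {}
    with open(path) as f:
        for line in f:
            m = LINE.search(line)
            if m:
                _, var, idx, val = m.groups()
                (xs if var == "X" else ys)[int(idx)] = float(val)
    # Return dense vectors ordered by variable index.
    x = np.array([xs[i] for i in sorted(xs)])
    y = np.array([ys[i] for i in sorted(ys)])
    return x, y
```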

ChristopherBrix commented 2 years ago

Last year, teams could decide whether their tool should be run on a CPU or a GPU instance.

The CPU instance is listed in the "memory optimized" category. This year, we could also offer an instance from the "compute optimized" list:

For GPUs, we could consider offering a multi-GPU setup, if any tool would like to utilize that:

If we want to increase the budget to, e.g., $5/hour, we could also offer:

But I didn't find a suitable GPU equivalent.

mnmueller commented 2 years ago

The CPU instances look very sensible.

However, I think that last year's issue of the very weak CPUs on the GPU instance cannot be addressed with the g5g, as the T4 has only about 1/4 the computational performance of a V100. I think the g5.8xlarge or g5.16xlarge, with an A10G and 32/64 vCPUs at $2.44/hour and $4.10/hour, might be worth looking into, although the A10G's FP64 performance seems much lower than that of the V100.

If we have the AWS credits, we could also get p3.8xlarge (which has 4 V100s and 32 vCPUs) but let people only use one of the GPUs.

huanzhang12 commented 2 years ago

@mnmueller I agree with you that g5.8xlarge is a sensible candidate for the GPU instance. The p3.2xlarge instance used last year indeed has a very weak CPU, as @mnmueller mentioned, and I personally also want to avoid it. Within a reasonable budget, g5.8xlarge seems to be the primary option. I feel g5.16xlarge could be a little expensive for participants to develop/test on AWS, making the competition less accessible.

@ChristopherBrix I think we want to avoid the g5g instance (and any instance type ending with "g") because they are ARM based, not regular x86-64. This can cause a lot of headaches. For a multi-GPU instance, I think g5.12xlarge and p3.8xlarge are the options, although both are expensive ($5.67/hour and $12.24/hour). So in my opinion we should skip the multi-GPU option unless we can get enough AWS sponsorship and also provide some free credits to help each team financially.

@ChristopherBrix The hpc6a.48xlarge instance is a great one, but it was just released this year, and availability could be an issue. In some cases, AWS doesn't approve enough quota, and even with enough quota, the datacenter may run out of capacity and fail to launch the VM. This can cause headaches during evaluation. Can you check whether it is possible to spin up multiple (like 5-10) such instances simultaneously on AWS?

Have we decided on the budget for AWS instances? I feel the budget really depends on the GPU instance we choose, because there are so few GPU options on AWS. We can then choose a CPU instance with a similar cost. For example, if we use the g5.8xlarge GPU instance, then we can use any CPU instances (maybe one compute-optimized and one memory-optimized) that have roughly the same cost ($2.44/hour; +/-10% around that is perhaps ok). We could also allow each team to choose any CPU instance <= $2.44/hour if the organizers are ok with handling the evaluation complexity.

ChristopherBrix commented 2 years ago

So in my opinion we should skip the multi-GPU option

I'm also not sure whether any tools would actually support that feature, so there may not even be a use case for it. I'd be fine with not offering multi-GPU instances.

The hpc6a.48xlarge instance is a great one [...] Can you check whether it is possible to spin up multiple (like 5-10) such instances simultaneously on AWS?

I'll try and report back. (Edit: I've submitted a request to increase my quota so I can test this; it will take up to 2 days.)

We could also allow each team to choose any CPU instance <= $2.44/hour if the organizers are ok with handling the evaluation complexity.

I'm setting up an automated evaluation pipeline, so supporting a variety of instances would not be a problem. It would potentially increase the complexity of the final report, though, so I'm not sure whether we want to support this. @stanleybak

pat676 commented 2 years ago

Hi all.

I also agree with @Raya3 that providing counterexamples should be a priority this year.

Some more points for discussion:

huanzhang12 commented 2 years ago
KaidiXu commented 2 years ago

Hi all,

After the rules discussion meeting today, I'd like to thank you for the great organization! Due to the time limit, though, I think we still don't have a consensus on some controversial issues.

I have some opinions on the GPU/CPU selection per benchmark. I feel it is unfair for teams that support only CPU or only GPU, as they are not able to leverage the benefits of different machines. In particular, most tools are CPU-only and cannot benefit from this at all. Allowing it would bias the results towards the few tools that can essentially take the minimum of their runtimes on the CPU and GPU instances, while most other tools cannot. So I believe each team should keep using the same GPU/CPU instance for all benchmarks.

Also, for undecided rules, maybe the organizers can summarize a few options and start a vote, let each team make its own choice, and go with the majority.

Thank you!

stanleybak commented 2 years ago

keep using the same GPU/CPU instance for all benchmarks

We're going to try to get an instance with a sufficiently powerful CPU and GPU as the primary option. I don't think we need to be able to spin up 5-10 instances... honestly, last year one GPU instance would have been enough, so as long as we can get 1-2 we should be okay (like you said, not all tools will use the GPU one).

For undecided rules, there wasn't too much controversy. There were only minor questions, such as whether we prefer two benchmarks per participant or require 1+1 from industry, or 4-hour vs. 6-hour timeouts. For efficiency, the organizers may just decide these internally in the coming days and post the rules document for this year.

Here were my notes on the discussion that will be the basis for the rules document (feel free to add things if I missed something important):

Here's the slides from the meeting.

stanleybak commented 2 years ago

The 2022 competition rules are available here. Please let us know if something is incorrect. Kaidi's idea for voting on any major changes might be adopted in future years.

The main changes from last year are:

  1. all benchmarks must be randomized based on a seed (a minimal sketch of what this could look like is shown after this list)
  2. participants can nominate two benchmarks (one from an outside group is encouraged)
  3. when the result is sat, a counterexample should be produced that gives concrete values for the inputs and outputs.
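Regarding item 1, here is a minimal sketch of what seed-based randomization could look like for an image-classification benchmark; the dataset size, number of instances, and command-line interface are purely illustrative and not part of the rules:

```python
# Sketch: seed-based randomization of an image-classification benchmark.
# The dataset size, number of instances, and command-line interface are
# purely illustrative; the point is only "seed in, reproducible list out".
import random
import sys

def select_instances(seed, num_images=25, dataset_size=10000):
    rng = random.Random(seed)                    # reproducible for a given seed
    indices = rng.sample(range(dataset_size), num_images)
    # Each selected index would then be turned into a .vnnlib spec file,
    # e.g. an L_inf ball of some radius epsilon around that image.
    return indices

if __name__ == "__main__":
    print(select_instances(int(sys.argv[1])))    # e.g. python generate_instances.py 42
```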
huanzhang12 commented 2 years ago

@stanleybak Thanks for drafting the competition rules! The main changes look good to me.

Regarding AWS instance selection, I recently tested the g3.4xlarge (EDIT: I meant g5, not g3) GPU instance and found some issues with it. Its A10G GPU has quite unbalanced performance: roughly 2x faster single precision, but over 7x slower double precision, compared to last year's V100 (e.g., see the table here). That is fine for inference and training, but for verification, precision and soundness are of paramount importance, and good double-precision support is essential. In particular, we previously found that on large models errors can accumulate across layers and single precision is often insufficient. So, on this machine, tools using double precision (sometimes essential for soundness) can be heavily penalized. Small models would be faster because they are largely CPU bound, but large models are GPU intensive and would be heavily penalized. Many people suggested that we should focus on larger models, so penalizing them might not be a good idea.
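As a toy illustration of the precision point only (random weights, not taken from any benchmark), the following compares the same forward pass in float32 and float64 and prints how far the two drift apart after many layers:

```python
# Toy illustration: the same random ReLU network evaluated in float32 vs.
# float64. Weights are random with He-style scaling; nothing here comes
# from an actual benchmark.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x64 = rng.standard_normal(n)
x32 = x64.astype(np.float32)

for _ in range(depth):
    W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
    x64 = np.maximum(W @ x64, 0.0)                      # float64 layer
    x32 = np.maximum(W.astype(np.float32) @ x32, 0.0)   # float32 layer

diff = np.abs(x64 - x32.astype(np.float64))
print("max abs deviation:", diff.max())
print("relative L2 deviation:", np.linalg.norm(diff) / np.linalg.norm(x64))
```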

I believe we originally wanted to avoid the instance used last year because its slow CPU performance slows down small MNIST models. Due to the limitations of AWS, there is no optimal solution here, but I feel that penalizing small models is better than penalizing large ones, and if we aim to build a standardized and future-proof GPU testing environment, we should not use a GPU with heavily handicapped double-precision performance, as double precision can be essential for certain verification algorithms and/or larger models. So I would personally recommend keeping the same p3.2xlarge instance as last year (with additional benefits like better availability and an easy apples-to-apples comparison to last year).

A minor comment about the CPU instance: with only 128 GB, memory can become a limitation. Running 64 parallel LP solvers (or other external solvers) can quickly eat up memory, and the memory constraint may become a problem for some tools. So it might be good to upgrade to the balanced m5.16xlarge instance (the same 64 vCPUs, but 256 GB memory, $3.07/hour) to match the cost of the p3.2xlarge ($3.06/hour).

@KaidiXu I can see the fairness issue you pointed out, and I agree with you on this. But I also agree with @mnmueller that it is a good idea to tell people that certain small models are faster on CPU. I think one solution is that, if a team supports both CPU and GPU, they can request that the organizers run their tool on both instance types. The final score would be accumulated from only a single instance type per tool (specified before the final evaluation), although the runtimes on both instance types can be reported. That way, people looking at the report will know “if I run this MNIST model on CPU it could be faster”, but it does not affect the fairness of the competition. We could perhaps also set up a special award, such as “Best tool for heterogeneous hardware”, for the tool with the highest average CPU+GPU score.

stanleybak commented 2 years ago

In light of these issues, we could have three instances: p3.2xlarge (GPU), m5.16xlarge (CPU), and g3.4xlarge (Balanced), even though there's a slight difference in cost.

I feel penalizing small models is better than penalizing large ones

This is not obvious to me. There are challenging problems even with small networks, so I don't think focusing exclusively on network size should be the goal. Really, I think we want a variety of benchmarks: some larger than before, and some small ones that are more challenging than in previous years.

ChristopherBrix commented 2 years ago

On Monday, I have a meeting with a sales representative from AWS to increase our quota. Can we agree on some set of instances by then, or should I reschedule?

naizhengtan commented 2 years ago

I want to comment on NVIDIA's A10 GPUs. They are designed for model inference rather than training, so they provide extremely fast low-precision computation (like FP16, or even INT8!), because that's what people use for DNN inference. That said, if you want to use A10s for FP64, it is of course possible; it just may not give you the expected performance, because that's not what they were designed for.

mnmueller commented 2 years ago

I agree that the A10's performance characteristics are not ideal for certification. I believe that last year many benchmarks were certified using FP32 by some participants. In particular, if we were to allow participants to choose instances per benchmark, it might be a nice middle ground between a purely GPU-focused instance and a CPU-only instance.

In any case, I think we should keep the p3.2xlarge as an option.

stanleybak commented 2 years ago

On Monday, I have a meeting with a sales representative from AWS to increase our quota. Can we agree on some set of instances by then, or should I reschedule?

The three instance types will be p3.2xlarge (GPU), m5.16xlarge (CPU), and g3.4xlarge (Balanced). If we can get at least one of each, we'll get by. Three of each would be better, and if they're generous, I don't anticipate we'll need more than five of each.

j29scott commented 2 years ago

Hi all, sorry for the delay; I realize the rules have been finalized.

Our group has been building an algorithm-selection-driven meta-solver, i.e., a solver that leverages machine learning to determine which algorithms to use for a particular instance. These techniques are extremely powerful in practice, but they are often controversial in competition environments.

That said, "static" algorithm selection over benchmark categories already seems to exist in several solvers from last year: I noticed that several solvers used different configurations depending on the type of benchmark, since it was known beforehand. We would like to participate in the competition. Would this be alright?

Based on my interpretation of the rules, it appears to be legal?

mnmueller commented 2 years ago

I think the participant list was finalized on the 15th of April, so the first question is whether we can, in general, accept new participants for this year's iteration.

Regarding the second question, on the permissibility of meta-solvers: I believe it is not entirely clear whether the current rules permit meta-solvers, as the line between "manually" selecting settings per instance and using an "algorithm" is quite blurry; one could make said algorithm so specific (e.g., by building a decision tree) as to achieve the same result. My opinion is that as long as the meta-solver only leverages algorithms developed by the participant, or at least none that are also entered by their original developers, selecting different algorithms would be permissible. If, however, it were to simply choose adaptively between the top-k algorithms from last year, this would violate the spirit of the competition. In either case, I would suggest that individual benchmarks not be constructed to be as diverse as possible (e.g., encompassing completely different networks), since that would heavily favor such dynamic algorithm selection.

What do other people think?

mnmueller commented 2 years ago

Also, I just realized that somewhere in the discussion about the different AWS instances, g3 and g5 got mixed up. We would need the g5.4xlarge, not the g3. The M60s of the g3 are quite weak and would represent a significant step down from last year's V100.

huanzhang12 commented 2 years ago

Ahh yes, I meant g5, not g3, earlier (it was a typo in my comment), thanks @mnmueller for catching this! I hope it was clear from the context and that we got the right instance type reserved.

huanzhang12 commented 2 years ago

And to be exact, we actually need g5.8xlarge (in my comment I tested on the g5.4xlarge because it has the same GPU and is cheaper), but for the competition we need to reserve the 8xlarge (this is mentioned in the rules doc, I think). Apologies for all the confusion!

sergedurand commented 2 years ago

Hi all. We are working on a verifier, but it is not open source at the moment. Is it possible to participate with a non-open-source tool? How would we go about it in that case?

stanleybak commented 2 years ago

In terms of new tools, I think this is still okay despite the earlier registration "deadline". Actually, we plan to have some flexibility in the deadlines to encourage participation; I will update them. Please start submitting benchmarks on the other GitHub issue, though!

@huanzhang12 I updated the rules doc to list the three instance types:

- CPU: m5.16xlarge, $3.072/hour, 64 vCPUs, 256 GB memory
- GPU: p3.2xlarge, $3.06/hour, 8 vCPUs, 61 GB memory, 1x V100 GPU
- Balanced: g5.8xlarge, $2.44/hour, 32 vCPUs, 128 GB memory

@j29scott A new tool is fine, but since it seems to be a meta-tool, we will include its results in the report but may exclude it from the awards this year (maybe we need a new category for something like this next time; please remind us to discuss this at the workshop). Please fill out the Google form to register ASAP.

@sergedurand I think as long as you put things behind a GitHub link and provide the tool competition scripts, you're okay. For example, your tool installation script can use wget to download your tool's compiled executable. This may make things harder for you to debug. As mentioned above, please fill out the tool registration form ASAP.

ChristopherBrix commented 2 years ago

In my meeting with AWS, I requested p3.2xlarge (GPU), m5.16xlarge (CPU), and g3.4xlarge (Balanced), as stated by Stanley back then. I'll pass along our request to change g3.4xlarge to g5.8xlarge.

sergedurand commented 2 years ago

@sergedurand I think as long as you put things behind a GitHub link and provide the tool competition scripts, you're okay. For example, your tool installation script can use wget to download your tool's compiled executable. This may make things harder for you to debug. As mentioned above, please fill out the tool registration form ASAP.

Thanks, we should be able to do it this way. I filled out the form some time ago; the name of the tool is PyRAT.