verivital / ARCH-COMP2024

ARCH AINNCS Friendly Competition - Benchmark Description

Benchmark Discussion #1

Open · mldiego opened 6 months ago

mldiego commented 6 months ago

We plan to reuse benchmarks from 2023, but feel free to propose new benchmarks or modifications to existing ones.

mldiego commented 4 months ago

Compared to last year, we have modified the specifications (properties) of 3 benchmarks:

toladnertum commented 4 months ago

Hi, we would like to propose a navigation benchmark where an agent has to reach a goal set while avoiding some unsafe set along the way.

I will upload the files in the next couple of days.

schillic commented 4 months ago

I have some suggestions:

  1. About the submission system https://arch.repeatability.cps.cit.tum.de/frontend/submissions (at the time of writing, the page was down, but it was working until recently):
    1. We should agree on instance names so that a comparison of the tools is easier.
    2. Should an unsuccessful analysis / unknown result be reported in the table? Currently we use the values 0 and 1. But then you cannot distinguish between verified/falsified/unknown, so I think it would be good to also have a third value (probably -1). For instance, JuliaReach currently sometimes uses a smaller initial set to still be able to verify a model, and then uses the instance name "small X0" to indicate that. But this may be misleading when comparing tools in the table, and may go against item 1.i above (in our case currently it does not, though). Is there an opinion on that? I am fine keeping it as it is, but wanted to bring it up.
  2. On the report: We need to agree on what time to report in Table 2 for the VCAS model (could be the sum or the average for a scenario that all/most tools can deal with).
  3. On the network formats: Currently these are mixed. Most models have a .mat file and an ONNX file. I suggest having at least one alternative readable network format for the relatively simple architectures we currently use. Sometimes we have networks in .nnet format, which is readable but only applies to ReLU networks. I propose the format used by POLAR, which is a direct extension of the Sherlock format to arbitrary activations. In JuliaReach we can read all formats used here (except for the more complex ONNX models) and also have a printer for the POLAR format, so I can offer to provide the networks in that format.
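
To give a rough idea of what such a flat text format looks like, here is a minimal Python reading sketch. The exact field order (header with input/output/hidden-layer counts, per-layer sizes, one activation name per layer, then flattened weights and biases) is an assumption on my part and should be checked against the POLAR documentation; only the general idea of a simple token-per-line text file is taken from the formats discussed above.

# Hypothetical reader for a Sherlock/POLAR-style network file (Python).
# ASSUMED layout, to be checked against the POLAR docs: one token per line:
# n_inputs, n_outputs, n_hidden_layers, the hidden layer sizes, one activation
# name per layer, then for each layer its weight matrix (row by row) and biases.
import numpy as np

def read_polar_like_network(path):
    with open(path) as f:
        tokens = f.read().split()
    pos = 0
    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    n_in, n_out, n_hidden = int(take()), int(take()), int(take())
    sizes = [n_in] + [int(take()) for _ in range(n_hidden)] + [n_out]
    activations = [take() for _ in range(n_hidden + 1)]  # e.g. ReLU, tanh, sigmoid

    layers = []
    for i in range(len(sizes) - 1):
        rows, cols = sizes[i + 1], sizes[i]
        W = np.array([float(take()) for _ in range(rows * cols)]).reshape(rows, cols)
        b = np.array([float(take()) for _ in range(rows)])
        layers.append((W, b, activations[i]))
    return layers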
manuelkran commented 4 months ago

Hello,

We would like to propose a cartpole balancing benchmark. A single pole is mounted on a movable cart, with an upright starting position. The controller's goal is to balance the pole in the center of the track. The safety goal of this 4-dimensional problem would be a stable position after some swing-in time.


https://github.com/verivital/ARCH-COMP2024/assets/159032400/bb2da960-2e76-419b-afde-3835178e803d

We will upload the required files shortly.
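
For orientation only, below is a minimal Python sketch of the standard frictionless cart-pole model commonly used for this kind of 4-dimensional balancing task (state: cart position, cart velocity, pole angle, pole angular velocity). The parameter values and the model itself are assumptions for illustration; the authoritative dynamics and controller are those in the uploaded benchmark files.

# Minimal sketch of the classic frictionless cart-pole model (assumed for
# illustration; the benchmark files define the actual dynamics and parameters).
import math

M_CART, M_POLE, HALF_LEN, G = 1.0, 0.1, 0.5, 9.81  # assumed parameters

def cartpole_derivatives(state, force):
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total_mass = M_CART + M_POLE
    temp = (force + M_POLE * HALF_LEN * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (G * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - M_POLE * HALF_LEN * theta_acc * cos_t / total_mass
    return (x_dot, x_acc, theta_dot, theta_acc)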

toladnertum commented 4 months ago

Thanks for the suggestions.

  1. The submission system should be up and running again very soon. Our VM running the system got infected by a virus, so we took it down temporarily.
    i. I agree. @mldiego, can you create a table somewhere where we specify the exact names of the benchmark/instance combinations?
    ii. As we are not limited to numbers, we can just output VERIFIED/FALSIFIED/UNKNOWN as text. If the specification was modified, this can be indicated by a star (VERIFIED*), and what was changed is further described in the report.
  2. I think the VCAS benchmark could also be reduced to two or three instances, as it is currently effectively eight times the same thing with minor changes; those instances could then be reported individually.
  3. Regarding the network formats: Of course, the more we offer, the merrier. I think ONNX is widely used also outside our community, and our controllers usually don't have complex architectures. So we should aim to always offer at least ONNX.

I also want to bring up another point:

  1. The submission system now allows for an automatic conversion of figures to LaTeX/TikZ, which would help to unify the look of the report. TikZ figures are high-quality vector graphics (see the CORA figures in the last report) that can easily be changed afterward (axis limits, colors, even combining results from different tools), which would really improve the report overall. If you want your figures to be converted, just provide your figure in JSON (format specified on the website) or as a MATLAB figure. I attached an example JSON file. Let me know if you have any questions. cora_01_Unicycle_01.json
schillic commented 4 months ago

As we are not limited to numbers, we can just output VERIFIED/FALSIFIED/UNKNOWN as text.

I prefer numbers because

  1. I do not like all-caps too much, and some tools may write verified/Verified/VERIFIED inconsistently. Numbers avoid the capitalization question altogether.
  2. Ideally, this column should be consistent across categories.
  3. Numbers are backward-compatible with previous years.

So we should aim to always offer at least ONNX.

Yes, I wanted to have more formats, not fewer. And my main goal is to have some formats supported across all benchmarks so that you do not need a different parser for a different benchmark.

The submission system now allows for an automatic conversion of figures to LaTeX/TikZ, which would help to unify the look of the report. [...] If you want your figures to be converted, just provide your figure in JSON (format specified on the website) or as a MATLAB figure. [...]

That sounds exciting. I guess I have to wait until the page is online again to see whether the format is clear.

Can you provide the converter from JSON to TikZ as a standalone tool? That would generally be interesting and would help with writing such a JSON writer.

toladnertum commented 4 months ago

@schillic I have attached the parsers from MATLAB figure to JSON and vice versa. I hope they are helpful to you. The conversion to TikZ is then done using a fork of matlab2tikz, where we apply some tricks so that matlab2tikz processes the MATLAB figures better (e.g., unifying provided time-interval solutions, removing collinear points up to a given precision, etc.). fig2json.zip
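
As an aside, the collinear-point removal mentioned above boils down to a simple tolerance test; a minimal Python sketch of the idea (illustrative only, not the actual matlab2tikz fork code) would look like this:

# Sketch of collinear-point removal up to a given precision: drop a middle
# point whenever the triangle it forms with the previously kept point and the
# next point is (nearly) flat.
def prune_collinear(points, tol=1e-9):
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = kept[-1], points[i], points[i + 1]
        area2 = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))  # twice the triangle area
        if area2 > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept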

toladnertum commented 4 months ago

I added a pull request for the new navigation benchmark. https://github.com/verivital/ARCH-COMP2024/pull/2

I also organized the benchmarks into a table in the README, as I think this makes it clearer which benchmarks were modified, etc. I was also debating whether to specify the dimensions to plot for the figures here, but I guess they rather belong to the respective benchmarks. Let me know if that is sensible.

ttj commented 4 months ago

Thanks! I have merged the pull request.

manuelkran commented 4 months ago

As mentioned before, we would like to propose a balancing benchmark. I created a pull request containing all files needed for this benchmark. Feel free to comment and/or request changes needed for this benchmark to be complete.

ttj commented 4 months ago

This question has come up from @SophieGruenbacher, and as it seems the repeatability server (and its instructions) is still down, @toladnertum, can you please answer?

In general, the typical repeatability instructions have been to provide an installation script as a Dockerfile and a script to reproduce everything. I did not use the repeatability server last year, so I am not sure about its interface (detailed instructions are here; we did not typically do automatic table creation in many categories and basically created reachable-set plots: https://gitlab.com/goranf/ARCH-COMP/-/tree/master/#repeatability-evaluation-instructions-2021-and-2022).

"as we are preparing our package for the repeatability evaluation and we have a question about the desired output of the tool. In the docs it’s written that "The executable should be accompanied by instructions on how to run the tool, what OS is used, and how to interpret the result of the tool (whether the result is safe or unsafe)“.

So would the following output be what you are expecting?

schillic commented 4 months ago

@SophieGruenbacher you can take a look at the README of our repeatability package from last year here.

To summarize: The server will run the Docker script submit.sh, which should create a file results.csv with the results. I attach an example below.

benchmark,instance,result,time
ACC,relu,verified,0.621668113
ACC,tanh,verified,0.424987778
Airplane,,falsified,5.214569214

This corresponds to this table:

benchmark  instance  result     time
ACC        relu      verified   0.621668113
ACC        tanh      verified   0.424987778
Airplane             falsified  5.214569214
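
For tool authors preparing a package, a minimal Python sketch of how submit.sh could produce such a results.csv is given below; the column layout follows the example above, while run_instance and the benchmark list are hypothetical placeholders for whatever your tool actually runs.

# Minimal sketch of writing results.csv in the format shown above
# (benchmark,instance,result,time); run_instance() is a hypothetical placeholder.
import csv, time

CASES = [("ACC", "relu"), ("ACC", "tanh"), ("Airplane", "")]

def run_instance(benchmark, instance):
    # Call your tool here and return "verified", "falsified", or "unknown".
    return "unknown"

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["benchmark", "instance", "result", "time"])
    for benchmark, instance in CASES:
        start = time.perf_counter()
        result = run_instance(benchmark, instance)
        writer.writerow([benchmark, instance, result, f"{time.perf_counter() - start:.6f}"])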
toladnertum commented 4 months ago

TL;DR: We finally got the ARCH submission system back up and running. 🎉 Before going live on the real URL, I would be happy if you could test the system with a short submission here: https://arch.repeatability.cps.cit.tum.de/frontend (@ttj @schillic @SophieGruenbacher). If you already used the submission system last year, you unfortunately have to recreate your account; this will then be the same account on the live version later.


Longer explanation: As I already told you, our server got infected by some malware, and we had several issues following that event. These issues should be solved now. The current version is a slightly improved version of last year's system, but, e.g., it does not have the automatic figure conversion yet. I am not sure if we can finish that given the remaining time and the inevitable bugs arising from such an extension. We could also do a test run of the figure conversion this year in the AINNCS category only: I would convert the JSON files output by each tool locally on my machine and add them to the AINNCS report. We might also provide this conversion as a standalone tool if it turns out well. Further information can be found here: https://arch.repeatability.cps.cit.tum.de/frontend/arch-report

If you have any questions, don't hesitate to contact me!

schillic commented 4 months ago

I submitted our NLN package. Let's see what happens.

toladnertum commented 3 months ago

Great! How long do you expect the submission to run?

schillic commented 3 months ago

It has finished and I just published the results. Note that you have to do this manually in "My Submissions".

toladnertum commented 3 months ago

Great. We could also move "My Submissions" into the general header; I guess that would be clearer.

One can now also download the results folder by clicking on the respective name.

toladnertum commented 3 months ago

We are live: https://arch.repeatability.cps.cit.tum.de/frontend/getting-started

schillic commented 3 months ago

I cannot log in with the old account. I thought you said that this account would work. Is that expected?

toladnertum commented 3 months ago

That was the plan, but unfortunately, we had to clear the database. I think you can just create a new one.