mldiego opened 8 months ago
From last year, we have modified the specifications (properties) of 3 benchmarks:
Hi, we would like to propose a navigation benchmark where an agent has to reach a goal set while avoiding some unsafe set along the way.
I will upload the files in the next couple of days.
I have some suggestions:
Hello,
We would like to propose a cartpole balancing benchmark. A single pole is mounted on a movable cart, with an upright starting position. The controller's goal is to balance the pole in the center of the track. The safety goal of this 4-dimensional problem would be a stable position after some swing-in time.
https://github.com/verivital/ARCH-COMP2024/assets/159032400/bb2da960-2e76-419b-afde-3835178e803d
We will upload the required files shortly.
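To make the safety goal slightly more concrete, here is a rough sketch of the 4-dimensional state and a generic stabilization-type property suggested by the description above (the exact model, sets, and time bounds come from the benchmark files; the symbols below are placeholders only):

```latex
% Sketch only: the concrete dynamics, initial set, and bounds are defined by the
% benchmark files; t_swing and theta_max are placeholder symbols.
% Assumed 4-dimensional state: cart position p, cart velocity \dot{p},
% pole angle \theta (0 = upright), pole angular velocity \dot{\theta}.
x = \bigl(p,\ \dot{p},\ \theta,\ \dot{\theta}\bigr) \in \mathbb{R}^4,
\qquad
\forall t \in [t_{\mathrm{swing}}, T]:\ \lvert \theta(t) \rvert \le \theta_{\max}
```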
Thanks for the suggestions.
As we are not limited to numbers, we can just output VERIFIED/FALSIFIED/UNKNOWN as text. If the specification was modified, this can be indicated by a star (VERIFIED*), and what was changed is further described in the report. I also want to bring up another point:
> As we are not limited to numbers, we can just output VERIFIED/FALSIFIED/UNKNOWN as text.

I prefer numbers, because tools write verified/Verified/VERIFIED inconsistently. Numbers avoid the capitalization question altogether.

> So we should aim to always offer at least ONNX.
Yes, I wanted to have more formats, not fewer. And my main goal is to have some formats supported across all benchmarks so that you do not need a different parser for a different benchmark.
The submission system now allows for an automatic conversion of figures to LaTeX/TikZ, which would help to unify the look of the report. TikZ figures are high-quality vector graphics (see the CORA figures from the last report) that can easily be changed afterward (axis limits, colors, even combining results from different tools), which would really improve the report overall. If you want your figures to be converted, just provide your figure in JSON (format specified on the website) or as a MATLAB figure. I attached an example JSON file. Let me know if you have any questions.
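For illustration, writing such a JSON file from a tool could look roughly like the sketch below; the field names are purely illustrative assumptions, and the authoritative format is the one specified on the submission website:

```python
import json

# Hypothetical figure description: the field names below are illustrative only;
# the authoritative schema is the one specified on the submission website.
figure = {
    "title": "ACC, relu controller",
    "xlabel": "time [s]",
    "ylabel": "distance [m]",
    "series": [
        {"name": "reachable set bounds",
         "x": [0.0, 0.1, 0.2, 0.3],
         "y": [10.0, 10.2, 10.5, 10.9]},
    ],
}

with open("acc_relu.json", "w") as f:
    json.dump(figure, f, indent=2)
```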
That sounds exciting. I guess I have to wait until the page is online again to see whether the format is clear.
Can you provide the converter from JSON to TikZ as a standalone tool? That would generally be interesting and would help with writing such a JSON writer.
@schillic I have attached the parsers from MATLAB figure to JSON and vice versa; I hope they are helpful to you. The conversion to TikZ is then done using a fork of matlab2tikz, where we apply some tricks so that matlab2tikz can better process the MATLAB figures (e.g., unifying provided time-interval solutions, removing collinear points up to a given precision, etc.). fig2json.zip
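As an aside, the collinear-point removal mentioned above is essentially a polyline simplification; a minimal sketch of the idea (not the actual matlab2tikz code) could look like this:

```python
def remove_collinear(points, tol=1e-9):
    """Drop interior points that are (nearly) collinear with their neighbours.

    Illustrative sketch of the simplification step described above; the real
    matlab2tikz fork may implement this differently.
    """
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = kept[-1], points[i], points[i + 1]
        # Twice the signed triangle area; close to zero means the points are collinear.
        area2 = abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
        if area2 > tol:
            kept.append(points[i])
    kept.append(points[-1])
    return kept


print(remove_collinear([(0, 0), (1, 1), (2, 2), (3, 2)]))  # -> [(0, 0), (2, 2), (3, 2)]
```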
I added a pull request for the new navigation benchmark. https://github.com/verivital/ARCH-COMP2024/pull/2
I also organized the benchmarks into a table in the readme, as I think this makes it clearer which benchmarks were modified, etc. I was also debating whether to specify the dimensions to plot for the figures here, but I guess they rather belong with the respective benchmarks. Let me know if that is sensible.
Thanks! I have merged the pull request.
As mentioned before, we would like to propose a balancing benchmark. I created a pull request containing all files needed for this benchmark. Feel free to comment and/or request any changes needed for this benchmark to be complete.
This question has come up from @SophieGruenbacher, and as it seems the repeatability server (and its instructions) is still down, @toladnertum, can you please answer?
In general, the typical repeatability instructions have been to provide an installation script as a Dockerfile and a script to reproduce everything. I did not use the repeatability server last year, so I am not sure about its interfacing (detailed instructions are here: https://gitlab.com/goranf/ARCH-COMP/-/tree/master/#repeatability-evaluation-instructions-2021-and-2022; we did not typically do automatic table creation in many categories and basically created reachable-set plots).
"as we are preparing our package for the repeatability evaluation and we have a question about the desired output of the tool. In the docs it’s written that "The executable should be accompanied by instructions on how to run the tool, what OS is used, and how to interpret the result of the tool (whether the result is safe or unsafe)“.
So would the following output be what you are expecting?
@SophieGruenbacher you can take a look at the README of our repeatability package from last year here.
To summarize: The server will run the Docker script submit.sh, which should create a file results.csv with the results. I attach an example below.
benchmark,instance,result,time
ACC,relu,verified,0.621668113
ACC,tanh,verified,0.424987778
Airplane,,falsified,5.214569214
This corresponds to this table:
benchmark | instance | result | time
---|---|---|---
ACC | relu | verified | 0.621668113
ACC | tanh | verified | 0.424987778
Airplane | | falsified | 5.214569214
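For illustration, a minimal sketch of a wrapper step that emits such a results.csv (the rows simply mirror the example above; the wrapper structure itself is an assumption):

```python
import csv

# (benchmark, instance, result, runtime in seconds); rows mirror the example above.
rows = [
    ("ACC", "relu", "verified", 0.621668113),
    ("ACC", "tanh", "verified", 0.424987778),
    ("Airplane", "", "falsified", 5.214569214),
]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["benchmark", "instance", "result", "time"])
    writer.writerows(rows)
```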
TLDR: We finally got the ARCH submission system back up and running. 🎉 Before going live on the real URL, I would be happy if you could test the system with a short submission here: https://arch.repeatability.cps.cit.tum.de/frontend @ttj @schillic @SophieGruenbacher. If you already used the submission system last year, you unfortunately have to recreate an account; it will be the same account on the live version later.
Longer explanation: As I already told you, our server got infected by some malware, and we had several issues following that event. These issues should be solved now. The current version is a slightly improved version of last year's system but, e.g., does not have the automatic figure conversion yet. I am not sure if we can make it given the remaining time and the inevitable bugs occurring from such an extension. We could also do a test run of the figure conversion this year in the AINNCS category only, where I convert the JSON files output by each tool locally on my machine and add them to the AINNCS report. We might also provide this conversion as a standalone tool if it turns out well. Further information can be found here: https://arch.repeatability.cps.cit.tum.de/frontend/arch-report
If you have any questions, don't hesitate to contact me!
I submitted our NLN package. Let's see what happens.
Great! How long do you expect the submission to run?
It has finished and I just published the results. Note that you have to do this manually in "My Submissions".
Great. We could also move "My Submissions" into the general header; I guess that would be clearer.
One can now also download the results folder by clicking on the respective name.
I cannot log in with the old account. I thought you said that this account would work. Is that expected?
That was the plan, but unfortunately, we had to clear the database. I think you can just create a new one.
We plan to reuse benchmarks from 2023, but feel free to propose new benchmarks or modifications to existing ones.