verivital/vnn-comp

Submission #8

pat676 opened this issue 4 years ago

pat676 commented 4 years ago

Hi,

Sorry if this information has already been provided, but I can't find detailed submission instructions. If no details have been published yet, my suggestion is this:

1) Fork the repo and create a folder vnn-comp/<toolkit_name>/.

2) The folder should contain:

  • A readme with instructions on how to set up the environment.
  • A shell script, run_benchmarks, to run all the reported benchmarks.

3) The output of the run_benchmarks script should be a LaTeX-compiled PDF file, vnn-comp/<toolkit_name>/results.pdf, containing:

  • 1-2 paragraphs about the toolkit, testing environment (processor, RAM), operating system and other relevant information.
  • The PDF file should contain tables with each image number/verification result/time spent/epsilon.
  • Preferably one plot for each network with the number of safe, unsafe and total cases solved vs. time (ref: #2 (comment)).

4) Create a pull request.

Also, two more questions:

  1. The VNN-COMP website mentions a shared Overleaf project; has this been created yet?
  2. We do not have authorisation at present to deposit the toolkit's source code on GitHub. We plan simply to add a readme with downloading instructions from Imperial's website. Please let us know if there are issues with this.
vtjeng commented 4 years ago

@pat676 how are you planning to programmatically generate the .tex files? I don't know of any options in Julia (which is what my verification tool is written in), so I was planning to simply use PyLaTeX, but I'm wondering whether you know of anything that works better.

pat676 commented 4 years ago

> @pat676 how are you planning to programmatically generate the .tex files? I don't know of any options in Julia (which is what my verification tool is written in), so I was planning to simply use PyLaTeX, but I'm wondering whether you know of anything that works better.

Hi @vtjeng,

For this round, I just created the functionality for generating the .tex tables from scratch and wrote the main.tex, importing these tables manually. In case you want to use the same, I've shared a Dropbox folder with the relevant code for the .tex tables and plots. The input to the "write_table" function is statuses/times/img_idx/epsilons in the same shape as the final LaTeX table. The tables should be at most 50 rows to fit on one page, so for benchmarks with more than 50 cases I suggest splitting them up into several columns.

Edit: The tables and plots may not be suitable for the ACAS and convolutional Oval benchmarks.
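For anyone rolling their own table generation, a minimal sketch of the kind of per-case table such a script might emit is below. The column layout (image index, epsilon, result, time) follows the table contents discussed in this thread; all values and labels are illustrative, not actual results.

```latex
% Minimal sketch only: illustrative values, not actual results.
% A standalone document with the kind of per-case table discussed
% above (image index, epsilon, verification result, time).
\documentclass{article}
\begin{document}
\begin{tabular}{rrlr}
  \hline
  Img & $\epsilon$ & Result & Time (s) \\
  \hline
  0 & 0.020 & Safe    & 1.23 \\
  1 & 0.020 & Unsafe  & 0.87 \\
  2 & 0.020 & Timeout & 900.00 \\
  \hline
\end{tabular}
\end{document}
```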

ttj commented 4 years ago

> Sorry if this information has already been provided, but I can't find detailed submission instructions. If no details have been published yet, my suggestion is this:
>
> 1) Fork the repo and create a folder vnn-comp/<toolkit_name>/.

This is fine, modulo a suggestion below to put things into categories. If participating in only one category, please include things in that category folder. If participating in more than one, putting any additional requirements (code, etc.) shared across categories at the root is fine.

> 2) The folder should contain:
>
>   • A readme with instructions on how to set up the environment.
>   • A shell script, run_benchmarks, to run all the reported benchmarks.

Please put the scripts under the corresponding category, in vnn-comp/category/toolkit_name, e.g., `vnn-comp/CNN/nnv/` for NNV's batch scripts for the CNN category.

> 3) The output of the run_benchmarks script should be a LaTeX-compiled PDF file, vnn-comp/<toolkit_name>/results.pdf, containing:

CSV or similar is also acceptable, but of course having it easily integrated into the report via the source LaTeX would be ideal. Also note the comment above regarding categories, so this would go to, e.g., vnn-comp/category/toolkit_name/results.pdf (or .tex/.csv if that is also produced), concretely vnn-comp/CNN/nnv/results.pdf for the CNN category results.

>   • 1-2 paragraphs about the toolkit, testing environment (processor, RAM), operating system and other relevant information.

There is a section in the Overleaf to add this information.

>   • The PDF file should contain tables with each image number/verification result/time spent/epsilon.
>   • Preferably one plot for each network with the number of safe, unsafe and total cases solved vs. time (ref: #2 (comment)).
>
> 4) Create a pull request.

> Also, two more questions:
>
> 1. The VNN-COMP website mentions a shared Overleaf project; has this been created yet?

Yes, I've shared this with some participants. If you or your collaborators don't have access, please email me with your Overleaf account email, as I didn't want to publicly share an editable link.

> 2. We do not have authorisation at present to deposit the toolkit's source code on GitHub. We plan simply to add a readme with downloading instructions from Imperial's website. Please let us know if there are issues with this.

That is fine.

vtjeng commented 4 years ago

In the results section, what section type do you want to use under subsubsection (e.g., under 4.2.1)? It doesn't look like there's a consistent answer on how to handle that.

stanleybak commented 4 years ago

I made some LaTeX code to allow people to define their tool results in separate files in the Overleaf. They will get added as separate columns in the same table. This way we don't need an external processing step that requires everyone's results, just a single data file per tool/benchmark. See data/acasxu-all-nnenum.txt for an example data file.

ttj commented 4 years ago

> I made some LaTeX code to allow people to define their tool results in separate files in the Overleaf. They will get added as separate columns in the same table. This way we don't need an external processing step that requires everyone's results, just a single data file per tool/benchmark. See data/acasxu-all-nnenum.txt for an example data file.

Thanks! I think this is partly problematic as currently set up: the individual sat/unsat results are not included per tool, so how do we know whether an individual tool was right or wrong, or whether results differed across tools? It would probably have been better to include a CSV or similar for each tool consisting of the result and the time instead of just the time (as I think we had previously discussed). If such files already exist, please point them out.

stanleybak commented 4 years ago

Yes, you're right. Right now it's up to the tool authors to check their results for consistency with the results column.

I think the pgfplotstable package is powerful enough to produce the results column automatically, assuming each tool provides two columns (time/result), and to detect inconsistencies automatically (it supports simple computations, see here), but I didn't have time to play with the package enough to do this. It should also be able to auto-format the times to two decimal places or significant digits for us, rather than having to do it ourselves.

For now, you can define two columns in each .dat file, since the code is just pulling out the first column (`[index] 0`): `\pgfplotstablecreatecol[copy column from table={\dataAcasHardNnenum}{[index] 0}] {nnenum} {\dataAcasHard}`

See data/acasxu-hard-nnenum.txt for an example of this. It should be backwards compatible with tools that just have a single column. Note that you will have to add (unused) column names in the first row of the data file if you do this, since the first row is no longer purely numerical (the result column is a string); otherwise pgfplotstable will treat the first row of the data file as the column names and omit it. There are probably some configuration options to change this.
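For illustration, here is a self-contained sketch of the two-column layout and the column pull described above. The file names, the "mytool" column, and the instance data are hypothetical; only the `copy column from table`/`[index] 0` mechanics mirror the call quoted above, and the precision keys show the two-decimal formatting mentioned there.

```latex
% Sketch only: hypothetical tool/file names and made-up data.
% A per-tool file has time in column 0 and result in column 1, with a
% header row (needed because the result column is a string).
\begin{filecontents*}{acasxu-hard-mytool.dat}
time result
1.23 unsat
0.87 sat
900.00 timeout
\end{filecontents*}
% Placeholder for the shared benchmark table in the report.
\begin{filecontents*}{acasxu-hard-instances.dat}
instance
1
2
3
\end{filecontents*}
\documentclass{article}
\usepackage{pgfplotstable}
\begin{document}
\pgfplotstableread{acasxu-hard-instances.dat}\dataAcasHard
\pgfplotstableread{acasxu-hard-mytool.dat}\dataAcasHardMytool
% Pull the time column (index 0) of the per-tool file into the shared
% table, mirroring the \pgfplotstablecreatecol call quoted above.
\pgfplotstablecreatecol[copy column from table={\dataAcasHardMytool}{[index] 0}]{mytool}{\dataAcasHard}
% Typeset, formatting the tool's times to two decimal places.
\pgfplotstabletypeset[
  columns={instance,mytool},
  columns/mytool/.style={fixed,fixed zerofill,precision=2}
]{\dataAcasHard}
\end{document}
```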

stanleybak commented 4 years ago

I made some scripts to plot the results as discussed in https://github.com/verivital/vnn-comp/issues/2#issuecomment-643642860. I'm still tweaking things, but it should be easy to make them work for each category using the data files used in the tables.

From a quick Google search, it looks like cactus plots usually have time on the y-axis. Is this correct? It seems strange to me, but it's easy enough to switch.
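As an aside, a minimal pgfplots sketch of such a cactus plot is below (the report itself used the gnuplot/Python scripts mentioned later in the thread; this is only an illustration with a hypothetical data file and made-up times): instances are sorted by runtime, the x-axis is the number of instances solved, and the y-axis is time, matching the convention above.

```latex
% Sketch only: hypothetical file name and made-up solve times.
% Cactus plot: sort instances by runtime, then plot cumulative count
% on the x-axis against time on the y-axis.
\begin{filecontents*}{mytool-times.dat}
time
0.5
1.2
3.4
10.0
42.0
\end{filecontents*}
\documentclass{article}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\pgfplotsset{compat=1.17}
\begin{document}
\pgfplotstableread{mytool-times.dat}\mytoolTimes
\pgfplotstablesort[sort key=time]{\sortedTimes}{\mytoolTimes}
\begin{tikzpicture}
  \begin{axis}[xlabel={instances solved}, ylabel={time (s)}, ymode=log]
    % x = rank of the instance after sorting, y = its solve time
    \addplot table [x expr=\coordindex+1, y=time] {\sortedTimes};
  \end{axis}
\end{tikzpicture}
\end{document}
```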

vtjeng commented 4 years ago

@ttj I've been working off of @pat676's suggestion to generate a PDF report for each benchmark. Should I be generating a CSV with one column, as suggested by @stanleybak, instead now? What is the convention for the ordering of the rows for each of the benchmarks, to make sure that the results are compatible?

ttj commented 4 years ago

> @ttj I've been working off of @pat676's suggestion to generate a PDF report for each benchmark. Should I be generating a CSV with one column, as suggested by @stanleybak, instead now? What is the convention for the ordering of the rows for each of the benchmarks, to make sure that the results are compatible?

Please take a look at the current Overleaf draft; if anyone needs access, please let me know the email address to invite. In the interest of time, I would say either is okay for now, but if possible, it would be nice to have these aggregated into the existing columns that appear in the current draft, as this will make it easier to compare things (and the ordering is defined there).

vtjeng commented 4 years ago

@stanleybak this is a minor issue, but I haven't been able to figure out how to right-align my results. Let me know whether I need to change something about my input data.

stanleybak commented 4 years ago

I've just been manually setting each column with `display columns/4/.style={column type={r}}` in the table, where 4 is the column number and 'r' means right-align. There's probably a way to set a default, but I couldn't find it right away and this works well enough.
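For context, a complete (if artificial) call using that key might look like the sketch below; the inline table and the column index are purely illustrative.

```latex
% Sketch only: made-up inline data; the point is the display columns
% key from the comment above.
\documentclass{article}
\usepackage{pgfplotstable}
\begin{document}
\pgfplotstabletypeset[
  row sep=\\,
  % right-align the displayed column with index 1 ('r' = right);
  % repeat the key with other indices for further columns
  display columns/1/.style={column type={r}}
]{
instance time \\
1 1.23 \\
2 900.00 \\
}
\end{document}
```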

vtjeng commented 4 years ago

I'm not sure I'll manage to get my code working in time on the CIFAR10 CNN networks from #3 that @GgnDpSngh proposed. How should I report it in that case?

stanleybak commented 4 years ago

I would say just skip it in the tables and maybe report in the description which ones you analyzed, although @ttj can confirm. It's an imperfect comparison for a number of reasons (different evaluation platforms, for one). We can discuss how to better arrange things for next year during the workshop.

ttj commented 4 years ago

> I would say just skip it in the tables and maybe report in the description which ones you analyzed, although @ttj can confirm. It's an imperfect comparison for a number of reasons (different evaluation platforms, for one). We can discuss how to better arrange things for next year during the workshop.

Yes, that's fine, and I'll make a note of this in the presentation as well.

ttj commented 4 years ago

@FlorianJaeckle could you please add the computer details (CPU/RAM) used to Oval's participant info in the report?

vtjeng commented 4 years ago

@ttj someone left a comment about providing details about the participants; where should we do this? (In my case it was just me).

ttj commented 4 years ago

> @ttj someone left a comment about providing details about the participants; where should we do this? (In my case it was just me).

I added that comment to help make sure no one goes unacknowledged in the talk, and for finalizing the report. I added placeholders in each tool's description subsection and updated those I knew offhand; please check and update as needed.

vtjeng commented 4 years ago

Thanks @ttj. @stanleybak: after finally fixing my bug, I just added some results for MNIST 0.3 and CIFAR 8/255 on the GGN-CNN networks; I hope you can update the cactus plot with that information tomorrow!

stanleybak commented 4 years ago

Done! I'll probably do one last update with the latest data at noon today (the deadline Taylor set) in case anyone else made changes. Also, if people see mistakes in the plots, let me know.

The code to make the plots is in figure/plot_code if anyone is interested. You can download the project .zip file (from the history tab in Overleaf), then go to the plot_code directory and run `python3 make_gnuplot_inputs.py ../../data/` followed by `gnuplot make_plots.gnuplot`.

vtjeng commented 3 years ago

Thanks for the instructions @stanleybak.

@ttj, where will the final copy of the report be made available? (I'd like to share the report publicly when it is appropriate to do so).

GgnDpSngh commented 3 years ago

Hi there,

As @vtjeng mentioned, it would be good to release the report soon. Also, in the report, we should rename the "GGN-CNN" benchmarks to "COLT-CNN", as they are taken from a project in which I was not involved. I can make this change if needed.

Cheers, Gagandeep Singh

stanleybak commented 3 years ago

I agree an "official" release of the report would be useful.

Also @GgnDpSngh, do you have the scripts to run ERAN on the VNN-COMP benchmarks? I didn't see them in the repository.

GgnDpSngh commented 3 years ago

Hi Stanley,

Yes, I will send a pull request by Thursday. Does it have to be a Docker image, or do the scripts to run ERAN on the benchmarks suffice?

Cheers, Gagandeep Singh

stanleybak commented 3 years ago

I think the original instructions said to use a Dockerfile for the setup, although @ttj can answer if there were any changes.

GgnDpSngh commented 3 years ago

There seems to be an issue with using the Gurobi license inside the Docker image that I have not been able to fix yet (maybe someone here knows how to resolve it), but I have created a version of ERAN for reproducing the competition results here:

https://github.com/GgnDpSngh/ERAN-VNN-COMP

I can send a pull request if @ttj is fine with it.

Cheers, Gagandeep Singh

vtjeng commented 3 years ago

I've submitted #18 containing the code that I used for this submission. (No substantial changes to the logic; I primarily did some cleanups to the README and removed some initial code I had written to generate reports.)

ttj commented 3 years ago

Thanks, and sorry for the slow follow-up; I have been swamped getting ready for classes. I merged #18. If any changes are needed to the report, such as updating the benchmark names mentioned above, please make them; otherwise I will add some conclusions and finalize it in the next 1-2 weeks. There are no official proceedings, so if you need to cite things, just cite this repository or the VNN website.

stanleybak commented 3 years ago

Is a copy of the report PDF available?