Import Watts tests - Githubissues

jerhard commented 2 years ago

This PR imports the tests from the Watts repository, which seem to be the ones used for this paper: Flow-sensitive composition of thread-modular abstract interpretation. Kusano and Wang.

The benchmark script is executed with update_bench_traces_rel_watts.rb.

TODO:

[x] Check whether some of the assert(0)s should actually be reachable.
[x] Remove duplicates: Most example programs seem to be duplicated, where in the second version the include <assert.h> is commented out.

jerhard commented 2 years ago

I had a look at all the tests where the mutex-meet-tid configuration produces one or more Unknown assertions, or where we consider an assert(0) as potentially reachable. Below is a short description of my investigation. In some cases, the asserts seem to be unknown in the concrete, or reachable locations are marked with assert(0). Still, there are some interesting cases which we cannot do, at least not with mutex-meet-tid. For example intra01/main.c, fk-cmp01/main.c, and flagLock01/main.c.

Failing assertions:

intra01/main.c: Program flow can be excluded by pre-condition that write from other thread already happened and was overwritten

intra01/main2.c: same as intra01/main.c

rev01/main.c Reachable(!) location is annotated with assert(false);

intra01_join/main.c: same as intra01/main.c

thread01/main.c same as intra01/main.c

thread01/main2.c same as intra01/main.c

sync_01_true/main.c write by other thread can be assumed as flag variable written after the write is checked

sync_01_true/main2.c same as sync_01_true/main.c

sync_02_true/main.c same as sync_01_true/main.c

syn_02_true/main2.c same as sync_01_true/main.c

keybISR/main.c similar to sync_01_true/main.c

keybISR/main2.c similar to sync_01_true/main.c

wdt977_02/main.c The test is at fault: clearly the intention was to have an assert(0) guarded by an if statement in lines 302-304, but the if is followed by a semicolon, making the assert unguarded. Thus, the assert(0) is reachable in the concrete.

wdt977_02/main2.c Same as wdt977_02/main.c

Unknown assertions:

fk-cmp01/main.c write by other thread can be assumed as flag variable written after the write is checked

fk2012/main.c The assert statement is wrong (not ensured in the concrete).

flagLock01/main.c Thread_joins on two threads ensure that both must have incremented the counter

i8xx_tco_01/main3.c _The assert in line 354 does not seem to be ensured in the concrete, as the thread main2 may write tco_expectclose directly before the read.

tso_na_01/main.c The assert seems not to be ensured in the concrete. The execution of y = 1; b = x; x = 1; a = y; should result in an assert failure.

sim642 commented 2 years ago

Their smaller non-DDVerify benchmarks come from SV-COMP and I do recall seeing similar ones there, when searching for benchmarks. Since they're largely boolean flag based (and use these to model locks, etc), we obviously won't be able to solve them with just Apron. Looks like they used Z3, which can handle logical relations for them. But given that those are smaller compared to the DDVerify ones (where it sounds like we are more successful), the overall total we can prove might be quite good still, no?

jerhard commented 2 years ago

The test case wdt977_02/main.c were we have 30 failing asserts is actually broken. The assert we report as failing is the following: https://github.com/goblint/bench/blob/d80bb6c3d3f123b871cf7d7f5c22b86c41744afc/watts/wdt977_02/main.c#L302-L304 The intention most likely was to have a guarded assert(0). But the if is followed by a semicolon, so the assert is actually not guarded at all.

jerhard commented 2 years ago

But given that those are smaller compared to the DDVerify ones (where it sounds like we are more successful), the overall total we can prove might be quite good still, no?

Yes, we have failing/unknown assert on 16 out of the 96 benchmarks and succeed on the other 80. And for 5 of those, it is not ensured in the concrete that the assert will succeed.

sim642 commented 2 years ago

Is there anything to still do here or can we merge this such that all the relational traces benchmarks are finally together in one place?

jerhard commented 2 years ago

Yes, we can merge this.

goblint / bench

Import Watts tests #29

Failing assertions:

Unknown assertions: