sosy-lab / sv-benchmarks

Collection of Verification Tasks (MOVED, please follow the link)
https://gitlab.com/sosy-lab/benchmarking/sv-benchmarks

Mislabeled (buggy) LDV benchmarks #297

Closed by zvonimir 7 years ago

zvonimir commented 7 years ago

Here is a list of LDV benchmarks that almost certainly access unallocated memory; such an access can overwrite anything, which ultimately leads to a bug being reported:

ldv-challenges/linux-3.14__linux-usb-dev__drivers-media-usb-hdpvr-hdpvr_true-unreach-call.cil.c
ldv-challenges/linux-3.14__linux-alloc-spinlock__drivers-net-xen-netfront_true-unreach-call.cil.c
ldv-consumption/32_7a_cilled_true-unreach-call_linux-3.8-rc1-32_7a-drivers--pcmcia--pcmcia.ko-ldv_main2_sequence_infinite_withcheck_stateful.cil.out.c
ldv-linux-3.16-rc1/205_9a_array_safes_linux-3.16-rc1.tar.xz-205_9a-drivers--net--usb--rtl8150.ko-entry_point_true-unreach-call.cil.out.c
ldv-challenges/linux-3.14__complex_emg__linux-drivers-clk1__drivers-net-ethernet-ethoc_true-unreach-call.cil.c
ldv-challenges/linux-3.8-rc1-32_7a-drivers--md--md-mod.ko-ldv_main0_sequence_infinite_withcheck_stateful_true-unreach-call.cil.out.c
ldv-challenges/linux-3.14__complex_emg__linux-drivers-clk1__drivers-net-ethernet-smsc-smsc911x_true-unreach-call.cil.c
ldv-consumption/32_7a_cilled_true-unreach-call_linux-3.8-rc1-32_7a-sound--core--seq--snd-seq.ko-ldv_main2_sequence_infinite_withcheck_stateful.cil.out.c
ldv-challenges/linux-3.14__linux-drivers-clk1__drivers-net-ethernet-smsc-smsc911x_true-unreach-call.cil.c
ldv-linux-3.7.3/linux-3.10-rc1-43_1a-bitvector-drivers--atm--he.ko-ldv_main0_true-unreach-call.cil.out.c
ldv-challenges/linux-3.14__complex_emg__linux-alloc-spinlock__drivers-net-xen-netfront_true-unreach-call.cil.c
ldv-challenges/linux-3.14__linux-drivers-clk1__drivers-net-ethernet-cadence-macb_true-unreach-call.cil.c
ldv-linux-4.2-rc1/linux-4.2-rc1.tar.xz-08_1a-drivers--media--platform--exynos4-is--s5p-fimc.ko-entry_point_true-unreach-call.cil.out.c
ldv-linux-3.16-rc1/43_2a_bitvector_linux-3.16-rc1.tar.xz-43_2a-drivers--net--xen-netfront.ko-entry_point_true-unreach-call.cil.out.c
ldv-consumption/32_7a_cilled_true-unreach-call_linux-3.8-rc1-32_7a-drivers--media--dvb-core--dvb-core.ko-ldv_main5_sequence_infinite_withcheck_stateful.cil.out.c
ldv-linux-4.2-rc1/linux-4.2-rc1.tar.xz-08_1a-drivers--video--fbdev--core--fb.ko-entry_point_true-unreach-call.cil.out.c
ldv-linux-4.2-rc1/linux-4.2-rc1.tar.xz-32_7a-drivers--net--usb--r8152.ko-entry_point_true-unreach-call.cil.out.c
ldv-consumption/32_7a_cilled_true-unreach-call_linux-3.8-rc1-32_7a-drivers--gpu--drm--ttm--ttm.ko-ldv_main5_sequence_infinite_withcheck_stateful.cil.out.c
ldv-linux-3.4-simple/32_1_cilled_true-unreach-call_ok_nondet_linux-3.4-32_1-drivers--media--dvb--dvb-usb--dvb-usb-az6007.ko-ldv_main0_sequence_infinite_withcheck_stateful.cil.out.c
ldv-challenges/linux-3.14__complex_emg__linux-kernel-locking-mutex__drivers-media-common-saa7146-saa7146_vv_true-unreach-call.cil.c
ldv-challenges/linux-3.14__complex_emg__linux-drivers-clk1__drivers-net-ethernet-cadence-macb_true-unreach-call.cil.c

Now that we have removed the simple memory model requirement, we should address all of these issues somehow. How should we go about this?

The simple answer is: we should just fix the benchmarks. However, finding the violating memory accesses and figuring out how to properly allocate the memory is very time consuming. Based on my experience from last year, it takes me several hours per benchmark to do this properly, if not more.

Alternatively, we could change the labels of these benchmarks to false. To start with, it would be great if others could confirm whether or not they are observing the same issues with the above benchmarks.

Finally, we could exclude these benchmarks from the competition this year until they get fixed.

Please let us know your thoughts.

mutilin commented 7 years ago

Each year we have the same problem with undefined behaviour in the LDV benchmarks. As I remember, @dbeyer said (27.09.2015):

Note that the specification is that a call to __VERIFIER_error() should never occur. A verifier should not terminate immediately with FALSE if it encounters an odd memory write.

If the specification is that no uninitialized pointer is written to, then the verifier should answer FALSE, definitely. We could include a property that looks for such errors, but there are verification tasks that have such writes but were made for other intended verification challenges.

The benchmarks do not contain a violation of the reachability property for __VERIFIER_error(). Hence we cannot change the labels of these benchmarks to false.

Earlier you proposed (@zvonimir, 30.10.2016):

I suggest we remove it from both Simple and DeviceDrivers categories. SMACK has never leveraged the simple memory model, and yet at this point it really does not report almost any false bugs in these categories. So it should be pretty safe to remove the simple memory model requirement from these two categories. We can fix any potential issues this might cause as we move along since I believe there might be only a handful of these.

I think that the removal of the Simple memory model is the right step, but not a solution to the problem. I see a way to resolve it in the future by defining the meaning of external functions; see the proposal at https://docs.google.com/document/d/1T2j3asG6wl1FM28JXGTmrri8nbnDMdPUVtCB0ZMBK4Q and the sketch below. For today, I think the only way to deal with the LDV benchmarks is to count only the paths leading to a violation of the specified property, not the other ones.
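
To illustrate what defining the meaning of external functions could look like, here is a minimal hedged sketch of an allocator stub. The stub name ldv_malloc and its failure semantics are assumptions for illustration, not taken from the linked proposal.

#include <stdlib.h>

extern int __VERIFIER_nondet_int(void);

/* Hypothetical stub: gives the external allocator a defined meaning
   by either failing or returning a fresh, valid block. */
void *ldv_malloc(size_t size) {
  if (__VERIFIER_nondet_int())
    return NULL;          /* model a possible allocation failure */
  return malloc(size);    /* otherwise the allocation succeeds */
}

With such a stub, a verifier no longer has to guess what the external function does, and every path through the stub has defined behavior.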

zvonimir commented 7 years ago

I don't understand your comment. These benchmarks do contain paths that lead to __VERIFIER_error. Here is a simple example illustrating this problem:

#include <stdlib.h>

extern void __VERIFIER_error(void);

int main(void) {
  int *x = malloc(sizeof(int));
  int *y;                            /* uninitialized pointer */
  *x = 10;
  *y = 20;                           /* write through uninitialized y: undefined behavior */
  if (*x != 10) __VERIFIER_error();  /* reachable if y happens to alias x */
}

Does this benchmark contain a path leading to __VERIFIER_error? In my opinion it does, since y can alias x. The majority of the issues we encounter in the LDV benchmarks follow this same pattern.

mutilin commented 7 years ago
  1. In the first comment you reported that the benchmarks access unallocated memory, but this does not imply that the benchmarks contain a path to __VERIFIER_error. So to start the investigation we need the paths leading to the ERRORs.
  2. The simple example does contain an error path.
  3. The benchmarks are the same as last year, and you said that

SMACK has never leveraged the simple memory model, and yet at this point it really does not report almost any false bugs

why do you report these issues so late?

zvonimir commented 7 years ago

Let me answer your questions one by one. (1) So you would like us to submit witnesses here as well? We can certainly do that. (2) OK, so we agree on that, great. (3) We keep improving SMACK. Last year SMACK was, I believe, timing out on most of these benchmarks, and hence it did not report these bugs. Also, this is not a lot of false bugs in my opinion, given that there are thousands of benchmarks and we only have to fix/remove around 20 of them.

Finally, it is not true that we reported them late. See our initial attempt here: https://github.com/sosy-lab/sv-benchmarks/issues/191. There are many more now because we managed to improve SMACK in the past several days, and hence it times out less. Furthermore, I reported some of these issues even last year! I had long discussions with Dirk and others about some of these issues then, and my complaints and bug reports were dismissed because this category used to be under the Simple memory model.

mutilin commented 7 years ago

(1) Yes, please submit witnesses confirmed by one of the existing validators. (3) OK, I understand, but I feel that we should find a more efficient way to cope with benchmark problems. Last-minute fixes are problematic and, more importantly, they do not solve the problem with the LDV benchmarks in general.

zvonimir commented 7 years ago

(1) We probably cannot do this, since our witness generator is not working properly, especially for large benchmarks. Also, if the tools doing witness validation cannot find these bugs themselves, then why would I trust that they are doing witness validation properly? Anyhow, what do we do now?

(3) As you probably noticed, we submitted (and also resolved) numerous issues with the SV-COMP benchmarks in the past several weeks. It is not as if we had been doing nothing until now and then intentionally submitted this at the last minute. There were numerous other (simpler) issues to report and resolve, and we just did not get to this earlier. We reported 3 such issues, as I pointed out, and even those still have not been resolved. These are by far the most complex problems.

shaobo-he commented 7 years ago

@mutilin May I ask what you want confirmed witnesses for? Does it help the debugging process other than giving confidence that the error path may be feasible?

mchalupa commented 7 years ago

@mutilin May I ask what you want confirmed witnesses for? Does it help the debugging process other than giving confidence that the error path may be feasible?

Well, if you claim that there's an error, you should have some way to convince others that there really is one. How do you convince people that it is a bug in the program, not in the verifier? When a program is hard to check manually, a witness can increase the confidence that the bug is real. If you know of any other way to support the assertion that the error is reachable, feel free to use it. For the other part of the question: yes, it can (and should) help with debugging.

shaobo-he commented 7 years ago

Well, if you claim that there's an error, you should have some way to convince others that there really is one. How do you convince people that it is a bug in the program, not in the verifier? When a program is hard to check manually, a witness can increase the confidence that the bug is real. If you know of any other way to support the assertion that the error is reachable, feel free to use it. For the other part of the question: yes, it can (and should) help with debugging.

I don't get why people keep saying "if you know other ways to do it, then do it". You did not understand my question correctly. I was simply asking whether a confirmed witness can speed up the debugging process. If you just ignore unconfirmed witnesses, that certainly speeds up debugging, since you do not have to do anything. But say you have a confirmed witness for a benchmark labeled as true: you then have to go through the error trace and find out why. During this process, how does the confirmation help you find the root cause?

Moreover, although I think the CPAchecker folks are doing a great job improving witness validation, our experience suggests that unconfirmed witnesses do not necessarily represent infeasible paths. We simply want to mention that there are benchmarks that may not be labeled correctly and to solicit feedback from the community. Again, we are not questioning the importance of witness validation; I personally think it is a very useful extension to verifiers.

zvonimir commented 7 years ago

I will excuse myself from this and related discussions. This is the third year in a row that I have brought up these issues with the device driver benchmarks. In the past, the simple memory model was the problem, and it was easy to sweep the issues under the rug. Now it is something else: for some reason, witness checkers are considered to be some kind of "gold standard", and I have absolutely no reason to believe that they are. Anyhow, I find this to be largely a waste of my time. So, instead of wasting it on pointless discussions, I will try to submit pull requests to fix these issues one by one (time permitting).

mutilin commented 7 years ago
  1. I think pull requests from you will be the best option to proceed. My colleague proposed several fixes in #191, but they do not work for SMACK, and we don't know why.
  2. If the witness validator confirms the witness, there is a good chance we can simply change the verdict of the benchmark. Otherwise, we should investigate the reason carefully; in this case the witness validator may also help us understand where the contradiction is.
  3. At the least, you can attach the current witnesses from SMACK and we will try to investigate the case.
shaobo-he commented 7 years ago

@mutilin I have explained why SMACK reports a bug even with your proposed fixes in #191. For example,

SMACK still reports bugs for these benchmarks. For example, in N1, var_group_4 is calloc'd, so cil_tmp23 at line 8285 is equal to the return value of calloc + 1600. __cil_tmp24 at line 8286 is the dereference of cil_tmp23, which equals 0 (calloc'd memory is zeroed). cil_tmp26 at line 8288 will then be 360, an address that was never allocated or initialized. __cil_tmp27 at line 8289 is the dereference of cil_tmp26, which can be any value. At line 8295, a function pointer is called which is the dereference of __cil_tmp27. So a function that results in the assertion failure can be called, since *__cil_tmp27 could alias with any function.
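
To make this pattern concrete without digging through the CIL output, here is a minimal hedged C sketch; the struct layout and the names dev, ops, and handler are hypothetical, chosen only to mirror the shape of the trace above:

#include <stdlib.h>

struct ops { void (*handler)(void); };  /* hypothetical layout */
struct dev { struct ops *ops; };        /* pointer field, zeroed by calloc */

int main(void) {
  struct dev *d = calloc(1, sizeof *d); /* all fields are zero */
  if (!d) return 0;                     /* allocation may fail */
  struct ops *t = d->ops;               /* t == NULL, read from zeroed memory */
  void (*fp)(void) = t->handler;        /* load from NULL + offset: undefined,
                                           fp can be any value */
  fp();                                 /* may invoke any function, including
                                           one that reaches __VERIFIER_error() */
}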

Could you explain what you mean by "contradiction"?

I have attached one witness in #191. I will add more to this issue.

mutilin commented 7 years ago

Could you explain what you mean by "contradiction"?

The intuition is that the witness validator can confirm some prefix of the path but cannot confirm the rest. The first transition after the confirmed prefix is the "contradiction".

For the witness attached to #191, running CPAchecker in debug mode with a time limit of 150 s, I found that it can reach only

 AutomatonState: WitnessAutomaton: A33 __DISTANCE_TO_VIOLATION=-28
 LocationState: N1339 (before line 7138)
 CallstackState: Function rtl8150_probe called from node N1492, stack depth 2 [625a4ff], stack [main, rtl8150_probe]

It points to line 7138: if (tmp___3 == 0) {

dbeyer commented 7 years ago

Is this issue a continuation of issue #191? Then I would like to close that one.

I merged PR #305 today, which fixed some of the above. Can you please point me to the PRs for the others?

dbeyer commented 7 years ago

@zvonimir Please do not compare or confuse this discussion with the discussion for the last SV-COMP edition.

For the example program that you gave above, and given the memory model Simple, the expected verification result last year was TRUE, and that is correct. We had lengthy discussions about this, and according to the jury, I interpreted the rules correctly. Your verifier reported an error path, and according to the semantics assumed by the memory category Simple, your error paths were incorrect. This is not because the validators were not strong enough to reproduce the paths; by definition, the paths were wrong, because you knowingly assumed the (wrong) memory model Precise. That was your decision and completely fine, as long as you don't complain if you don't get positive scores for it. ;-)

For this year, you proposed (in the community meeting @SV-COMP 2016 and via e-mail) to use memory model Precise for the LDV verification tasks and that was widely accepted and supported. Using memory model Precise, the above program contains a potential error path.

Do you see the difference from last year? We are talking about the same program code, but completely different verification tasks. (A completely different semantics definition.)

This time, SMACK's answer might be correct, because you now assume the correct memory model.

Now, why do we need to change the programs, instead of just relabelling them? Because (according to the creators of the benchmarks) the intention is that the challenge should be to find a proof for these programs. In other words, the crucial part of the verification task is perhaps not the code around the memory allocation, but somewhere in the control flow.

This is why I agree with you that the best line of action is to create PRs for fixing the issues.

dbeyer commented 7 years ago

@Guoanshisb To answer your question:

Does it help the debugging process other than giving confidence that the error path may be feasible?

Definitely. You can click on the "validate" button on the witness page that the results tables link to. If the validator succeeds in reproducing the error path, you will get test values, graphs of the program, a step-by-step walk through the path, a visualization of the path as a graph, and more. Much more is possible because the witnesses are stored in an exchangeable format.

This CAV paper also answers the question: http://link.springer.com/chapter/10.1007%2F978-3-319-41540-6_28

dbeyer commented 7 years ago

As Zvonimir suggests, let's continue with concrete pull requests. If nobody objects, I will close this issue.

tautschnig commented 7 years ago

It seems nobody objected to closing (and going for actual pull requests instead) within 24 days.