huanzhang12 opened this issue 1 year ago
Thanks @huanzhang12 and @KaidiXu, this helps. Based on this, I guess the situation looks like this:
In this case, NNV reported a counter-example that was within a small tolerance of the possible inputs, although other tools said it was unsat. What is the suggested way to handle this case? It seems like both SAT and UNSAT are possible answers.
Something else that could happen with tolerances would be if the outputs looked like this:
Same thing here: both answers seem to be possible.
Also: any idea if something similar happened with the 2023 traffic sign recognition benchmark? The input/output errors there look much larger than the tolerance.
The rules were a bit ambiguous on this point... I had to modify things to state when a relative and when an absolute tolerance was used, and I updated the tolerances... trying to make it so that no tool was penalized when the outputs did not exactly match the witness when executing with onnxruntime, but I didn't anticipate it could affect the sat/unsat verification result.
The latest wording is "However, there will be no penalty as long as the execution of the network using onnxruntime matches the witness up to a small numerical precision (mismatch is less than 1e-3 relative error), and the constraints are met up to 1e-4 absolute error."
The original wording from back in April was: "However, there will be no penalty as long as the execution of the network using onnxruntime matches the witness up to a small numerical precision (mismatch is less than 1e-4), and the constraints are met up to this same precision."
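For concreteness, here is a minimal sketch of how the two checks in the latest wording could look. This is only an illustration of the quoted rule text, not the actual competition checking code; the function and variable names (`y_witness`, `y_onnx`, `lb`, `ub`) are made up.

```python
import numpy as np

# Illustrative only: the two tolerances from the quoted rule text.
OUTPUT_REL_TOL = 1e-3      # onnxruntime output vs. witness, relative error
CONSTRAINT_ABS_TOL = 1e-4  # constraint satisfaction, absolute error

def witness_matches_execution(y_witness, y_onnx, rel_tol=OUTPUT_REL_TOL):
    # relative mismatch between the saved witness output and the onnxruntime output
    return np.all(np.abs(y_witness - y_onnx) <= rel_tol * np.abs(y_onnx))

def input_box_satisfied(x, lb, ub, abs_tol=CONSTRAINT_ABS_TOL):
    # input box constraints from the vnnlib file, checked with absolute slack
    return np.all(x >= lb - abs_tol) and np.all(x <= ub + abs_tol)
```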
Actually, the counterexamples provided by NNV violate the constraints on $x$ rather than $y$, like the figure shown here. The output is indeed SAT given the invalid $x$. The 1e-4 absolute error on the input constraints is too large for the nn4sys benchmark since its constraints are relatively small, so I believe relative error makes more sense. Also, there was a discussion about it over here, but it only mentioned using relative error on output constraints.
@stanleybak Thank you for getting back to me and looking into the issue! I read the latest rules, and they do say 1e-4 absolute error on input constraints, so those adversarial examples are valid under the rules. But tools that give "unsat" results should not be considered wrong either, because both "unsat" and "sat" can be counted as correct due to the large absolute error tolerance; if the tolerance were smaller, the result would be "unsat". I feel relative error makes more sense in this setting (since $x$ is not normalized), and we can discuss this next year.
For `traffic_sign_recognition`, there were two separate issues. First, the ONNX predictions greatly differ from the Y (witness) saved in the counter-example files. Due to the sign activation, a small numerical error in the middle of the network is amplified to 0 or 1, causing entirely different results. Although the ONNX prediction may still be SAT, it is not the same as the witness, so under the rule these adversarial examples will be counted as incorrect.
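As a toy illustration of this amplification effect (not the actual benchmark network; the numbers are made up):

```python
import numpy as np

# Two nearly identical intermediate values, e.g. from a tool vs. onnxruntime,
# differing only by tiny floating-point noise.
z_tool = np.array([ 1e-7, -1e-7])
z_ort  = np.array([-1e-7,  1e-7])

# The sign activation maps them to completely different discrete values,
# so everything downstream can also differ completely.
print(np.sign(z_tool))  # [ 1. -1.]
print(np.sign(z_ort))   # [-1.  1.]
```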
The second issue was a bug in the scoring script. When multiple tools find counter-examples, as long as one tool finds a valid counter-example, the script assumes the counter-examples found by the other tools are all correct, even if they are actually "incorrect" due to the numerical issue above. This is a bug that should be fixed; otherwise the penalty will be incorrect (too low) for tools that indeed report wrong counter-examples. The relevant part of the scoring script is:
```python
for ce_valid_res in ce_results.values():
    if ce_valid_res == CounterexampleResult.CORRECT:
        valid_ce = True
        break
```
Here `valid_ce` means that any of the tools found a valid counterexample. Following the if statements after this, there are two conditions that lead to a penalty:
```python
elif penalize_no_ce and res == "violated" and not ce_results[tool_name]:  # in 2022 also had: num_holds > 0
    # Rule: If a witness is not provided when "sat" is produced, the tool will be assessed a penalty.
    score = Settings.PENALTY_INCORRECT
# some code here omitted
elif res == "violated" and not valid_ce:
    # incorrect witness
    score = Settings.PENALTY_INCORRECT
```
If I understand the script correctly, in the first `elif`, the condition `ce_results[tool_name]` only checks whether a witness is provided, not whether it is correct; in the second `elif`, the script only checks `valid_ce`, which is True if any of the tools report a correct counterexample.
The proposed fix is to change the second `elif` to include a condition that actually checks for an incorrect witness:
```python
elif res == "violated" and (not valid_ce or ce_results[tool_name] != CounterexampleResult.CORRECT):
```
For this benchmark, the results will change dramatically after fixing this bug.
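As a rough sketch of how the old and proposed conditions differ (the tool names, results, and the enum definition below are hypothetical, only for illustration):

```python
from enum import Enum

# Hypothetical stand-in for the CounterexampleResult used in the scoring script.
class CounterexampleResult(Enum):
    CORRECT = 1
    INCORRECT = 2

# Suppose two tools report "violated"; only tool_A's witness checks out.
ce_results = {"tool_A": CounterexampleResult.CORRECT,
              "tool_B": CounterexampleResult.INCORRECT}
valid_ce = any(r == CounterexampleResult.CORRECT for r in ce_results.values())

for tool_name in ce_results:
    res = "violated"
    # old condition: never triggers here, since valid_ce is True thanks to tool_A
    old_penalty = res == "violated" and not valid_ce
    # proposed condition: tool_B is penalized because its own witness is not CORRECT
    new_penalty = res == "violated" and (
        not valid_ce or ce_results[tool_name] != CounterexampleResult.CORRECT)
    print(tool_name, old_penalty, new_penalty)
# tool_A False False
# tool_B False True
```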
Hello @huanzhang12 @stanleybak @KaidiXu,
I looked into the results for the 2023 nn4sys benchmark after the competition, and I found that all the counterexamples we (NNV) generated were wrong (only for this benchmark), due to a parsing error for this type of vnnlib specification.
I reported our errors to Christopher in case we need to recompute the scores, but I had missed this issue.
In the VNN-COMP 2023 nn4sys benchmark, many tools got "incorrect" results. Since most of this benchmark is the same as in 2021 and 2022, these "incorrect" results look quite suspicious, so @KaidiXu and I looked into why they happened. We found that the counterexamples provided by one tool violate the constraints on x, but this was not detected by the evaluation script because an absolute error threshold is used. Other tools verified these examples to be safe and were mistakenly counted as "incorrect".
Below are the two instances where our tool (alpha-beta-CROWN) was counted as “incorrect”:
- In `lindex_deep_lindex_1000.counterexample`, the counterexample has x = 0.008319346369826038, while the actual range of x given by the vnnlib is 0.0082591306418180 <= x <= 0.0082597080618143 (see line 200 after unzipping the vnnlib file).
- In `lindex_lindex_1000.counterexample`, the counterexample has x = 0.008319346369826038, while the actual range of x defined by the vnnlib is 0.0083549842238426 <= x <= 0.0083560608327389 (see line 358 after unzipping the vnnlib file).
In these two counterexamples, the input x is clearly outside of the valid ranges. However, the evaluation script uses an absolute threshold of 1e-4 to check the bounds on x, which is insufficient here because x is very close to 0, so large relative errors on x were not detected. I think the counterexample checking script should use relative error to determine whether x is out of range; this matters when x is very close to 0, as in this case. We hope the competition organizers can rerun the evaluation script for this benchmark with the corrected tolerance and double-check these counterexamples for the final report.
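A quick sanity check with the first counterexample above shows the difference between the two conventions (only the numbers come from the counterexample and vnnlib files; the tolerances and the check itself are just an illustration):

```python
x = 0.008319346369826038                         # from lindex_deep_lindex_1000.counterexample
lb, ub = 0.0082591306418180, 0.0082597080618143  # bounds from the vnnlib file

abs_tol = 1e-4   # absolute threshold currently used by the evaluation script
rel_tol = 1e-3   # an example relative tolerance

within_abs = (lb - abs_tol) <= x <= (ub + abs_tol)
within_rel = (lb - rel_tol * abs(lb)) <= x <= (ub + rel_tol * abs(ub))

print(within_abs)  # True  -> the out-of-range x passes the absolute check
print(within_rel)  # False -> a relative check rejects it
```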