better regression testing

vogler commented 3 years ago

Problems with the current solution (update_suite.rb):

setup: dependency on ruby and gem install parallel
another programming language for beginners
the code is hard to understand
argument parsing is very manual, only fixed order works
regtest.sh (which allows passing extra arguments for debugging individual cases) should be combined
it's not very fast
- with dune runtest it could only re-run only things that changed (also it could keep preprocessed input, syntax-highlighted source etc.)
it's not very flexible
- if we do checks in goblint, we could better handle precision changes
we use comments to check results: https://github.com/goblint/analyzer/blob/6ebe8168c19687c6227ac41127f1e7c02a46e4d9/scripts/update_suite.rb#L385-L386

Original comment for this issue was regarding the last point.

@jerhard and I were tossing this back and forth over lunch and something we discussed at some point was using something along the lines of __goblint_check(a == 42, __GOBLINT_UNKNOWN) or __goblint_check(x ==17, __GOBLINT_SUCCESS) in the regression tests instead of having an assert(...) //UNKNOWN.

Then, the expected result is immediately obvious to Goblint itself, and Goblint can either output things for all asserts or only those where something failed without having to somehow do a regex on the comments.

Also, once could use the syntactic search to identify those places where __goblint_check(...) is called that are unreachable and then also warn there that the assert is unreachable. (These are the cases where warnings are missing in ./regtest.sh vs the ruby script).

The question that is still open is how to integrate warnings different from asserts into this setting.

Opinions?

Originally posted by @michael-schwarz in https://github.com/goblint/analyzer/issues/286#issuecomment-879796205

vogler commented 3 years ago

Context: https://github.com/goblint/analyzer/pull/286#issuecomment-879788192

I only realized this when I used regtest (which awfully does similar expected comparisons within Goblint...) and saw different output.

I agree with 'awful' concerning duplication, but IMO it'd be better for goblint to handle these things instead of some ruby script - ideally we'd get rid of it in favor of an OCaml executable or some minimal shell script (parallel) with goblint doing most of the work.

https://github.com/goblint/analyzer/pull/286#issuecomment-879848994

Also, once could use the syntactic search to identify those places where __goblint_check(...) is called that are unreachable and then also warn there that the assert is unreachable. (These are the cases where warnings are missing in ./regtest.sh vs the ruby script).

Good point to keep in mind. This (together with the rest?) could be done as post-processing in finalize of the assert-analysis.

using something along the lines of __goblint_check(a == 42, __GOBLINT_UNKNOWN) or __goblint_check(x ==17, __GOBLINT_SUCCESS) in ehe regression tests instead of having an assert(...) //UNKNOWN.

Pro:

type-safe (constant instead of comment)
available in CIL instead of having to read the comment from source

Con:

can't compile and execute it anymore without including another file?
assert is easily understood and portable (assert.h is available on every system)

https://github.com/goblint/analyzer/pull/286#issuecomment-879851445

Tbh, I don't really like the idea of this regression testing check (dbg.regression for regtest) being inside Goblint itself at all. It's ideologically wrong to have the testing logic right in the middle of Goblint (which would be both the tester and the testee simultaneously) instead of being independent. And having to write __goblint_check(x == 17, __GOBLINT_SUCCESS) is so much more verbose than a standard assert(x == 17).

But that's all beside the point for this issue of update_suite not reporting possible unsoundness cases. If we want to discuss completely redoing the regression testing architecture, it's best to have another issue for that. And that won't solve the fact that we still have ambiguous uses of UNKNOWN.

vogler commented 3 years ago

Tbh, I don't really like the idea of this regression testing check (dbg.regression for regtest) being inside Goblint itself at all. It's ideologically wrong to have the testing logic right in the middle of Goblint (which would be both the tester and the testee simultaneously) instead of being independent.

In principle yes, but the case here (at least at the moment) is just comparing strings, so it shouldn't be a problem. I think it's nice to be able to run goblint on a file and let it check the assertions. If you want that feature, the rest is not much more on top.

And having to write __goblint_check(x == 17, __GOBLINT_SUCCESS) is so much more verbose than a standard assert(x == 17).

Agree.

vogler commented 3 years ago

dune runtest/dune promote is nice, but it's also nice to have each test contain the expected result. https://dune.readthedocs.io/en/stable/tests.html#diffing-the-result

Maybe goblint (or some script) could annotate a file with comments for each assert/warning which could then be used for diff/promote.

sim642 commented 3 years ago

That's an interesting idea because it could be used to test for arbitrary warnings (or the lack thereof, which currently must be manually annotated by NOWARN). And it would be testing the actual Goblint output for assertions, just like an user would see on actual programs.

Although it might be too complicated to implement because we'd need to copy the input file and then modify it to add the comments purely based on CIL locations. And it takes away the possibility for TODO to allow precision improvement without making it an error.

vogler commented 3 years ago

Although it might be too complicated to implement because we'd need to copy the input file and then modify it to add the comments purely based on CIL locations.

One needs to come up with a stable solution, but I don't think it'd be too hard. Could just use //@ to indicate these annotations, strip //@.* from the input, then append goblint's WARN X as //@ WARN X.

And it takes away the possibility for TODO to allow precision improvement without making it an error.

Could just leave TODO in place, then have a sep. diff to detect those improvements (just as output, not as an error).

goblint / analyzer

better regression testing #288