shkodm closed this issue 5 years ago
If I understand correctly, the reference results in this case were produced with a different OpenFOAM version, compiled and run on a different machine. I therefore expect them to be slightly different and, after we evaluate the importance of the difference, we should update the reference results we will use from now on.
Overall, we should always check for the same version of everything (the oldest one we support, I would say).
The "non-significant fluctuations in the floating point numbers" are a bit tricky. If we repeat the same simulation in the same environment, we should get identical results in our cases. A contributor should know when her changes affect the behavior (even by one digit), so that she can evaluate the significance and investigate the reasons.
Apart from this, I like the idea of having a measure of the maximum difference we introduce (could be an error, could be an improvement).
Could you maybe point to some example data for your script?
Yes, you can test it, for instance, with:
git clone https://github.com/shkodm/systemtests
git clone https://github.com/shkodm/precice_st_output
And then to output the relative differences above 1 percent:
sh systemtests/compare_results.sh precice_st_output/Ubuntu1804/Output_of-of/ systemtests/Test_of-of.Ubuntu1604/referenceOutput/ 1
We tried it together today, here are some comments:
[x] For me it didn't work with `sh`, which complained about this line:
mapfile -t array_files < <( echo "$diff_files" | sed 's/Files\|and\|differ//g' )
with this error:
systemtests/compare_results.sh: 22: systemtests/compare_results.sh: Syntax error: redirection unexpected
Executing it with `bash` instead of `sh` worked (what's the difference?). The reason is that on Ubuntu, `sh` points to `dash` rather than `bash`, and `dash` does not support bash-specific features such as `mapfile` and process substitution.
[x] It would be better style for the threshold to be an absolute value instead of a percentage (e.g. `0.01` instead of `1`).
[x] Having an average is not really a useful metric. It should better be a maximum, or both.
When opening a PR for this, please also use a low threshold (e.g. 1e-6).
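For reference, `mapfile` and process substitution `<( … )` are bash extensions; `dash`, which `/bin/sh` points to on Ubuntu, supports neither, hence the "redirection unexpected" error. A POSIX-compatible sketch of that line (the sample `diff -rq` output below is made up for illustration) could avoid arrays entirely:

```shell
#!/bin/sh
# Made-up sample of `diff -rq` output, standing in for $diff_files.
diff_files="Files a/u.dat and b/u.dat differ
Files a/p.dat and b/p.dat differ"

# POSIX replacement for:
#   mapfile -t array_files < <( echo "$diff_files" | sed 's/Files\|and\|differ//g' )
# dash has no arrays and no process substitution, so process the
# cleaned lines directly in a pipeline instead. (Note: the \| alternation
# in sed is itself a GNU extension, kept here as in the original script.)
echo "$diff_files" | sed 's/Files\|and\|differ//g' |
while read -r reference actual; do
    echo "comparing $reference against $actual"
done
```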
While fixing the test cases for Ubuntu 18.04 and bringing them to Travis as well, I noticed that some of the results differ (see #17 for documentation). This led me to a more general question:
How do we want to compare results across different versions of solvers, adapters, and operating systems? Right now, a very simple file-by-file comparison is done, so if small numerical disturbances occur, the system test fails and produces an almost completely useless output message. For example, for openFOAM <-> openFOAM, when changing OpenFOAM from version 4 to version 5, the test case fails and produces 93 files that one would need to compare manually against the reference to find the reason for the failure. With SU2-CalculiX, the number of files is larger than 1000. In the end, if one went through the files, one would find that the difference is just the header of the output file (where the OpenFOAM version is specified) and non-significant fluctuations in the floating-point numbers. So this test probably should not have failed in the first place.
My idea (already discussed with @MakisH and @BenjaminRueth) is to have some kind of "maximum numerical difference": instead of a dumb comparison between the folders with reference and actual results, compare only the relevant numerical values and make the test fail only if the difference is large enough. I already played around with it and have a script that does the generic parsing and computes an "average relative numerical difference" between folders here. What else can we do? Of course, we could also just upload new results for each solver and adapter version, but then the size of the repository would grow quite large. Any ideas or suggestions? @floli @MakisH @BenjaminRueth
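As a rough sketch of what such a tolerance-based check could look like (the file names, the two-column layout, and the 1e-6 tolerance below are all illustrative, not the actual systemtests output format):

```shell
#!/bin/sh
# Sketch: fail only when the maximum relative difference between two
# whitespace-separated numeric files exceeds an absolute tolerance.
# Illustrative data: one value differs by a relative 5e-8.
printf '1.0 2.0\n3.0 4.0\n' > reference.dat
printf '1.0 2.0000001\n3.0 4.0\n' > actual.dat

paste reference.dat actual.dat | awk -v tol=1e-6 '
{
    n = NF / 2                       # first half: reference, second half: actual
    for (i = 1; i <= n; i++) {
        ref = $i; act = $(i + n)
        denom = (ref < 0 ? -ref : ref)
        if (denom == 0) denom = 1    # avoid division by zero for exact zeros
        d = (act - ref) / denom
        if (d < 0) d = -d
        if (d > max) max = d
    }
}
END {
    printf "max relative difference: %g\n", max
    exit (max > tol) ? 1 : 0        # nonzero exit status fails the test
}'
# Passes here: the max relative difference (5e-08) is below the 1e-6 tolerance.
```

Reporting the maximum (rather than only an average) directly addresses the checklist point above, and the exit status makes the script usable as a pass/fail step in Travis.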