shkodm closed this issue 5 years ago
If I understand correctly, the reference results in this case were produced with a different OpenFOAM version, compiled and run on a different machine. I therefore expect them to be slightly different and, after we evaluate the importance of the difference, we should update the reference results we will use from now on.
Overall, we should always check for the same version of everything (the oldest one we support, I would say).
The "non-significant fluctuations in the floating point numbers" are a bit tricky. If we repeat the same simulation in the same environment, we should get identical results in our cases. A contributor should know when her changes affect the behavior (even by one digit), so that she can evaluate the significance and investigate the reasons.
Apart from this, I like the idea of having a measure of the maximum difference we introduce (could be an error, could be an improvement).
Could you maybe point to some example data for your script?
Yes, you can test it, for instance, with:
git clone https://github.com/shkodm/systemtests
git clone https://github.com/shkodm/precice_st_output
And then to output the relative differences above 1 percent:
sh systemtests/compare_results.sh precice_st_output/Ubuntu1804/Output_of-of/ systemtests/Test_of-of.Ubuntu1604/referenceOutput/ 1
We tried it together today, here are some comments:
[x] For me it didn't work with `sh`, which complained about this line:
mapfile -t array_files < <( echo "$diff_files" | sed 's/Files\|and\|differ//g' )
with this error:
systemtests/compare_results.sh: 22: systemtests/compare_results.sh: Syntax error: redirection unexpected
Executing it with `bash` instead of `sh` worked (what's the difference?). The reason is that on Ubuntu, `sh` points to `dash` rather than `bash`, and `dash` does not support bash-specific features such as `mapfile` and process substitution.
[x] It would be better style for the threshold to be an absolute value instead of a percentage (e.g. `0.01` instead of `1`).
[x] Having an average is not really a useful metric. It should better be a maximum, or both.
When opening a PR for this, please also use a low threshold (e.g. 1e-6).
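For reference, `mapfile` and process substitution `<( … )` are bash extensions; `dash`, which `/bin/sh` points to on Ubuntu, supports neither, hence the "redirection unexpected" error. A POSIX-compatible sketch of that line (the sample `diff -rq` output below is made up for illustration) could avoid arrays entirely:

```shell
#!/bin/sh
# Made-up sample of `diff -rq` output, standing in for $diff_files.
diff_files="Files a/u.dat and b/u.dat differ
Files a/p.dat and b/p.dat differ"

# POSIX replacement for:
#   mapfile -t array_files < <( echo "$diff_files" | sed 's/Files\|and\|differ//g' )
# dash has no arrays and no process substitution, so process the
# cleaned lines directly in a pipeline instead. (Note: the \| alternation
# in sed is itself a GNU extension, kept here as in the original script.)
echo "$diff_files" | sed 's/Files\|and\|differ//g' |
while read -r reference actual; do
    echo "comparing $reference against $actual"
done
```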
While fixing the test cases for Ubuntu 18.04 and bringing them to Travis as well, I noticed that some of the results differ (see #17 for documentation). This led me to a more general question:
How do we want to compare results across different versions of solvers, adapters, and operating systems? Right now, a very simple file-by-file comparison is done, so if small numerical disturbances occur, the system test fails and produces an almost completely useless output message. For example, for openFOAM <-> openFOAM, when changing OpenFOAM from version 4 to version 5, the test case fails and produces 93 files that one would need to compare manually against the reference to find the reason for the failure. With SU2-CalculiX, the number of files is larger than 1000. In the end, if one went through the files, one would find that the difference is just the header of the output file (where the OpenFOAM version is specified) and non-significant fluctuations in the floating-point numbers. So this test probably should not have failed in the first place.
My idea (already discussed with @MakisH and @BenjaminRueth) is to have some kind of "maximum numerical difference": instead of a dumb comparison between the folders with reference and actual results, compare only the relevant numerical values and make the test fail only if the difference is large enough. I already played around with it and have a script that does the generic parsing and computes an "average relative numerical difference" between folders here. What else can we do? Of course, we could also just upload new results for each solver and adapter version, but then the size of the repository would grow quite large. Any ideas or suggestions? @floli @MakisH @BenjaminRueth
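As a rough sketch of what such a tolerance-based check could look like (the file names, the two-column layout, and the 1e-6 tolerance below are all illustrative, not the actual systemtests output format):

```shell
#!/bin/sh
# Sketch: fail only when the maximum relative difference between two
# whitespace-separated numeric files exceeds an absolute tolerance.
# Illustrative data: one value differs by a relative 5e-8.
printf '1.0 2.0\n3.0 4.0\n' > reference.dat
printf '1.0 2.0000001\n3.0 4.0\n' > actual.dat

paste reference.dat actual.dat | awk -v tol=1e-6 '
{
    n = NF / 2                       # first half: reference, second half: actual
    for (i = 1; i <= n; i++) {
        ref = $i; act = $(i + n)
        denom = (ref < 0 ? -ref : ref)
        if (denom == 0) denom = 1    # avoid division by zero for exact zeros
        d = (act - ref) / denom
        if (d < 0) d = -d
        if (d > max) max = d
    }
}
END {
    printf "max relative difference: %g\n", max
    exit (max > tol) ? 1 : 0        # nonzero exit status fails the test
}'
# Passes here: the max relative difference (5e-08) is below the 1e-6 tolerance.
```

Reporting the maximum (rather than only an average) directly addresses the checklist point above, and the exit status makes the script usable as a pass/fail step in Travis.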