casella opened this issue 9 months ago
I agree that such information is useful, and ideally it should only be needed for a fraction of the models.
I believe that what we have done for Dymola is, instead of modifying the package, to keep that information available externally - and also to loosen the tolerance so that we don't get false positives; but I will check.
As I see it, that has a number of advantages:
I'm not 100% happy with this direction, as it means that the default will be to use the settings from the experiment annotation, meaning that we don't automatically detect that decreasing the tolerance by an order of magnitude has a significant impact on the result.
I see good reasons to generate reference results according to a procedure where the tolerance is automatically set an order of magnitude lower than that in the experiment, so that CI correctness tests running with the experiment settings can detect when the experiment is inappropriate.
I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this. Similar to what @HansOlsson says regarding Dymola, we at Wolfram also keep information for overriding the experiment in files that are separate from the Modelica code, with the ability to both control reference result generation and settings for running tests (used in rare situations as an alternative to using very sloppy comparison tolerances when the experiment settings are considered really poor).
A related topic is the use of a different StopTime for regression testing and for model demonstration, but this might be better handled with a TestStopTime inside the normal experiment.
I honestly would not introduce any new annotations etc to the language. In my opinion Modelica already has all the means one needs to achieve this, e.g., inheritance and modification. What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.
For example, you can have basic user experiments in your library (i.e., examples) and in a separate regression test library refine these experiments with different settings (change tolerances, stop times, whatever); a sketch of such a setup is shown below. In this setup, the regression testing setup is not part of the final library distribution, as it likely should be (because you might add a bunch of scripts and other software for continuous integration support, resulting in huge dependencies to make your automatic tests run). You just don't want to ship all of that. And as @henrikt-ma mentions, good practice at the tool vendors is already to have separate settings for this.
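To make that concrete, here is a minimal sketch of such a refinement-by-inheritance setup; all names (MyLibrary, MyLibraryTests, Example1) and all settings are hypothetical:

```modelica
// Shipped library: user-facing example with its normal experiment settings.
package MyLibrary
  package Examples
    model Example1
      Real x(start = 1, fixed = true);
    equation
      der(x) = -x;
      annotation (experiment(StopTime = 10, Tolerance = 1e-4));
    end Example1;
  end Examples;
end MyLibrary;

// Separate regression-test library (not shipped): reuses the example via
// inheritance and only overrides the simulation setup used for testing.
package MyLibraryTests
  model Example1Test "Tighter settings, used only for regression testing"
    extends MyLibrary.Examples.Example1;
    annotation (experiment(StopTime = 10, Tolerance = 1e-7, Interval = 0.01));
  end Example1Test;
end MyLibraryTests;
```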
I think, testing is a matter of project policies and templates. MSL can come up with a good scheme for its use case and improve its infrastructure without any need for language changes.
I respectfully disagree with @henrikt-ma and @christoff-buerger:
Regarding this comment
What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.
it is true that each MA project has its own responsibility, but the whole point of the MA is to provide coordinated standards and libraries. In this case, I clearly see the need for a small language change to support the work by MAP-Lib. BTW, we are talking about introducing one new annotation, which is no big deal. We could actually introduce a vendor annotation __MAP_Lib_verificationExperiment, but this really seems overkill to me, can't we just have it as a standard Modelica annotation?
@GallLeo, @beutlich, @dietmarw, @AHaumer, @hubertus65, I'd like to hear your opinion.
First of all: we need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing. For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.
I had a look at the "heuristics" proposed 10 years ago, when I started creating the first set of reference results: https://raw.githubusercontent.com/wiki/modelica/ModelicaStandardLibrary/MSLRegressionTesting.pdf#page=11 Takeaway: these rules worked quite fine for most test cases (at least much better than if we had used tolerance 1e-4 everywhere). Back then, I was not able to use a tight tolerance everywhere, because some models would only simulate with their default tolerance. Are these models to be considered "bad models"? Or should we be able to test any working model?
After the 4.0.0 release, @beutlich documented his adaptation of tolerances in the MSL wiki: https://github.com/modelica/ModelicaStandardLibrary/wiki/Regression-testing-of-Modelica-and-ModelicaTest Takeaway: finding the "right" strict tolerance for MSL examples still needs manual work.
So, in order to make it transparent, I'm in favor of explicitly specifying verification settings. But then we have to limit the manual work for setting these up as much as possible: if an example/test case is changed, should the verification settings be updated automatically? Example: if you change the stop time of an example from 1000 s to 2000 s, interesting dynamics could be missed if you keep your verification stop time at 1000 s.
Where to specify the explicit verification settings?
@christoff-buerger proposed using extends.
We could do that in ModelicaTest, without manually writing a new library:
ModelicaTest.Modelica...MyExample extends Modelica...MyExample annotation(experiment(...));
Benefit: All MSL-test cases would reside in ModelicaTest (no need to run two libraries)
Drawback: CI or library developer has to generate the extends-test-case in ModelicaTest.
If the test case for a new example is not generated, it will be missed in regression runs.
If the experiment setup of an example is changed, it will not be automatically updated in the test case.
@HansOlsson proposed storing the verifications setup externally.
This might work, as long as you test in one Modelica tool and this Modelica tool cares about the verification setup.
I'm unsure how to handle it in multi-tool environments.
We already have the burden of updating Resources/Reference/Modelica/.../MyExample/comparisonSignals.txt
Separate files are very hard to keep up to date; therefore, storing verification settings in the example/test case seems to be the right place to me, especially if we think about using signals from the figures annotation as important comparison signals.
I would add arguments to experiment instead of a new verificationExperiment.
Reasons:
I am pretty much aligned with @GallLeo here.
Using example models' simulation results as reference results for cross-tool, cross-solver or cross-MSL-version regression testing can be considered a misuse. In strict consequence there should be no directory in Modelica/Resources/Reference/ at all.
Reference models should be taken from ModelicaTest. If example models from Modelica shall be also considered for regression testing, it might be an option to extend from it and adopt the experiment settings. This would also simplify the job of the regression tester since only one library needs to be taken into account.
I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.
I am in favour of keeping reference models outside Modelica and only using ModelicaTest for it. I am in favour of not having reference signals and experiment settings distributed in various locations, i.e., option 2 or 3 are my preferences. I even see more advantage in option 3. In that case it is left to specify (if at all)
I still need to be convinced that we need to have this in the specification or if it is simply up to the library developers of ModelicaTest. (Remember, it is now all in ModelicaTest and not any more in the MSL.)
I would also be in favour of having the verification settings separate from the simulation model but referenced from within the simulation model. So basically @beutlich's option 3.
First of all: we need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing. For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.
I agree that many of those rules will not be adopted by all libraries.
Additionally, to me, use-cases such as this are one of the reasons we want the possibility to have the information completely outside of the library, so that the user of the library can add the things needed for testing - without changing the upstream library.
IMHO @casella is right with his analysis and his idea. I'm not fond of storing the comparisonSignals.txt and separate settings for comparison / regression tests in addition to the model. It is much better to store this information (stop time, interval length, tolerance, what else?) within the model, either in a second annotation or as part of the experiment annotation. If not present, the settings of the experiment annotation are valid (in most cases). Additionally we could specify the tolerance for comparison. The model developer decides whether changes for comparison / regression tests are necessary or not, maybe after receiving a message that comparison / regression tests are problematic (could be from tool vendors). As this annotation would be supported by most tools, this could be adopted by all third-party libraries. Regarding test examples: I think it's a good idea to use all "normal" examples that are intended to demonstrate "normal" usage, and additionally "weird" examples from Modelica.Test.
One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.
I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.
Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.
I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.
I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.
One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.
I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.
I agree with testing what you ship (as part of testing; you can add more) and even this case involves two different things:
* Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash
* Comparing it with the reference (will be outside the bounds if using original end-point)
Note that there can be other reasons than chaos for separating testing from running, e.g.:
Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:
Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.
(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.) I don't know why the StopTime is 1.01 s instead of 1 s.
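For illustration, the regression-oriented setup described in that documentation corresponds roughly to the following experiment annotation; the Tolerance value is my assumption, it is not given in the quoted text:

```modelica
annotation (experiment(
  StopTime = 1.01,    // shortened from the 3 s recommended for plotting
  Interval = 6e-5,    // about 50000 output points over 3 s; 1e-4 still sort of works
  Tolerance = 1e-6)); // assumed value, not stated above
```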
I agree with testing what you ship (as part of testing; you can add more) and even this case involves two different things:
* Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash
* Comparing it with the reference (will be outside the bounds if using original end-point)
Thank you for clarifying this. I agree the test (in a perfect world) in this case should include both these steps.
Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:
Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.
(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.)
As a user of the model, I would not want to see the weird under-sampling of the engineTorque either. How am I supposed to know if it is the actual behavior?
The file size issue for testing can be dealt with by the tool only storing the variables you need for the comparison. For the user, if we introduce figures in the model, tools could be smarter there as well and only store (by default) what you need for plotting (and animation).
I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.
At Wolfram, where we have somewhat extensive use of figures in the models, we have the default policy to simply use all variables appearing in figures as comparison signals (@GallLeo already briefly mentioned this idea above). It has several advantages; for example, figures that exist primarily for testing purposes can be kept in their own Figure.group (Regression Testing, for instance).
One small limitation of the approach is that there is no way to say that a variable in a figure should be excluded from correctness testing. I'm not aware of any model where we have needed this, but still… As @maltelenz suggested, the TestCase annotation could be the perfect place for this sort of information. For example:
annotation(TestCase(excludeVariables = {my.first.var, my.arr[2].y}));
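As a sketch of how this could fit together, assuming the MLS 3.6 figures annotation and the hypothetical excludeVariables field proposed above (the model and all variable names are made up):

```modelica
model TankExample "Hypothetical example whose figure variables double as comparison signals"
  Real level(start = 0.5, fixed = true) "Tank level";
  Real noisyAux = sin(100*time) "Plotted for illustration, but not suited for comparison";
equation
  der(level) = 0.1 - 0.2*level;
  annotation (
    experiment(StopTime = 100, Tolerance = 1e-6),
    figures = {
      Figure(
        title = "Tank behaviour",
        group = "Regression Testing",  // group used to collect testing-oriented figures
        plots = {Plot(curves = {
          Curve(y = level, legend = "level"),
          Curve(y = noisyAux, legend = "auxiliary")})})},
    TestCase(excludeVariables = {noisyAux}));  // hypothetical field, as proposed above
end TankExample;
```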
I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.
I think it is well understood that there isn't one Tolerance that will work for all models. What I'm afraid is less well understood is that it is very hard to detect this when reference results are generated with the same tolerance which is used when running a test. To avoid this, I'd like something like the following:
annotation(TestCase(experiment(toleranceFactor = 0.01)));
The default for toleranceFactor should be something like 0.1, and I would expect that there would rarely be a reason to override the default; when there is a correctness error due to tolerance settings, one would fix this by adjusting the Tolerance, not by using a toleranceFactor closer to 1.
I am also in favor of adding a new annotation, or an entry to the experiment annotation, that allows specifying the data needed to produce result files for cross comparison. Our current setup is to simply increase the tolerance by a factor of 10, but such a global policy turns out to be unsatisfactory. In my opinion, it is the modeler's responsibility to add this information. Most of our contributors are not "professional" developers but mainly researchers or students who use the library and occasionally develop contributions, so asking them to put all information inside the .mo file would be preferred.
Therefore, having model-specific specifications for the CI tests would be valuable. These specifications should include the StopTime for the tool cross-checking of results if a model is chaotic.
Currently we have (some of) this information spread across different files for about 1700 test models. This is due to historical reasons. It would be good to refactor this, as the current situation makes it hard for new developers to figure out what is to be specified where, and it makes it harder to explain how to port changes across maintenance branches. Having these inside an annotation in the .mo file is my preferred way.
This seems quite important, we can mention at the next phone meeting - but we will work a lot on it outside of those meetings as well.
This seems quite important, we can mention at the next phone meeting - but we will work a lot on it outside of those meetings as well.
Please just make sure to keep the language group in the loop. Is it the MAP-Lib monthly meeting one should participate in to engage in the work, or some other forum?
One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.
Conceptually speaking, it's both things, as also mentioned by @mwetter in his comment below. Whether or not we want to put the information about how to create the reference data in the same annotation or in a different annotation it's just a matter of taste but does not change the fundamental requirement (for me), i.e., that all the relevant information for testing a model should be embedded in the model through annotations, not spread over other files. We already have this concept for documentation, I don't really see why testing should be different.
Regarding storing the names of the variables to be used for automated comparisons (possibly using regexp to keep them compact), I also think they should have a place in an annotation, rather than in a separate file, for the same reason the documentation is embedded in the model and not stored elsewhere. I agree with @henrikt-ma that we should make good use of the information provided by the figures annotation by default, because in many cases variables that you want the user to see are also good to check if the simulation results are correct. But these two requirements are not necessarily identical (see my next comment), so there should be means to declare that some plotted variables should not be used for testing, or that some more variables should be tested that are not plotted.
I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this
@henrikt-ma with cross-tool checking of the Buildings library, this happens in a significant number of cases, and it's really a nuisance if you want to make the library truly tool-independent with a high degree of dependability. @mwetter can confirm. The solution, so far, was to tighten the tolerance in the experiment annotation, e.g. to 1e-7, for all those cases. This is really not a clean solution, because those simulations become unnecessarily slower and, most importantly, the difference in the obtained results is small, much smaller than the modelling errors inherent in the simulation model, which of course has lots of modelling approximations. As I'll argue once more in the next comment, the requirements for human inspection are completely different from the requirements for automatic cross-tool checking.
I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.
@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.
I re-read all the comments, and I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest; according to the other, the information should be embedded in the models themselves. Let me add some extra considerations about that, which I hope can further clarify the issue.
Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.
I agree with that, with a few important remarks
I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.
Good point.
I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest, according to the other instead, the information should be embedded in the models themselves.
Just a quick remark: this division is not terribly surprising to me, because it can also be observed in general in programming languages' testing tools -- in my personal experience there seem to be (at least) two major "schools":
I'll refrain from discussing respective pros/cons, since much of this comes down to personal preference, and to avoid bloating the discussion here. I just want to point out that the two approaches have widespread use/merit, and which one is "better" depends on a load of factors, probably.
As far as my experience is concerned, what sets Modelica somewhat apart from the ones above is the noticeable dominance of "reference testing"/golden master/regression testing. The approach of asserting smaller facts about models' behaviour, e.g. like XogenyTest proposed, seems to have very little to no use from my vantage point. I guess much of this is owed to the specific problem domain (in essence, time-dependent ODE simulation), but I'm wondering if an assert-oriented testing approach (i.e. testing specific facts about a component/model, not the whole set of time traces) could alleviate some of the problems encountered?
I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.
@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.
I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements. The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.
If we instead generate the reference result with a sloppy tolerance from the experiment, it could already have results that are on the "edge" of the "band" around the perfect solution that we consider acceptable. If a different tool then gets results that are the same distance away from the perfect solution, but on the other edge of the band, it will fail the test.
As a last remark (I apologize for flooding this ticket), it is my goal as the leader of MAP-Lib to eventually provide an MA-sponsored infrastructure that can test open-source libraries (starting from the MSL and ModelicaTest, of course) by running all models that can be simulated with all Modelica tools, automatically comparing their results and visualizing the outcome of the comparisons in a web interface. This would allow the library developer(s) to easily inspect the results, and eventually pick the results of any one tool as reference results, based on his/her expert judgement, so that he/she is not limited in the development to the availability of tool(s) for which he or she has a license. It will also allow the library users to pick the best tool(s) to run a certain library. I believe this is the only way we can truly foster the concept of tool-independent language and libraries, moving it from theory to practice.
In order for this thing to work, we have to lower the entry barrier as much as we can, so that we get as many libraries as possible in. As I envision it, one could start by asking for his/her library to be tested. He/she will then be able to compare the results with different tools, which may also indirectly point out models that are numerically fragile, and eventually select reference trajectories among the ones that were computed, not necessarily with the tool(s) that he or she has installed on his computer. In most cases, assuming the tools are not buggy and that the model is kosher with respect to the language spec (a mandatory requirement IMHO), selecting any specific tool result obtained with the default experiment annotation as a reference will cause all other tool results to fall within the CSV-compare tubes, so everybody's happy.
For a few corner cases (2-5%?) it will be clear that the model is numerically trickier, so the library developers may introduce a verificationExperiment annotation to determine tight enough conditions that allow the numerical errors to be reduced below the CSV-compare tool tolerance, so that all tool results are close enough. In some cases, it will be clear that the results of some tools are plain wrong, and this will provide useful information to tool developers for bug fixing. Some other times, every tool could give a different result, which may be a strong indication of a numerically fragile model, which is useful information for the library developer. This is also an essential feature for MSL development: developing the "Standard" library with a single tool is not really a viable proposition; the developer needs as much feedback as possible from all MLS-compliant tools to develop something truly standard.
Once this process has converged, the library will be usable with all tools and will have all the information embedded within it to allow continuous testing of this property, which will be publicly demonstrated on the MA servers. I believe this would be a very strong demonstration of the power of the Modelica ecosystem.
Now, the crucial point to make this dream come true is that the additional effort for the library developer to enter this game will need to be as low as possible, otherwise this is simply not going to happen.
With my proposal, and with some reasonable heuristics to pick meaningful comparison variables absent their indication (e.g. only the state variables for dynamic models) one can run cross-tool comparisons of an open-source library with zero additional effort by the library developer. The idea is that the majority of tested models will be fine, and it will only be necessary to provide a few custom annotations for the critical models, e.g. selecting specific variables for testing or changing the verificationExperiment annotation.
BTW, this task is conceptually under the responsibility of the open-source library developer(s), who however may not have enough time or motivation to take care of it. What is nice is that other parties (i.e., the Modelica community) could help in this process by means of pull requests to the library code base that introduce such annotations where needed. These pull requests with a few added annotations can first be fed to the MA testing infrastructure, so that the library developer can see the result and accept the PRs with just one click of the mouse, if the results are fine. Which means, he or she can easily put in the judgement, without much effort. This is how we'll get things done. We could get students involved in this work, which would also promote Modelica with the younger generations.
IMHO this kind of process has a much higher chance of practical success than a process where we expect that Library Officers or open-source Library Developers (who are usually volunteers doing this in their spare time) have unlimited time and resources to set up elaborate testing libraries with ad-hoc developed test models, elaborate rules, multiple files, scripts and whatnot. Ideally, this may be the best option, but in practice, this is never going to happen. We need a viable path to achieve the dream I envisioned, and I believe that the verificationExperiment annotation, alongside annotations for selecting the reference variables, is a key feature for that.
My 2 cts as MAP-Lib leader. 😃
I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements.
I agree, except for the case of chaotic systems, for which there are theoretical reasons why even a very, very tight tolerance doesn't work in the long term, due to exponential divergence of very close trajectories.
The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.
"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?
Anyway, the problem here is that a solution with a 2% error, obtained from a model that has 10% uncertainty on key parameters (which is pretty normal for thermal systems) may be perfectly fine for all practical purposes, except automated verification, for which we have (rightfully) set a 0.2% relative tolerance. One thing is to get a result which is good enough for some application, another thing is to verify that two different tools give the same result if the numerical approximations are good enough.
Why should we unnecessarily use a tighter tolerance in the experiment annotation, hence longer simulation times, to stay within the CSV bounds, which have no consideration for the model uncertainty?
The fundamental issue addressed by this ticket is that the requirements for simulations are completely different whether you want to use the results to make decisions about a system being modelled, or you want to use the simulation result for cross-tool and regression checking. Different requirements lead to different simulation setups. Hence, different annotations to specify them.
I would partially agree with @beutlich's comment.
To me ModelicaTest is not only models testing each component once, but is intended as unit tests of components; so if there's a weird flag, there should be a test model for that. As always, coverage and testing go hand in hand.
It might be good if we also added some "integration tests" (in the testing meaning) - and to me that would fit naturally in ModelicaTest, but perhaps separated in some way. However, I understand that we might not have resources for that.
In contrast, the Examples models in MSL are primarily constructed as Examples demonstrating how to use (and sometimes not use) the library. Thus in one sense using them for (cross-tool-)testing is misusing them, but on the other hand we have them and they should work reliably (at least in general: see #2340 for an exception, and as discussed we also have chaotic systems as another exception) - so why not test them as well?
However, we should still keep the mind-set that Example-models are examples - not testing-models.
After thinking through this I think that one conclusion is that since the Examples aren't primarily for testing and they shouldn't be the only tests for the library, it should be ok to reduce the simulation time considerably for specific Example-models (due to chaotic behavior or too large results).
Whether such information is stored in the library or separately is a somewhat different issue. Both should work in principle, but I could see problems with getting agreement on the exact levels for different tools - and thus the need to add something outside of the library for specific cases and tools - which means that if we start inside the library we would have both.
Just wondering: Wouldn't it be more sensical to compare the distributions generated by chaotic systems instead of comparing data points sequentially within a short interval?
Just wondering: Wouldn't it be more sensical to compare the distributions generated by chaotic systems instead of comparing data points sequentially within a short interval?
Possibly, but in order to get a good reference for the distribution one would need to simulate a lot more. Clearly there are also entirely different ways of getting the distribution - but I think that is beyond this issue
I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements.
I agree, except for the case of chaotic systems, for which there are theoretical reasons why even a very, very tight tolerance doesn't work in the long term, due to exponential divergence of very close trajectories.
Great, I fully agree with you here!
The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.
"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?
Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).
The option to run tests both with the user-facing experiment settings, as well as with the same tolerance as the reference generation, would fulfill some of both our wishes. But then, how do you determine if the run with experiment settings was "successful"? You end up with the same question we started with.
Anyway, the problem here is that a solution with a 2% error, obtained from a model that has 10% uncertainty on key parameters (which is pretty normal for thermal systems) may be perfectly fine for all practical purposes, except automated verification, for which we have (rightfully) set a 0.2% relative tolerance.
If the expected behavior of the model with its experiment settings is to vary by 2% (or 10%), does it make more sense to set the relative tolerance when checking the result somewhere in that region then? After all, you are saying that the uncertainty is that large, so why are we not trying to match using a way (way way) sloppier check, instead of messing about testing something nobody will experience (special simulation settings for testing)?
The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.
"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?
Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).
I can see this point, and in that case the solution seems to be to not tighten the solver tolerance, but to reduce the stop-time or at least the time we compare up to and/or loosen the tolerance when comparing. With the current setup we can get the former by only running the reference up to an earlier point.
For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.
Since the Examples aren't the only tests (or even primarily tests) I don't see that as a problem.
However, I can also see that it might make sense to run the tests with a tighter tolerance. The idea is then that we consider three simulation results:
* Reference
* Tool result 1 (strict tolerance)
* Tool result 2 (default, less strict, tolerance)
I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".
If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).
However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.
The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.
"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?
Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).
I can see this point, and in that case the solution seems to be to not tighten the solver tolerance, but to reduce the stop-time or at least the time we compare up to and/or loosen the tolerance when comparing. With the current setup we can get the former by only running the reference up to an earlier point.
For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.
For chaotic systems, I agree something drastic like this is needed, and "test exactly what the user sees" is just not reasonable.
However, I can also see that it might make sense to run the tests with a tighter tolerance. The idea is then that we consider three simulation results:
* Reference
* Tool result 1 (strict tolerance)
* Tool result 2 (default, less strict, tolerance)
I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".
If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).
However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.
I could get behind something like this. I assume "Reference" would also be produced with "strict tolerance" in this case.
Again a short call for clarification: What is considered to be the “reference data generating process” in the absence of an exact mathematical function (i.e., solution)?
EDIT: Is it a “perfect” (i.e., reference) implementation which can produce arbitrarily precise (exact) data across all operating systems/implementation instances out there? (I had thought this to be an “expectation” value across a “bag of tools” …)
I fully agree with @casella's opinion, and I fully support his ideas about the testing environment. Just some small remarks: it seems clear to me that a small portion of the models need specific tighter settings for verification. Maybe it is possible to start these models twice:
For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.
In these cases, instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.
I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this
@henrikt-ma with cross-tool checking of the Buildings library, this happens in a significant number of cases, and it's really a nuisance if you want to make the library truly tool-independent with a high degree of dependability. @mwetter can confirm. The solution, so far, was to tighten the tolerance in the experiment annotation, e.g. to 1e-7, for all those cases. This is really not a clean solution, because those simulations become unnecessarily slower and, most importantly, the difference in the obtained results is small, much smaller than the modelling errors inherent in the simulation model, which of course has lots of modelling approximations. As I'll argue once more in the next comment, the requirements for human inspection are completely different from the requirements for automatic cross-tool checking.
Are we even talking about the same automatic settings? What I have in mind was described in https://github.com/modelica/ModelicaSpecification/issues/3473#issuecomment-1933650490. That is, reference results are by default generated with a tolerance 10 times smaller than experiment.Tolerance. This means that before you even get to cross-tool checking, the library developer has to ensure that experiment.Tolerance is set well enough to at least pass tests running with the tool at hand (probably coinciding with the tool used to generate the reference). Once the library and its reference results fulfill this basic requirement on the experiment.Tolerance setting, it becomes much more interesting to see how well it works in cross-tool checking.
If this procedure is thought to cause unnecessarily slow simulations, I would blame that on comparison tolerances being too tight in relation to the purpose of the model. Want more speed? Lower your expectations on correctness, and it should be possible to increase the Tolerance while still being able to match a reference generated with 10 times stricter tolerance. If comparison tolerances are extremely generous so that a very big Tolerance could be used, a toleranceFactor of 0.1 could be too big, and this would be one of those rare situations I have yet to see.
Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied
I see your point. If you want to see that, I guess we should use a much looser tolerance for the CSV-compare tool. Not 0.2%, but probably 5% or maybe more.
The problem is, this may lead to accepting results from tools that are actually doing it wrong, just because they don't do it too much and their error is lost among the tolerance for not-too-small numerical errors.
(although this can sometimes give very helpful information when debugging models that don't behave as expected).
My proposal was implicitly only aimed at this goal. Now that is clearer, thanks for pointing this out.
The idea is then that we consider three simulation results:
* Reference
* Tool result 1 (strict tolerance)
* Tool result 2 (default, less strict, tolerance)
I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".
If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).
However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.
After considering @maltelenz's arguments, I agree with this proposal by @HansOlsson.
In fact, experience shows that in most cases the results obtained with the experiment annotation Tolerance and StopTime will fall within the 0.2% default tolerance of the CSV-compare tool, so there won't be any need of extra annotations, nor to perform two separate verification simulations, we can go on as we're doing now. However, there will be a small (but non-negligible) number of cases where this doesn't happen.
In these cases we should add an extra annotation with tighter tolerance and possibly shorter StopTime, to handle chaotic systems. We would then run two verification simulations. The first uses the default tolerance (but the possibly shorter StopTime) and a more lenient CSV tolerance, e.g. 5%; this ensures that the user experience is not too bad, which is @maltelenz's main concern. The second uses the tighter tolerance and the possibly shorter StopTime, and uses the original 0.2% CSV-compare tolerance; this ensures that once the numerical errors are drastically reduced, the result is very close to the reference and the tool is not acting weird. Both verification tests should pass.
In the first case (only the old-fashioned experiment annotation is set) the reference trajectory will be generated with a default tolerance that is 10 times tighter than the one specified by the experiment annotation. In the second case, it will be generated with the tighter tolerance specified by the new annotation.
At this point I guess having a separate verificationExperiment annotation is not really a good idea. Maybe we could just add three other fields to the existing experiment annotation, with appropriate defaults:
* StopTimeVerification = StopTime, to be used both for reference generation and verification simulations
* ToleranceReference = 0.1*Tolerance, to be used for reference generation
* ToleranceVerification = Tolerance, to be used for verification simulations
In the 95% non-problematic cases, we can keep the old-fashioned experiment annotations and rely on the default values both to generate the reference results and to run verification simulations. In those few cases where this causes problems, the library officer/developer can add those three values to the experiment annotation, to specify how long the verification simulation should be, what tolerance to use to generate the reference results, and what tolerance to use when generating verification results for the 0.2% CSV-compare test. The 5% CSV-compare tests will be run with the regular experiment annotation Tolerance, but only until StopTimeVerification, because chaotic trajectories will eventually deviate as much as 100% or more, given enough time.
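As a concrete illustration of what an overridden, problematic case could look like under this proposal (the numbers are made up):

```modelica
annotation (experiment(
  StopTime = 3600,                // normal user-facing experiment
  Tolerance = 1e-6,               // used for the lenient (e.g. 5%) CSV-compare run
  StopTimeVerification = 1200,    // shorter horizon, e.g. for a chaotic model
  ToleranceReference = 1e-8,      // overrides the default 0.1*Tolerance for reference generation
  ToleranceVerification = 1e-8)); // used for the strict 0.2% CSV-compare run
```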
Is there some consensus on this approach? If so, I can open a new ticket where we can refine the details.
Again a short call for clarification: What is considered to be the “reference data generating process” in the absence of an exact mathematical function (i.e., solution)?
The reference data are simulation results that the modeller judges to be good enough to be used as a reference, based on all information and knowledge available to him/her. Of course this requires some kind of trust that the used simulation tool is actually doing the right things. From this point of view, it would be nice for the modeller to be able to see what kind of reference results would be generated by different tools and compare them (see my proposal for an MA CI above). Although mathematical correctness cannot be judged with democratic criteria, if I generate such trajectories with 6 different tools and 5 are very close to each other while one isn't, there is a strong clue (though not rigorous proof) that the outlier is due to a tool bug, while the other ones are all reliable. The modeller would then pick one among those five (which one doesn't really matter) as the reference. Of course, this only holds if those five make sense from a physical point of view; one should never blindly accept the results of a simulation.
EDIT: Is it a “perfect” (i.e., reference) implementation which can produce arbitrarily precise (exact) data across all operating systems/implementation instances out there? (I had thought this to be an “expectation” value across a “bag of tools” …)
I wish I could rely on such an implementation, but I don't think we have one now, nor will we have one in the foreseeable future, that can fully cover the spectrum of models that can be written in Modelica. So, see above for my pragmatic proposal.
Instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.
This can be done in some cases, but not always; also, it may be non-trivial to implement. Anyway, in those cases where the modeller can and wants to do this, he or she can add suitable assert() statements to the model, while providing an empty list of variables for reference trajectory generation. This is already possible now, without the need to change the Modelica language specification.
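For what it's worth, a minimal self-contained sketch of that idea; the model and the tolerance band are purely illustrative:

```modelica
model EnergyInvariantTest "Sketch: check an invariant with assert() instead of comparing trajectories"
  Real x(start = 1, fixed = true) "Position";
  Real v(start = 0, fixed = true) "Velocity";
  Real E = 0.5*v^2 + 0.5*x^2 "Total energy; analytically constant at 0.5";
equation
  der(x) = v;
  der(v) = -x;
  // Fail the run if numerical drift of the invariant becomes too large;
  // the 1e-3 band is illustrative, not a recommendation.
  assert(abs(E - 0.5) < 1e-3, "Energy invariant violated beyond tolerance");
  annotation (experiment(StopTime = 100, Tolerance = 1e-6));
end EnergyInvariantTest;
```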
At this point I guess having a separate verificationExperiment annotation is not really a good idea. Maybe we could just add three other fields to the existing experiment annotation, with appropriate defaults:
Again: Why not reuse TestCase?
The most recent proposal by @casella seems like a good direction.
We should use TestCase for these new values though, as I mentioned earlier and as @henrikt-ma reiterates in the previous comment.
StopTimeVerification = StopTime, to be used both for reference generation and verification simulations
If placed inside the TestCase annotation, it could be called just StopTime. Alternatively, I'd suggest ReferenceStopTime as it is the stop time of the reference.
ToleranceReference = 0.1*Tolerance, to be used for reference generation
The point of having a (Reference)ToleranceFactor = 0.1 instead is that if you override the default and then, much later, decide to use another Tolerance, you avoid the risk of suddenly having the same tolerance for reference generation as for verification, hence preserving the sanity check that Tolerance is reasonably set.
ToleranceVerification = Tolerance, to be used for verification simulations
I think this is a bad idea, as it only opens up for having inadequate Tolerance set in models. Instead, there should be ways of adjusting comparison tolerances so that the Tolerance set in the model can pass verification even when results are a bit (but not too much) sloppy for simulation performance reasons.
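Put together, the TestCase-based shape being discussed here would look something like this; names and defaults are still under discussion, so this is purely a sketch:

```modelica
annotation (
  experiment(StopTime = 3600, Tolerance = 1e-6),  // what the user runs
  TestCase(
    ReferenceStopTime = 1200,           // or simply StopTime inside TestCase
    ReferenceToleranceFactor = 0.1));   // reference generated with 0.1*Tolerance
```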
Instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.
This can be done in some cases, but not always; also, it may be non-trivial to implement. Anyway, in those cases where the modeller can and wants to do this, he or she can add suitable assert() statements to the model, while providing an empty list of variables for reference trajectory generation. This is already possible now, without the need to change the Modelica language specification.
Agreed! I merely wanted to point this alternative approach out as a way out of the "chaotic behaviour vs. reference trajectories" conundrum. Sorry for not making this clearer from the outset.
I also need to mention an elephant in the room before we fall too deeply in love with Tolerance and related properties such as ReferenceToleranceFactor:
… and the relative integration tolerance Tolerance for simulation experiments
That is, we are discussing the control of local error for some integration method (with adaptive step size) that hasn't been defined. Hence, the global error – which is what matters for verification of simulation results – will generally be bigger with an increased number of steps taken. To avoid this arbitrary relation to the integration method chosen by the tool, it would make way more sense to set an ErrorGrowth goal for local error per step length, but I am afraid that our community is stuck with integration methods and step size control that don't support such a mode of operation.
This is an area where I hope that the modern and flexible DifferentialEquations.jl and the active community around it could benefit us all; that if we just keep nagging @ChrisRackauckas about this opportunity to lead the way, they will eventually set a new standard for error control that the rest of us will have to catch up with.
It is not that simple - tolerance can be local error per step or per step length (normalized or not).
However, looking at global error has two issues:
However, I don't see that such a discussion is relevant here.
However, I don't see that such a discussion is relevant here.
My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.
My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.
@henrikt-ma my expectations are based on several years of experience trying to match Dymola-generated reference values with OpenModelica simulations on several libraries, most notably the MSL and Buildings, which contain a wide range of diverse models (mechanical, thermal, electrical, thermo-fluid, etc.). The outcome of that experience is that the current set-up works nicely in 95% of the cases, but then we always get corner cases that need to be handled. If we don't, we end up with an improper dependence of the success of the verification process on the tool that actually generated the reference data, which is something we must absolutely get rid of if we want the core statement "Modelica is a tool-independent modelling language" to actually mean something.
I am convinced that my latest proposal will enable us to handle all these corner cases nicely. Of course this is only based on good judgement, I have no formal proof of that, so I may be wrong, but I guess we should try to make a step forward, as the current situation is not really tenable for MAP-Lib. If this doesn't work, we can always change it; the MLS is not etched in stone.
As to the meaning of Tolerance, the MLS Sect. 18.4 defines it as:
the default relative integration tolerance (Tolerance) for simulation experiments to be carried out
which is a bit (I guess deliberately) vague, but widely understood as the relative tolerance on the local estimation error, as estimated by the used variable-step-size solver. In practice, I understand that parameter is just passed on to the relative tolerance parameter of the integration routines.
The point of this parameter is not to be used quantitatively, but just to provide a knob that you can turn to get more or less precise time integration of the differential equations. In most practical cases, experience has shown that 1e-4 is definitely too sloppy for cross-tool verification, 1e-6 gives satisfactory results in most cases, but sometimes you need to further tighten it by 1 to 3 orders of magnitude to keep numerical errors from playing too big a role. That's it 😅.
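As a concrete illustration (the model name and values below are invented for this sketch, not taken from any library), this is how that knob appears in the standard experiment annotation:

```modelica
model PumpingSystemTest "Hypothetical example model"
  // ... model equations would go here ...
  annotation (experiment(
      StartTime = 0,      // s
      StopTime = 2000,    // s
      Tolerance = 1e-6,   // relative integration tolerance handed to the solver
      Interval = 2));     // communication (output) interval, s
end PumpingSystemTest;
```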
StopTimeVerification = StopTime to be used both for reference generation and verification simulations

If placed inside the TestCase annotation, it could be called just StopTime.
Sounds good. KISS 😃
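A minimal sketch of what that could look like, assuming the proposed StopTime argument inside the existing TestCase annotation were adopted (it is not part of the MLS today); the model and numbers are illustrative only:

```modelica
model ChaoticDoublePendulumDemo "Hypothetical example model"
  // ... model equations would go here ...
  annotation (
    experiment(StopTime = 100, Tolerance = 1e-6),  // settings seen by the end user
    TestCase(StopTime = 10));                      // proposed: shorter horizon for
                                                   // reference generation and verification
end ChaoticDoublePendulumDemo;
```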
ToleranceReference = 0.1*Tolerance to be used for reference generation

The point of having a (Reference)ToleranceFactor = 0.1 instead is that if you override the default and then, much later, decide to use another Tolerance, you avoid the risk of suddenly having the same tolerance for reference generation as for verification, hence preserving the sanity check that Tolerance is reasonably set.
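For comparison, here is a minimal sketch of the two alternatives discussed above; both ToleranceReference and ReferenceToleranceFactor are proposals from this thread, not existing MLS annotations, and the model is hypothetical:

```modelica
model BoilerStartupTest "Hypothetical example model"
  // ... model equations would go here ...
  annotation (
    experiment(StopTime = 1000, Tolerance = 1e-6),
    // Alternative 1 (absolute): fix the reference-generation tolerance directly, e.g.
    //   TestCase(ToleranceReference = 1e-7)
    // Alternative 2 (relative): tie it to Tolerance, so that later tightening Tolerance
    // to 1e-7 automatically moves the reference run to 1e-8, preserving the sanity
    // margin between verification and reference runs:
    TestCase(ReferenceToleranceFactor = 0.1));
end BoilerStartupTest;
```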
I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance = 1e-7. But if we need to tighten it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. This has happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference must necessarily be generated with a tighter tolerance than the verification simulation, if the latter is tight enough. Do I miss something?
That was the reason for specifying the tolerance for reference generation directly, instead of doing it with some factor. At the end of the day, I guess it doesn't matter much and it's mostly a matter of taste; in fact, GUIs could take care of this aspect, e.g. by letting the user give the number of intervals, which is then translated into the Interval annotation by computing (StopTime - StartTime)/numberOfIntervals.
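For example (numbers invented for illustration), a GUI asked for 1000 output points over a run from 0 s to 2000 s would store:

```modelica
model TankLevelTest "Hypothetical example model"
  // ... model equations would go here ...
  // Interval = (StopTime - StartTime)/numberOfIntervals = (2000 - 0)/1000 = 2 s
  annotation (experiment(StartTime = 0, StopTime = 2000, Interval = 2));
end TankLevelTest;
```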
ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens the door to having an inadequate Tolerance set in models.
This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which happens 95% of the time), the verification passes. Otherwise, we need to override the default and try a stricter one. Do I miss something?
I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance = 1e-7. But if we need to tighten it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. This has happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference must necessarily be generated with a tighter tolerance than the verification simulation, if the latter is tight enough. Do I miss something?
Yes; if the library developer can't even make the simulation results match a reference generated with a stricter tolerance, I would argue that the reference shouldn't be trusted. In my experience, it doesn't need to be 100 times smaller than Tolerance to serve as a sanity check; 10 times seems to be enough of a difference. I can also imagine that a factor of 0.01 might too often give solvers problems with too tight a tolerance, which is why I'm suggesting 0.1 as the default. Before a default is decided upon, however, I suggest we try some different numbers to see in how many cases the default would need to be overridden; if we can manage that number with a default of 0.01, I'd be in favor of that, since it gives us an even stronger indication of the quality of the reference results in the default cases.
I'd be very sceptical about using a ReferenceToleranceFactor above 0.1 for any test, as it would show that the test is overly sensitive to the Tolerance setting. For instance, if another tool uses a different solver, one couldn't expect the Tolerance in the model to work.
ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens the door to having an inadequate Tolerance set in models.

This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which happens 95% of the time), the verification passes. Otherwise, we need to override the default and try a stricter one. Do I miss something?
That it only looks like a poor workaround for not being able to specify a more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.
For the record, here are some required tolerance changes for Buildings in order to avoid cross-tool verification issues: PR lbl-srg/modelica-buildings#3867
That it only looks like a poor workaround for not being able to specify a more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.
The problem is, we don't only want to see that. We also want to make sure that if you tighten the tolerance, you get closer to the "right" result, regardless of what the user experience is.
Modelica models include an experiment annotation that defines the time span, tolerance and communication interval for a default simulation of the system. Usually, these parameters are set in order to get meaningful results from the point of view of the modeller. Since in many cases the models are affected by significant parametric uncertainty and modelling assumptions/approximations, it typically makes little sense to seek very high precision, say rtol = 1e-8, at the cost of longer simulation times, when the results are affected by maybe 1% or more of error.

We are currently using these values also to generate reference results and to run simulations whose results are compared to them, both for regression testing and for cross-tool testing. This is unfortunately not a good idea, mainly for two reasons:
For testing, what we need is to select simulation parameters which somehow guarantee that the numerical solution obtained is so close to the exact one that numerical errors cannot lead to false negatives, so that a verification failure really means that something has changed in the model or in the way it was handled to generate simulation code.
In both cases, what we need is to clearly differentiate between the role of the experiment annotation, which is to produce meaningful results for the end-user of the model, and that of some new annotation, which is meant to produce near-exact results for comparisons.
For case 1., what one typically needs is to provide a tighter tolerance, and possibly a shorter communication interval. How much tighter and shorter depends on the specific case, and possibly also on the set of tools involved in the comparison; there is no one-size-fits-all number.
For case 2., it is necessary to choose a much shorter simulation interval (smaller StopTime), so that the effects of chaotic motion or the accumulated drift of event times don't have enough time to unfold significantly. Again, how much shorter depends on the participants in the game, and may require some adaptation.
For this purpose, I would propose to introduce a verificationExperiment annotation, with exactly the same arguments as the experiment annotation, to be used for generating results for verification. Of course, if some arguments (e.g. StartTime) are not supplied, or if the annotation is missing altogether, the corresponding values of the experiment annotation (or their defaults) will be used instead.
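A minimal sketch of this proposal, assuming the verificationExperiment annotation were adopted with the semantics described above (it is not part of the MLS; model and numbers are illustrative only):

```modelica
model DistrictHeatingNetworkTest "Hypothetical example model"
  // ... model equations would go here ...
  annotation (
    // End-user settings: meaningful results at reasonable simulation cost
    experiment(StartTime = 0, StopTime = 86400, Tolerance = 1e-6, Interval = 60),
    // Proposed: settings used only for generating and comparing verification results;
    // unspecified arguments (here StartTime) fall back to the experiment values
    verificationExperiment(StopTime = 3600, Tolerance = 1e-8, Interval = 6));
end DistrictHeatingNetworkTest;
```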