Define a verificationExperiment annotation for experiments meant to compare results for regression testing and cross-tool comparisons

modelica / ModelicaSpecification

Specification of the Modelica Language

https://specification.modelica.org

Creative Commons Attribution Share Alike 4.0 International

104 stars 40 forks source link

Define a verificationExperiment annotation for experiments meant to compare results for regression testing and cross-tool comparisons #3473

Open casella opened 9 months ago

casella commented 9 months ago

Modelica models include an experiment annotation that defines the time span, tolerance and communication interval for a default simulation of the system. Usually, these parameters are set in order to get meaningful results from the point of view of the modeller. Since in many cases the models are affected by significant parametric uncertainty and modelling assumptions/approximations, it typically makes little sense to seek very high precision, say rtol = 1e-8, resulting in longer simulation times, when the results are affected by maybe 1% or more error.

We are currently using these values also to generate reference results and to run simulation whose result are compared to them, both for regression testing and for cross-tool testing. This is unfortunately not a good idea, mainly for two reasons:

in some cases, the numerical errors exceed the tolerance of the CSV-compare tool, so we get a lot of false negatives because different tools, or different versions of the same tool, or even the same tool on different hw/ws platforms (see, e.g., OpenModelica/OpenModelica#11935) lead to different numerical errors/approximations;
some other cases feature chaotic motion (e.g. the Furuta Pendulum or three-body problems) or large amounts of closely-spaced state events (e.g. all kind of switched circuit models) whose triggering time inevitably tend to drift apart due to accumulating errors in determining the exact switching times.

For testing, what we need is to select simulation parameters which somehow guarantee that the numerical solution obtained is so close to the exact one that numerical errors cannot lead to false negatives, so that a verification failure really means something has changed with the model or the way it was handled to generate simulation code.

In both cases, what we need is to clearly differentiate between the role of the experiment annotation, which is to produce meaningful results for the end-user of the model, and of some new annotation, which is meant to produce near-exact results for comparisons.

For case 1., what one typically needs is to provide a tighter tolerance, and possibly a shorter communication interval. How much tighter and shorter, it depends on the specific case, and possibly also on the set of tools involved in the comparison - there is no one-fits-all number.

For case 2., it is necessary to choose a much shorter simulation interval (smaller StopTime), so that the effects of chaotic motion or the accumulated drift of event times don't have enough time to unfurl significantly. Again, how much shorter, it depends on the participant to the game, and may require some adaptation.

For this purpose, I would propose to introduce the verificationExperiment annotation, with exactly the same arguments as the experiment annotation, to be used for the purpose of generating results for verification. Of course, if some arguments (e.g. StartTime) are not supplied, or if the annotation is outright missing, the corresponding values of experiment annotation (or their defaults) will be used instead.

HansOlsson commented 9 months ago

I agree that such information is useful, and ideally it should only be needed for a fraction of the models.

I believe that what we have used for Dymola is instead of modifying the package that information is available externally - and also to loosen the tolerance so that we don't get false positives; but I will check.

As I can see that has a number of advantages:

No need to push it up-stream to library developers.
No need to decide which tool has the correct idea about such tolerances; as it is available externally from the library.

henrikt-ma commented 9 months ago

I'm not 100% happy with this direction, as it means that the default will be to use the settings from the experiment-annotation, meaning that we don't automatically detect that decreasing tolerance by an order of magnitude has significant impact on the result.

I see good reasons to generate reference results according to a procedure where tolerance is automatically set an order of magnitude lower than that in the experiment, so that CI correctness tests running with the experiment settings can detect when the experiment is inappropriate.

I can imagine there could rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this. Similar to what @HansOlsson's says regarding Dymola, we at Wolfram also keep information for overriding the experiment in files that are separate from the Modelica code, with the ability to both control reference result generation and settings for running tests (used in rare situations as an alternative to using very sloppy comparison tolerances when the experiment settings are considered really poor).

A related topic is the use of different StopTime for regression testing and for model demonstration, but this might be better handled with a TestStopTime inside the normal experiment.

christoff-buerger commented 9 months ago

I honestly would not introduce any new annotations etc to the language. In my opinion Modelica already has all the means one needs to achieve this, e.g., inheritance and modification. What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

For example, you can have basic user experiments in your library (i.e., examples) and in a separate regression test library refine these experiments with different setting (change tolerances, stop times whatever). In this setup, the regression testing setup is not part of the final library distribution, as it likely should be (because you might add a bunch of scripts and other software for a continuous integration support resulting in huge dependencies to make your automatic tests run). You just don't want to ship all of that. And as @henrikt-ma mentions, good pratice at the tool vendors is already to have separate settings for this.

I think, testing is a matter of project policies and templates. MSL can come up with a good scheme for its use case and improve its infrastructure without any need for language changes.

casella commented 9 months ago

I respectfully disagree with @henrikt-ma and @christoff-buerger:

the choice of specific simulation parameters for regression testing is the ultimately the responsibility of the library developer, not of the tool vendors. I understand this is only necessary for a minority of cases (maybe 2-5%, most cases work fine with default settings), but yet if one aims at seeing 100% verification, this will be necessary)
tightening the tolerance is not enough for examples that show chaotic motion or have large number of events. In those cases, the only hope to match the results across tools is to keep the simulation short. On the other hand, if one wants to demonstrate chaotic motion, of course the experiment annotation must be long enough to actually show that. So, there are clearly conflicting requirements for the experiment annotation and for the verificationExperiment annotation, hence the need for two separate annotations
regarding the MSL, it has been common practice for 25 years to use all runnable examples of the Modelica library for regression testing. If we had infinite resources, we could stop doing that and develop another library with 500+ test cases that are explicitly set up for verification and regression testing only. Unfortunately, we don't have those resources, MAP-Lib is not a for-profit tool vendor. I believe adding a verificationExperiment annotation to the few dozens of test cases that really need them would be much more practical way to achieve the same goal
if we want (as we plan to) to extend verification and qualification services by the MA to all open-source libraries, not only the MSL, the lack of infinite resource issue would become even more important
all this of course does not prevent tool vendors to perform whatever kind of regression testing they want with their own tools. Tool vendors are still free to check what happens if the experiment annotation tolerance is reduced by a factor 10. The aim here is for library developers to have a means to declare what is a meaningful experiment for cross-tool verification, in a way that is as much as possible tool-independent. Once again, this not only concerns the tolerance, but all the simulation parameters: StartTime, StopTime, Interval, an Tolerance. And possibly more that we still don't have, e.g. suggesting specific solvers that have certain stability regions, which is necessary in some advanced applications

Regarding this comment

What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

it is true that each MA project has its own responsibility, but the whole point of the MA is to provide coordinated standards and libraries. In this case, I clearly see the need for a small language change to support the work by MAP-Lib. BTW, we are talking about introducing one new annotation, which is no big deal. We could actually introduce a vendor annotation __MAP_Lib_verificationExperiment, but this really seems overkill to me, can't we just have it as a standard Modelica annotation?

@GallLeo, @beutlich, @dietmarw, @AHaumer, @hubertus65, I'd like to hear your opinion.

GallLeo commented 9 months ago

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the firs time. As soon, as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool-testing. For MSL testing, many implicit rules are used. They will not be adopted by independet library developers.

I had a look at the "heuristics" proposed 10 years ago, when I started creating the first set of reference results: https://raw.githubusercontent.com/wiki/modelica/ModelicaStandardLibrary/MSLRegressionTesting.pdf#page=11 Takeaway: These rules worked quite fine for most test cases (at least much better as if we had used tolerance 1e-4 everywhere). Back then, I was not able to use a tight tolerance everywhere, because some models would only simulate with their default tolerance. Are these models to be considered "bad models"? Or should we be able to test any working model?

After the 4.0.0 release, @beutlich documented his adaption of tolerances in the MSL wiki: https://github.com/modelica/ModelicaStandardLibrary/wiki/Regression-testing-of-Modelica-and-ModelicaTest Takeaway: Finding the "right" strict tolerance for MSL examples still needs manual work.

So, in order to make it transparent, I'm in favor of explicitly specifying verification settings. But, then we have to limit the manual work for setting these up as much as possible: If a example/test case is changed, should the verification setting be updated automatically? Example: If you change stop time of an example from 1000 s to 2000 s, interesting dynamics could be missed, if you keep your verification stop time ate 1000 s.

Where to specifiy the explicit verfication settings?

@christoff-buerger proposed using extends. We could do that in ModelicaTest, without manually writing a new library: ModelicaTest.Modelica...MyExample extends Modelica...MyExample annotation(experiment(...)); Benefit: All MSL-test cases would reside in ModelicaTest (no need to run two libraries) Drawback: CI or library developer has to generate the extends-test-case in ModelicaTest. If the test case for a new example is not generated, it will be missed in regression runs. If the experiment setup of an example is changed, it will not be automatically updated in the test case.

@HansOlsson proposed storing the verifications setup externally. This might work, as long as you test in one Modelica tool and this Modelica tool cares about the verification setup. I'm unsure, how to handle it in multi-tool environments. We already have the burden of updating Resources/Reference/Modelica/.../MyExample/comparisonSignals.txt

Seperate files are very hard to keep up to date, therefore storing verification settings in the example/test case seems the be right place to me. Especially, if we think about using signals from figures annotation as important comparison signals.

I would add arguments to experiment instead of new verificationExperiment. Reasons:

It's easy to keep sync of the default experiment settings and the verification experiment settings.
There are tools adding vendor-specific parts to the experiment annotation in order to "nail down" the settings for a specific tool (e.g. __Dymola_Algorithm="Cvode"). The used integration algorithm is very important, as soon as you are not only testing for correctness ("exact solution" of the DAE). Most people also test for run time of their test cases. Drawback: Complicated 'experiment' annotations in the MSL release distract new users.

beutlich commented 9 months ago

I am pretty much aligned with @GallLeo here.

Example models

Using example models simulation results as reference results for cross-tool, cross-solver or cross-MSL-version regressing testing can be considered as a misuse. In strict consequence there should be no directory in Modelica/Resources/Reference/ at all.

Reference models

Reference models should be taken from ModelicaTest. If example models from Modelica shall be also considered for regression testing, it might be an option to extend from it and adopt the experiment settings. This would also simplify the job of the regression tester since only one library needs to be taken into account.

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

Specify reference signals outside the model in ModelicaTest/Resources/Reference/ModelicaTest/ and take solver settings from the experiment annotation. This is status-quo (for models in ModelicaTest). It is error-prone in the sense that signal files can be missing or wrong (duplicated or invalid signal names).
Specify both reference signals and solver settings in the model. Yes, that could distract MSL users, even if only in ModelicaTest.
Specify both reference signals and solver settings outside the model. The only requirement we have is that it should be a machine and human redable text-based file format (say TXT (as now for the comparisonSignals.txt) or some more structured YAML with comments and syntax highlighting). The main advantage I see is, that I do not need a Modelica parser or editor to obtain and modify these settings, it's all there in a specified file/directory location which can be controlled from the test runner engine and easily format-tested by linters/checkers. Of course, it again is error-prone in the sense of missing files or wrong signals. (You might compare it to the Modelica localization feature where we also keep the localization outside the library iself in a dedicated directory location.)

Summary

I am in favour to keep reference models outside Modelica and only use ModelicaTest for it. I am in favour to not have reference signals and experiment settings distributed in various locations, i.e., option 2 or 3 are my preferences. I even see more advantage in option 3. In that case it is left to specify (if at all)

an optional annotation where to find these regression settings files
how to name them
the file format and structure

I still need to be convinced that we need to have this in the specification or if it is simply up to the library developers of ModelicaTest. (Remember, it is now all in ModelicaTest and not any more in the MSL.)

dietmarw commented 9 months ago

I would also be in favour of having the verification settings separate from the simulation model but referenced from within the simulation model. So basically @beutlich's option 3.

HansOlsson commented 9 months ago

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the firs time. As soon, as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool-testing. For MSL testing, many implicit rules are used. They will not be adopted by independet library developers.

I agree that many of those rules will not be adopted by all libraries.

Additionally to me use-cases such as this is one of the reasons we want the possibility to have the information completely outside of the library; so that the user of the library can add the things needed for testing - without changing the up-stream-library.

AHaumer commented 9 months ago

IMHO @casella is right with his analysis and his idea. I'm not fond of storing additional to the model the comparisonSignals.txt and separate settings for comparison / regression tests. It is much better to store this information (stop time, interval length, tolerance, what else?) within the model, either a second annotation or as part of the experiment annotation. If not present, the setting of the experiment annotation are valid (in most cases). Additionally we could specify the tolerance for comparison. The model developer decides whether changes for comparison / regression tests are necessary or not, maybe after receiving a message that comparison / regression tests are problematic (could be from tool vendors). As this annotation would be supported by most tools, this could be adopted by all third-party libraries. Regarding test examples: I think it's a good idea to use all "normal" examples that are intended to demonstrate "normal" usage, and additionally "weird" examples from Modelica.Test.

maltelenz commented 9 months ago

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.

HansOlsson commented 9 months ago

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

I agree with testing what you ship (as part of testing; you can add more) and and even this case involves two different things:

Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash
Comparing it with the reference (will be outside the bounds if using original end-point)

Note that there can be other reasons than chaos for separating testing from running, e.g.:

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.) I don't know why the StopTime is 1.01 s instead of 1 s.

maltelenz commented 9 months ago

I agree with testing what you ship (as part of testing; you can add more) and and even this case involves two different things:
* Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash

* Comparing it with the reference (will be outside the bounds if using original end-point)

Thank you for clarifying this. I agree the test (in a perfect world) in this case should include both these steps.

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.)

As a user of the model, I would not want to see the weird under-sampling of the engineTorque either. How am I supposed to know if it is the actual behavior?

The file size issue for testing can be dealt with by the tool only storing the variables you need for the comparison. For the user, if we introduce figures in the model, tools could be smarter there as well and only store (by default) what you need for plotting (and animation).

henrikt-ma commented 9 months ago

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

At Wolfram, where we have somewhat extensive use of figures in the models, we have the default policy to simply use all variables appearing in figures as comparison signals (@GallLeo already briefly mentioned this idea above). It has several advantages:

No separate comparisonSignals.txt files to keep track of.
What is visible in the figures is much more likely to have been verified by the engineer behind the model.
What is visible in the figures has little risk of being ill-defined internal helper variables that could be problematic for correctness testing.
Variables that are interesting for comparison, but which don't fit into any of the key figures of a model can be placed in figures under a suitably named Figure.group (Regression Testing, for instance).

One small limitation of the approach is that there is no way to say that a variable in a figure should be excluded from correctness testing. I'm not aware of any model where we have needed this, but still… As @maltelenz suggested, the TestCase-annotation could be the perfect place for this sort of information. For example:

annotation(TestCase(excludeVariables = {my.first.var, my.arr[2].y}));

henrikt-ma commented 9 months ago

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I think it is well understood that there isn't one Tolerance that will work for all models. What I'm afraid is less well understood is that it is very hard to detect this when reference results are generated with the same tolerance which is used when running a test. To avoid this, I'd like something like the following:

annotation(TestCase(experiment(toleranceFactor = 0.01)));

The default for toleranceFactor should be something like 0.1, and I would expect that there would rarely be a reason to override the default; when there is a correctness error due to tolerance settings, one would fix this by adjusting the Tolerance, not by using a toleranceFactor closer to 1.

mwetter commented 5 months ago

I am also in favor of adding a new annotation, or an entry to the experiment annotation, that allows specifying data needed to produce result files for cross comparison. Our current setup is to simply increase the tolerance by a factor of 10, but such a global policy turns out to be unsatisfactory. In my opinion, it is the modelers responsibility to add this information. Most of our contributors are not "professional" developers but mainly researchers or student who use the library and occasionally develop contributions so asking them to put all information inside the .mo file would be preferred.

Therefore, having model-specific specifications for the CI tests would be valuable. These specification should include

what tolerance to use for normal CI testing as a user would run
what tolerance to use for tool cross-checking,
what trajectories to test for in normal CI testing
what trajectories to exclude for results tool cross-checking if very noisy
what change in StopTime for the results tool cross-checking if a model is chaotic
this could be augmented with (vendor-specific?) entries for specific tools, for example to not test a certain model in tool X because the tool does not yet work for that model, or to increase the tolerance only for tool X. These are information that are needed in our tests.

Currently we have (some of) these information spread across different files for about 1700 test models. This is due to historical reasons. It would be good to refactor this as the current situation makes it hard for new developers to figure out what is to be specified where, and it makes it harder to explain how to port changes across maintenance branches. Having these inside the .mo annotation is my preferred way.

HansOlsson commented 2 months ago

This seems quite important, we can mention at the next phone meeting - but we will work a lot on it outside of those meetings as well.

henrikt-ma commented 2 months ago

This seems quite important, we can mention at the next phone meeting - but we will work a lot on it outside of those meetings as well.

Please just make sure to keep the language group in the loop. Is it the MAB-Lib monthly meeting one should participate in to engage in the work, or some other forum?

casella commented 2 months ago

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

Conceptually speaking, it's both things, as also mentioned by @mwetter in his comment below. Whether or not we want to put the information about how to create the reference data in the same annotation or in a different annotation it's just a matter of taste but does not change the fundamental requirement (for me), i.e., that all the relevant information for testing a model should be embedded in the model through annotations, not spread over other files. We already have this concept for documentation, I don't really see why testing should be different.

Regarding storing the names of the variables to be used for automated comparisons (possibly using regexp to keep them compact) I also think they should have a place in an annotation, rather than in a separate file, for the same reason the documentation is embedded in the model and not stored elsewhere. I agree with @henrikt-ma that we should make good use of the information provided by the figures annotation by default, because in many cases variables that you want the user to see are also good to check if the simulation results are correct. But these two requirements are not necessarily identical (see my next comment), so there should be means to declare that some plotted variables should not be used for testing, or that some more variables should be tested that are not plotted.

I can imagine there could rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this

@henrikt-ma with cross-tool checking of the Buildings library, this happens in a significant number of cases, and it's really a nuisance if you want to make the library truly tool-independent with a high-degree of dependability. @mwetter can confirm. The solution, so far, was to tighten the tolerance in the experiment annotation, e.g. to 1e-7 for all those cases. This is really not a clean solution, because those simulations become unnecessarily slower and, most importantly, the difference in the obtained results is small, much smaller than the modelling errors inherent in the simulation model, which of course has lots of modelling approximations. As I'll argue once more in the next comment, the requirements for human inspection are completely different from the requirments for automatic cross-tool checking.

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.

casella commented 2 months ago

I re-read all the comments, and I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest, according to the other instead, the information should be embedded in the models themselves. Let me add some extra considerations about that, which I hope can further clarify the issue.

The goal of libraries such as ModelicaTest is not (necessarily) to run cross-tool comparisons. It is (ideally) to test each reusable component in the library at least once, and each modelling option of such components at least once, using test models that are as simple as possible and whose outcome is somewhat predictable (ideally, a closed-form analytic solution should be available). In other words, the goal of ModelicaTest is to demonstrate that the component models are correct and do what they are expected to do. From this point of view, the requirement that we have for the MSL is that if you develop a new component, you should also add tests in ModelicaTest that demonstrate its correct implementation. Conversely, the goal of the Examples sub-package is to demonstrate the use of the library to model real systems. It is usually simple ones, but yet it is complete systems, not test benches for individual components.
From this point of view, I respectfully but completely disagree with @beutlich's statement "Using example models simulation results as reference results for cross-tool, cross-solver or cross-MSL-version regressing testing can be considered as a misuse". Examples of real systems are the actual use cases where it is of the utmost importance that different tools produce the same result. Even more than the output of the component tests.
IMHO we have two completely orthogonal partitions of test cases. On one axis, we have models that test individual component (contained in ModelicaTest) vs. models demonstrating complete systems (contained in Modelica.XXX.Examples). On the other axis we have simulations meant to be inspected by humans to understand the behaviour of a component or system vs. simulations meant to automatically compare the output of different tools to catch regressions or tool implementation bugs. The first axis is handled by different libraries, the second can be handled by different experiment annotations. Unless we want to have four libraries, one for each combination of these aspects, which doesn't really seem a reasonable proposition to me.
As I already mentioned in my initial post, the requirements for system simulations to be inspected by humans, e.g. for design purposes, are fundamentally different from the requirements for system simulations meant for automatic cross-check and regression testing. On one hand, the former need not be super-accurate, because all models are approximated, so there is no point running very slow simulations with Tolerance = 1e-8 if the model has a 10% uncertainty; additionally, they may legitimately show behaviour, such as chaotic motion (the double pendulum) or extremely large number of oscillations (the EngineV6 model) or event-triggered commutations (AC/DC converter models), that are interesting per se but hard to reproduce exactly across tools. On the other hand, if you want to do regression or cross-tool checking, you need to run simulations which are as accurate as humanly (machinely?) possible, so that failures in verification can be attributed with certainty to tool issues and not to numerical errors; this requirement is completely different from the requirement of getting accurate enough simulations given the approximations of the model. Notice that this conceptually holds both for system examples (as in Modelica.XXX.Examples) and component tests (as in ModelicaTest).

Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice, if the tests run with some special configuration.

I agree with that, with a few important remarks

the extremely rare cases are rare, but not extremely. Based on my many-year experience trying to match OpenModelica's results with Dymola's, it's probably something like 2-5% of the test cases. If we want to adopt a tool-neutral approach to reference result generation and regression testing for MSL, while eventually achieving 100% success rate with all tools, they need to be handled properly.
this argument is in fact in favour of having a specific verificationExperiment annotation, only for those few cases that need it. Currently, we always generate the reference trajectory and do regression testing with a 10X smaller tolerance by default. Which is not enough in some cases, but in most other cases actually does what @maltelenz argues against, i.e., it doesn't test what the user experiences. If we had such a verificationExperiment annotation, we could use it for the (relatively few) examples where it is really needed to use different settings in order to do reliable automatic testing, and use the default experiment annotation in all other cases
BTW, nothing would prevents tool vendors from doing regression and cross-tool checking both with the regular experiment annotation and with the verificationExperiment annotation, if both are present. It goes without saying that numerically tricky models may legitimately fail the former and pass the latter, for the reasons argued above.

I also want to make people aware of the TestCase annotation we already have, which is a candidate for a place to add test related things.

Good point.

bilderbuchi commented 2 months ago

I noticed that there are two rather different opinions: according to the first one, the information to run cross-tool verifications should belong to completely separate entities, e.g. separate libraries such as ModelicaTest, according to the other instead, the information should be embedded in the models themselves.

Just a quick remark: this division is not terribly surprising to me, because it can also be observed in general in programming languages' testing tools -- in my personal experience there seem to be (at least) two major "schools":

Tests-with-implementation puts the code/information needed for testing close to the tested implementation, i.e. typically in the same file. Examples are GoogleTest, Catch2 or doctest.
Tests-separate collects the testing code in a place separate from the implementation being tested, e.g. in a separate module or folder. Examples are Python's pytest or unittest, or Julia's built-in unit testing.

I'll refrain from discussing respective pros/cons, since much of this comes down to personal preference, and to avoid bloating the discussion here. I just want to point out that the two approaches have widespread use/merit, and which one is "better" depends on a load of factors, probably.

As far as my experience is concerned, what sets Modelica somewhat apart from the ones above is the noticeable dominance of "reference testing"/golden master/regression testing. The approach of asserting smaller facts about models' behaviour, e.g. like XogenyTest proposed, seems to have very little to no use from my vantage point. I guess much of this is owed to the specific problem domain (in essence, time-dependent ODE simulation), but I'm wondering if an assert-oriented testing approach (i.e. testing specific facts about a component/model, not the whole set of time traces) could alleviate some of the problems encountered?

maltelenz commented 2 months ago

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

@maltelenz, I don't see the point of using a tighter tolerance for the reference creation, and then using a sloppy one to generate results that will fail the the verification test because of numerical errors. If you compare results obtained with different tools (and, mind you, the reference result will be one of them!), of course the tolerance should be the same.

I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements. The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment, will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.

If we instead generate the reference result with a sloppy tolerance from the experiment, it could already have results that are on the "edge" of the "band" around the perfect solution that we consider acceptable. If a different tool then gets results that are the same distance away from the perfect solution, but on the other edge of the band, it will fail the test.

casella commented 2 months ago

As a last remark (I apologize for flooding this ticket), it is my goal as the leader of MAP-Lib to eventually provide an MA-sponsored infrastructure that can test open-source libraries (starting from the MSL and ModelicaTest, of course) by running all models that can be simulated with all Modelica tools, automatically comparing their results and visualizing the outcome of the comparisons in a web interface. This would allow the library developer(s) to easily inspect the results, and eventually pick the results of any one tool as reference results, based on his/her expert judgement, so that he/she is not limited in the development to the availability of tool(s) for which he or she has a license. It will also allow the library users to pick the best tool(s) to run a certain library. I believe this is the only way we can truly foster the concept of tool-independent language and libraries, moving it from theory to practice.

In order for this thing to work, we have to lower the entry barrier as much as we can, so that we get as many libraries as possible in. As I envision it, one could start by asking for his/her library to be tested. He/she will then be able to compare the results with different tools, which may also indirectly point out models that are numerically fragile, and eventually select reference trajectories among the ones that were computed, not necessarily with the tool(s) that he or she has installed on his computer. In most cases, assuming the tools are not buggy and that the model is kosher with respect to the language spec (a mandatory requirement IMHO), selecting any specific tool result obtained with the default experiment annotation as a reference will cause all other tool results to fall within the CSV-compare tubes, so everybody's happy.

For a few corner cases (2-5%?) it will be clear that the model is numerically trickier, so the library developers may introduce a verificationExperiment annotation to determine tight enough conditions that allow to reduce the numerical errors below the CSV-compare tool tolerance, so that all tool results are close enough. In some cases, it will be clear that the results of some tools are plain wrong, and this will provide useful information to tool developers for bug fixing. Some other times, every tool could give a different result, which may be a strong indication of a numerically fragile model, which is useful information for the library developer. This is also an essential feature for MSL development: developing the "Standard" library with a single tool is not really a viable proposition, the developer needs as much feedback as possible from all MLS compliant tools to develop something truly standard.

Once this process has converged, the library will be usable with all tools and will have all the information embedded within it to allow continuous testing of this property, which will be publicly demonstrated on the MA servers. I believe this would be a very strong demonstration of the power of the Modelica ecosystem.

Now, the crucial point to make this dream come true is that the additional effort for the library developer to enter this game will need to be as low as possible, otherwise this is simply not going to happen.

With my proposal, and with some reasonable heuristics to pick meaningful comparison variables absent their indication (e.g. only the state variables for dynamic models) one can run cross-tool comparisons of an open-source library with zero additional effort by the library developer. The idea is that the majority of tested models will be fine, and it will only be necessary to provide a few custom annotations for the critical models, e.g. selecting specific variables for testing or changing the verificationExperiment annotation.

BTW, this task is conceptually under the responsibility of the open-source library developer(s), which however may have not enough time or motivation to take care of it. What is nice it that other parties (i.e., the Modelica community) could help in this process by means of pull requests to the library code base that introduce such annotations where needed. These pull requests with a few added annotations can preliminary be fed to the MA testing infrastructure, so that the library developer can see the result and accept the PRs with just one click of the mouse, if the results are fine. Which means, he or she can easily put in the judgement, without much effort. This is how we'll get things done. We could get student involved in this work, which would also promote Modelica with the younger generations.

IMHO this kind of process has a much higher chance of practical success than a process where we expect that Library Officers or open-source Library Developers (which are usually volunteers doing this in their spare time) have unlimited time and resources to set up elaborate testing libraries with ad-hoc developed test models, elaborate rules, multiple files, scripts and whatnot. Ideally, this may be the best option, but in practice, this is never going to happen. We need a viable path to achieve the dream I envisioned, and I believe that the verificationExperiment annotation, alongside with annotations for selecting the reference variables, is a key feature for that.

My 2 cts as MAP-Lib leader. 😃

casella commented 2 months ago

I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements.

I agree, except for the case of chaotic systems, for which there are theoretical reasons why even a very, very tight tolerance doesn't work in the long term, due to exponential divergence of very close trajectories.

The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment, will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.

"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?

Anyway, the problem here is that a solution with a 2% error, obtained from a model that has 10% uncertainty on key parameters (which is pretty normal for thermal systems) may be perfectly fine for all practical purposes, except automated verification, for which we have (rightfully) set a 0.2% relative tolerance. One thing is to get a result which is good enough for some application, another thing is to verify that two different tools give the same result if the numerical approximations are good enough.

Why should we unnecessarily use a tigher tolerance in the experiment annotation, hence longer simulation times, to stay within the CSV bounds, which have no consideration for the model uncertainty?

The fundamental issue addressed by this ticket is that the requirements for simulations are completely different whether you want to use the results to make decisions about a system being modelled, or you want to use the simulation result for cross-tool and regression checking. Different requirements lead to different simulation setups. Hence, different annotations to specify them.

HansOlsson commented 2 months ago

I would partially agree with @beutlich comment.

To me ModelicaTest is not only models testing each component once, but intended as unit-tests of components; so if there's a weird flag there should be a test-model for that. As always coverage and testing goes hand in hand.

It might be good if we also added some "integration tests" (in the testing meaning) - and to me that would fit naturally in ModelicaTest, but perhaps separated in some way. However, I understand that we might not have resources for that.

In contrast, the Examples models in MSL are primarily constructed as Examples demonstrating how to use (and sometimes not use) the library. Thus in one sense using them for (cross-tool-)testing is misusing them, but on the other hand we have them and they should work reliably (at least in general: see #2340 for an exception, and as discussed we also have chaotic systems as another exception) - so why not test them as well?

However, we should still keep the mind-set that Example-models are examples - not testing-models.

After thinking through this I think that one conclusion is that since the Examples aren't primarily for testing and they shouldn't be the only tests for the library, it should be ok to reduce the simulation time considerably for specific Example-models (due to chaotic behavior or too large results).

Whether such information is stored in the library or separately is a somewhat different issue. Both should work in principle, but I could see problems with getting agreement on the exact levels for different tools - and thus the need to add something outside of the library for specific cases and tools - which means that if we start inside the library we would have both.

gwr69 commented 2 months ago

Just wondering: Wouldn't it be more sensical to compare the distributions generated by chaotic systems instead of comparing data points sequentially within a short interval?

HansOlsson commented 2 months ago

Just wondering: Wouldn't it be more sensical to compare the distributions generated by chaotic systems instead of comparing data points sequentially within a short interval?

Possibly, but in order to get a good reference for the distribution one would need to simulate a lot more. Clearly there are also entirely different ways of getting the distribution - but I think that is beyond this issue

maltelenz commented 2 months ago

I believe the idea is that there is a "perfect" solution, if you had infinitely tight tolerance requirements.

I agree, except for the case of chaotic systems, for which there are theoretical reasons why even a very, very tight tolerance doesn't work in the long term, due to exponential divergence of very close trajectories.

Great, I fully agree with you here!

The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment, will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.

"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?

Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).

The option to run tests both with the user-facing experiment settings, as well as with the same tolerance as the reference generation, would fulfill some of both our wishes. But then, how do you determine if the run with experiment settings was "successful"? You end up with the same question we started with.

Anyway, the problem here is that a solution with a 2% error, obtained from a model that has 10% uncertainty on key parameters (which is pretty normal for thermal systems) may be perfectly fine for all practical purposes, except automated verification, for which we have (rightfully) set a 0.2% relative tolerance.

If the expected behavior of the model with its experiment settings is to vary by 2% (or 10%), does it make more sense to set the relative tolerance when checking the result somewhere in that region then? After all, you are saying that the uncertainty is that large, why are we not trying to match using a way (way way) sloppier check, instead of messing about testing something nobody will experience (special simulation settings for testing)?

HansOlsson commented 2 months ago

The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment, will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.

"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?

Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).

I can see this point, and in that case the solution seems to be to not tighten the solver tolerance, but to reduce the stop-time or at least the time we compare up to and/or loosen the tolerance when comparing. With the current setup we can get the former by only running the reference up to an earlier point.

For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.

Since the Examples aren't the only tests (or even primarily tests) I don't see that as a problem. ..

However, I can also see that it might make sense to run the tests with a tighter tolerance. The idea is then that we consider three simulation results:

Reference
Tool result 1 (strict tolerance)
Tool result 2 (default, less strict, tolerance)

I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".

If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).

However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.

maltelenz commented 2 months ago

The idea is to have this perfect solution in the reference result. The numerical variations from different tools and sloppier tolerance from the user-facing experiment, will then vary around this perfect solution, hopefully within the accepted testing tolerance considered acceptable.

"Hopefully" is unfortunately a problematic word here, in my experience 😃. Why hoping, if you can just tighten the tolerance also when generating the result to be compared?

Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied (although this can sometimes give very helpful information when debugging models that don't behave as expected).

I can see this point, and in that case the solution seems to be to not tighten the solver tolerance, but to reduce the stop-time or at least the time we compare up to and/or loosen the tolerance when comparing. With the current setup we can get the former by only running the reference up to an earlier point.

For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.

For chaotic systems, I agree something drastic like this is needed, and "test exactly what the user sees" is just not reasonable.

However, I can also see that it might make sense to run the tests with a tighter tolerance. The idea is then that we consider three simulation results:
* Reference

* Tool result 1 (strict tolerance)

* Tool result 2 (default, less strict, tolerance)
I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".

If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).

However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.

I could get behind something like this. I assume "Reference" would also be produced with "strict tolerance" in this case.

gwr69 commented 2 months ago

Again a short call for clarification: What is considered to be the “reference data generating process” in the absence of an exact mathematical function (i.e., solution)?

EDIT: Is it a “perfect” (i.e., reference) implementation which can produce arbitrarily precise (exact) data across all operating systems/implementation instances out there? (I had thought this to be an “expectation” value across a “bag of tools” …)

AHaumer commented 2 months ago

I fully agree with @casella 's opinion, and I fully support his ideas about the testing environment. Just some small remarks: I seems clear to me that a small portion of the models need specific tighter settings for verification. Maybe it is possible to start these models twice:

with tight verificationExperiment-settings for comparison
with original experiment-settings to ensure the example runs for the normal user The tighter verificationExperiment-settings are up to the repsonsibility of the library developer, not of tool developers! It would be a ridiculous cross-tool comparison if different tools use different settings.

bilderbuchi commented 2 months ago

For a chaotic system we would then only see that they start the same, not that the entire simulation is the same as the reference, but it should be good enough.

In these cases, instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.

henrikt-ma commented 2 months ago

I can imagine there could rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this

@henrikt-ma with cross-tool checking of the Buildings library, this happens in a significant number of cases, and it's really a nuisance if you want to make the library truly tool-independent with a high-degree of dependability. @mwetter can confirm. The solution, so far, was to tighten the tolerance in the experiment annotation, e.g. to 1e-7 for all those cases. This is really not a clean solution, because those simulations become unnecessarily slower and, most importantly, the difference in the obtained results is small, much smaller than the modelling errors inherent in the simulation model, which of course has lots of modelling approximations. As I'll argue once more in the next comment, the requirements for human inspection are completely different from the requirments for automatic cross-tool checking.

Are we even talking about the same automatic settings? What I have in mind was described in https://github.com/modelica/ModelicaSpecification/issues/3473#issuecomment-1933650490. That is, reference results are by default generated with a tolerance 10 times smaller than experiment.Tolerance. This means that before you even get to cross-tool checking, the library developer has to ensure that experiment.Tolerance is set well enough to at least pass tests running with the tool at hand (probably coinciding with the tool used to generate the reference). Once the library and its reference results fulfills this basic requirement on the experiment.Tolerance setting, it becomes much more interesting to see how well it works in cross-tool checking.

If this procedure is thought to cause unnecessarily slow simulations, I would blame that on comparison tolerances being too tight in relation to the purpose of the model. Want more speed? Lower your expectations on correctness, and it should be possible to increase the Tolerance while still being able to match a reference generated with 10 times stricter tolerance. If comparison tolerances are extremely generous so that very big Tolerance could be used, a toleranceFactor of 0.1 could be too big, and this would be one of those rare situations I have yet to see.

casella commented 2 months ago

Because I want to verify that the model will actually work for the user, in the way the user will simulate it. I am less interested if it will work if a completely different set of simulation settings are applied

I see your point. If you want to see that, I guess we should use a much looser tolerance for the CSV-compare tool. Not 0.2%, but probably 5% or maybe more.

The problem is, this may lead to accepting results from tools that are actually doing it wrong, just because they don't do it too much and their error is lost among the tolerance for not-too-small numerical errors.

(although this can sometimes give very helpful information when debugging models that don't behave as expected).

My proposal was implicitly only aimed at this goal. Now that is clearer, thanks for pointing this out.

casella commented 2 months ago

The idea is then that we consider three simulation results:
* Reference

* Tool result 1 (strict tolerance)

* Tool result 2 (default, less strict, tolerance)
I agree that we shouldn't just compare "Tool result 1" with "Reference", but also consider "Tool result 2". Ideally "Tool result 2" should be within 2% (or 10%) of "Reference".

If they differ, and "Tool result 1" is within tolerance of "Reference" it seems a matter of tolerance scaling in the tool, but if "Tool result 1" isn't close to "Reference" it seems a more fundamental issue in the tool (or the model).

However, it seems that part of the discussion is as if "Tool result 2" is close to "Reference", but "Tool result 1" isn't close to "Reference". If that happens there is something very weird.

After considering @maltelenz's arguments, I agree with this proposal by @HansOlsson.

In fact, experience shows that in most cases the results obtained with the experiment annotation Tolerance and StopTime will fall within the 0.2% default tolerance of the CSV-compare tool, so there won't be any need of extra annotations, nor to perform two separate verification simulations, we can go on as we're doing now. However, there will be a small (but non-negligible) number of cases where this doesn't happen.

In these cases we should add an extra annotation with tighter tolerance and possibly shorter StopTime, to handle chaotic systems. We would then run two verification simulations. The first with the default tolerance (but the possibly shorter StopTime) and a more lenient CSV tolerance, e.g. 5%; this ensures that the user experience is not too bad, which is @maltelenz's main concern. The second uses the tigher tolerance and the possibly shorter StopTime, and uses the original 0.2% CSV-compare tolerance; this ensures that once the numerical errors are drastically reduced, the result is very close to the reference and the tool is not acting weird. Both verification tests should pass.

In the first case (only old-fashioned experiment annotation is set) the reference trajectory will be generated with a default tolerance that is 10 times tighter than the one specified by the experiment annotation. In the second case, it will be generated with the tigher tolerance specified by the new annotation.

At this point I guess having a separate verificationExperiment annotation is not really a good idea. Maybe we could just add three other fields to the existing experiment annotation, with appropriate defaults:

StopTimeVerification = StopTime to be used both for reference generation and verification simulations
ToleranceReference = 0.1*Tolerance to be used for reference generation
ToleranceVerification = Tolerance to be used for verification simulations

In the 95% non-problematic cases, we can keep the old-fashioned experiment annotations and rely on the default values both to generate the reference results and to run verification simulations. In those few cases where this causes problems, the library officer/developer can add those three values to the experiment annotation, to specify how long the verification simulation should be, what tolerance to use to generate the reference results, and what tolerance to use when generating verification results for the 0.2% CSV-compare test. The 5% CSV-compare tests will be run with the regulare experiment annotation Tolerance, but only until StopTimeVerification, because chaotic trajectories will eventually deviate as much as 100% or more, given enough time.

Is there some consensus on this approach? If so, I can open a new ticket where we can refine the details.

casella commented 2 months ago

Again a short call for clarification: What is considered to be the “reference data generating process” in the absence of an exact mathematical function (i.e., solution)?

The reference data are simulation results that the modeller judges to be good enough to be used as a reference, based on all information and knowledge available to him/her. Of course this requires some kind of trust that the used simulation tool is actually doing the right things. From this point of view, it would be nice for the modeller to be able to see what kind of reference results will be generated by different tools and compare them (See my proposal for an MA CI above). Although mathematical correctness cannot be judged with democratic criteria, if I generate such trajectories with 6 different tools and 5 are very close to each other while one isn't, there some strong clue (though not rigorous proof) that the outlier is due to a tool bug, while the other ones are all reliable. The modeller would then pick one among those five (which one doesn't really matter) as the reference. Of course if those five make sense from a physical point of view, one should never blindly accept the results of a simulation.

EDIT: Is it a “perfect” (i.e., reference) implementation which can produce arbitrarily precise (exact) data across all operating systems/implementation instances out there? (I had thought this to be an “expectation” value across a “bag of tools” …)

I wish I could rely on such an implementation, but I don't think we have one now, nor we'll have one in the foreseeable future, that can fully cover the spectrum of models that can be written in Modelica. So, see above for my pragmatic proposal.

casella commented 2 months ago

Instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.

This can be done in some cases, but not always; also, it may be non-trivial to implement. Anyway, in those cases where the modeller can and wants to do this, he or she can add suitable assert() statements to the model, while providing an empty list of variables for reference trajectory generation. This is already possible now, without the need to change the Modelica language specification.

henrikt-ma commented 2 months ago

At this point I guess having a separate verificationExperiment annotation is not really a good idea. Maybe we could just add three other fields to the existing experiment annotation, with appropriate defaults:

Again: Why not reuse TestCase?

maltelenz commented 2 months ago

The most recent proposal by @casella seems like a good direction.

We should use TestCase for these new values though, as I mentioned earlier and as @henrikt-ma reiterates in the previous comment.

henrikt-ma commented 2 months ago

StopTimeVerification = StopTime to be used both for reference generation and verification simulations

If placed inside the TestCase annotation, it could be called just StopTime. Alternatively, I'd suggest ReferenceStopTime as it is the stop time of the reference.

ToleranceReference = 0.1*Tolerance to be used for reference generation

The point of having a (Reference)ToleranceFactor = 0.1 instead is that if you override the default and then, much later, decide to use another Tolerance, you avoid the risk of suddenly having the same tolerance for reference generation as for verification, hence preserving the sanity check that Tolerance is reasonably set.

ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens up for having inadequate Tolerance set in models. Instead, there should be ways of adjusting comparison tolerances so that the Tolerance set in the model can pass verification even results are a bit (but not too much) sloppy for simulation performance reasons.

bilderbuchi commented 2 months ago

Instead of comparing trajectories, you could also assert that certain model invariants (e.g. energy in a system), or pre- or postconditions remain fulfilled.

This can be done in some cases, but not always; also, it may be non-trivial to implement. Anyway, in those cases where the modeller can and wants to do this, he or she can add suitable assert() statements to the model, while providing an empty list of variables for reference trajectory generation. This is already possible now, without the need to change the Modelica language specification.

Agreed! I merely wanted to point this alternative approach out as a way out of the "chaotic behaviour vs. reference trajectories" conundrum. Sorry for not making this clearer from the outset.

henrikt-ma commented 2 months ago

I also need to mention an elephant in the room before we fall too deeply in love with Tolerance and related properties such as ReferenceToleranceFactor:

… and the relative integration tolerance Tolerance for simulation experiments

That is, we are discussing the control of local error for some integration method (with adaptive step size) that hasn't been defined. Hence, the global error – which is what matters for verification of simulation results – will generally be bigger with increased number of steps taken. To avoid this arbitrary relation to the integration method chosen by the tool, it would make way more sense to set an ErrorGrowth goal for local error per step length, but I am afraid that our community is stuck with integration methods and step size control that don't support such mode of operation.

This is an area where I hope that the modern and flexible DifferentialEquations.jl and the active community around it could benefit us all; that if we just keep nagging @ChrisRackauckas about this opportunity to lead the way, they will eventually set a new standard for error control that the rest of us will have to catch up with.

HansOlsson commented 2 months ago

It is not that simple - tolerance can be local error per step or per step length (normalized or not).

However, looking at global error has two issues:

The propagation from local to global error is quite complicated for any realistic pure ODE system. With hybrid behaviour it is even worse.
In many other numerical algorithms a backward error analysis is used instead of the forward error usually meant by "global error". The backward error analysis seems like a closer match for the tolerance. Remember that all models are (partially) incorrect.

However, I don't see that such a discussion is relevant here.

henrikt-ma commented 2 months ago

However, I don't see that such a discussion is relevant here.

My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.

casella commented 2 months ago

My point is that we must have realistic expectations on how far we can get with cross-tool verification when we haven't even agreed upon what Tolerance really applies to.

@henrikt-ma my expectations are based on several years of experience trying to match Dymola-generated reference values with OpenModelica simulations on several libraries, most notably the MSL and Buildings, which contain a wide range of diverse models (mechanical, themal, electrical, thermo-fluid, etc.). The outcome of that experience is that the current set-up works nicely in 95% of the cases, but then we always get corner cases that need to be handled. If we don't, we end up with an improper dependence of the success of the verification process on the tool that actually generated the reference data, which is something we must absolutely get rid of, if we want the core statement "Modelica is a tool-independent modelling language" to actually mean something.

I am convinced that my latest proposal will enable to handle all these corner cases nicely. Of course this is only based on good judgement, I have no formal proof of that, so I may be wrong, but I guess we should try to make a step forward, as the current situation is not really tenable for MAP-Lib. If this doesn't work, we can always change it, the MLS is not etched in stone.

As to the meaning of Tolerance, the MLS Sect. 18.4 defines it as:

the default relative integration tolerance (Tolerance) for simulation experiments to be carried out which is a bit (I guess deliberately) vague, but widely understood as the relative tolerance on the local estimation error, as estimated by the used variable-step-size solver. In practice, I understand that parameter is just passed on to the relative tolerance parameter of integration routines.

The point of this parameter is not to be used quantitatively, but just to have a knob that you can turn to get more or less precise time integration of differential equations. In most practical cases, experience showed that 1e-4 was definitely too sloppy for cross-tool verification, 1e-6 gives satisfactory results in most cases, but in some cases you need to further tighen that by 1 to 3 orders of magnitude to avoid numerical errors play a too big role. That's it 😅.

casella commented 2 months ago

StopTimeVerification = StopTime to be used both for reference generation and verification simulations

If placed inside the TestCase annotation, it could be called just StopTime.

Sounds good. KISS 😃

ToleranceReference = 0.1*Tolerance to be used for reference generation

The point of having a (Reference)ToleranceFactor = 0.1 instead is that if you override the default and then, much later, decide to use another Tolerance, you avoid the risk of suddenly having the same tolerance for reference generation as for verification, hence preserving the sanity check that Tolerance is reasonably set.

I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance =1e-7. But if we need to tighen it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. Happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference should necessarily be generated with a tigher tolerance than the verification simulation, if the latter is tight enough. Do I miss something?

That was the reason for specifying the tolerance for reference generation directly, instead of doing it with some factor. At the end of the day, I guess it doesn't matter much and it's mostly a matter of taste, in fact GUIs could take care of this aspect, like giving the number of intervals which is then translated into the Interval annotation by computing StopTime-StartTime/numberOfIntervals.

ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens up for having inadequate Tolerance set in models.

This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which is 95% of the time), the verification is passed. Otherwise, we need to override the default and try a stricter one. Do I miss something?

henrikt-ma commented 2 months ago

I understand the requirement, but I'm not sure this is a good solution in all cases. If the default Tolerance = 1e-6, it makes perfect sense to generate reference results with Tolerance =1e-7. But if we need to tighen it significantly, e.g. Tolerance = 1e-8, it may well be that Tolerance = 1e-9 is too tight and the simulation fails. Happened to me many times with numerically challenging thermo-fluid models. Per se, I don't think that the reference should necessarily be generated with a tigher tolerance than the verification simulation, if the latter is tight enough. Do I miss something?

Yes; if the library developer can't even make the simulation results match a reference generated with more strict tolerance, I would argue that the reference shouldn't be trusted. In my experience, it doesn't need to be 100 times smaller than Tolerance to serve as a sanity check, 10 times seems to be enough difference. I can also imagine that a factor of 0.01 might too often give solvers problems dealing with too tight tolerance, which is why I'm suggesting 0.1 to be the default. Before a default is decided upon, however, I suggest we try some different numbers to see in how many cases the default would need to be overridden; if we can manage that number with a default of 0.01, I'd be in favor of that since it gives us an even stronger indication of the quality of the reference results in the default cases.

I'd be very sceptic about using a ReferenceToleranceFactor above 0.1 for any test, as it would show that the test is overly sensitive to the Tolerance setting. For instance, if another tool uses a different solver, one couldn't expect that the Tolerance in the model would work.

henrikt-ma commented 2 months ago

ToleranceVerification = Tolerance to be used for verification simulations

I think this is a bad idea, as it only opens up for having inadequate Tolerance set in models.

This was @maltelenz's requirement. He wants to run one verification simulation with the Tolerance set in the experiment annotation. If the results fall within the 0.2% CSV-compare bounds (which is 95% of the time), the verification is passed. Otherwise, we need to override the default and try a stricter one. Do I miss something?

That it only looks like a poor workaround for not being able to specify more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.

casella commented 2 months ago

For the record, here are some required tolerance changes for Buildings in order to avoid cross-tool verification issues: PR lbl-srg/modelica-buildings#3867

casella commented 2 months ago

That it only looks like a poor workaround for not being able to specify more relaxed comparison tolerance when one insists on prioritizing simulation speed over result quality. If we avoid introducing ToleranceVerification, we ensure that verification always corresponds to what the user will actually see.

The problem is, we don't only want to see that. We also want to make sure that if you tighten the tolerance, you get closer to the "right" result. Regardless of what the user experience is.