modelica / ModelicaSpecification

Specification of the Modelica Language
https://specification.modelica.org
Creative Commons Attribution Share Alike 4.0 International

Define a verificationExperiment annotation for experiments meant to compare results for regression testing and cross-tool comparisons #3473

Open casella opened 4 months ago

casella commented 4 months ago

Modelica models include an experiment annotation that defines the time span, tolerance and communication interval for a default simulation of the system. Usually, these parameters are set in order to get meaningful results from the point of view of the modeller. Since in many cases the models are affected by significant parametric uncertainty and modelling assumptions/approximations, it typically makes little sense to seek very high precision, say rtol = 1e-8, resulting in longer simulation times, when the results are affected by maybe 1% or more error.

We are currently also using these values to generate reference results and to run simulations whose results are compared to them, both for regression testing and for cross-tool testing. This is unfortunately not a good idea, mainly for two reasons:

  1. in some cases, the numerical errors exceed the tolerance of the CSV-compare tool, so we get a lot of false negatives, because different tools, different versions of the same tool, or even the same tool on different hw/sw platforms (see, e.g., OpenModelica/OpenModelica#11935) lead to different numerical errors/approximations;
  2. some other cases feature chaotic motion (e.g. the Furuta pendulum or three-body problems) or large numbers of closely-spaced state events (e.g. all kinds of switched circuit models), whose triggering times inevitably tend to drift apart due to accumulating errors in determining the exact switching times.

For testing, what we need is to select simulation parameters which somehow guarantee that the numerical solution obtained is so close to the exact one that numerical errors cannot lead to false negatives, so that a verification failure really means something has changed with the model or the way it was handled to generate simulation code.

In both cases, what we need is to clearly differentiate between the role of the experiment annotation, which is to produce meaningful results for the end-user of the model, and of some new annotation, which is meant to produce near-exact results for comparisons.

For case 1., what one typically needs is a tighter tolerance, and possibly a shorter communication interval. How much tighter and shorter depends on the specific case, and possibly also on the set of tools involved in the comparison - there is no one-size-fits-all number.

For case 2., it is necessary to choose a much shorter simulation interval (a smaller StopTime), so that the effects of chaotic motion or the accumulated drift of event times don't have enough time to unfold significantly. Again, how much shorter depends on the participants in the comparison, and may require some adaptation.

For this purpose, I would propose to introduce the verificationExperiment annotation, with exactly the same arguments as the experiment annotation, to be used for generating results for verification. Of course, if some arguments (e.g. StartTime) are not supplied, or if the annotation is missing altogether, the corresponding values of the experiment annotation (or their defaults) will be used instead.
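
To make the proposal concrete, here is a minimal sketch of how the two annotations could coexist (the model name and all numbers are invented; verificationExperiment is the annotation proposed here, not something in the current specification):

```modelica
model FurutaPendulum "Hypothetical chaotic example"
  // ... model contents ...
  annotation(
    // Settings meant to produce meaningful results for the end user
    experiment(StartTime = 0, StopTime = 50, Tolerance = 1e-6, Interval = 0.01),
    // Proposed annotation: near-exact settings for generating and comparing
    // reference results; omitted arguments (here StartTime) would fall back to
    // the experiment annotation or to its defaults
    verificationExperiment(StopTime = 5, Tolerance = 1e-9, Interval = 0.001));
end FurutaPendulum;
```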

HansOlsson commented 4 months ago

I agree that such information is useful, and ideally it should only be needed for a fraction of the models.

I believe that what we have used for Dymola is, instead of modifying the package, to keep that information available externally - and also to loosen the tolerance so that we don't get false positives; but I will check.

As far as I can see, that has a number of advantages:

henrikt-ma commented 4 months ago

I'm not 100% happy with this direction, as it means that the default will be to use the settings from the experiment annotation, meaning that we don't automatically detect that decreasing the tolerance by an order of magnitude has a significant impact on the result.

I see good reasons to generate reference results according to a procedure where tolerance is automatically set an order of magnitude lower than that in the experiment, so that CI correctness tests running with the experiment settings can detect when the experiment is inappropriate.

I can imagine there could be rare situations when the automatic settings for reference result generation need to be overridden, but I have yet to see an actual example of this. Similar to what @HansOlsson says regarding Dymola, we at Wolfram also keep information for overriding the experiment in files that are separate from the Modelica code, with the ability to control both reference result generation and settings for running tests (used in rare situations as an alternative to using very sloppy comparison tolerances when the experiment settings are considered really poor).

A related topic is the use of different StopTime for regression testing and for model demonstration, but this might be better handled with a TestStopTime inside the normal experiment.
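
For illustration, a sketch of that alternative (TestStopTime is a hypothetical name, not part of the current experiment annotation):

```modelica
// The demonstration runs to 1000 s, while regression comparisons would stop at
// the hypothetical TestStopTime of 100 s, before drift and chaos accumulate.
annotation(experiment(StopTime = 1000, Tolerance = 1e-6, TestStopTime = 100));
```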

christoff-buerger commented 4 months ago

I honestly would not introduce any new annotations etc. to the language. In my opinion, Modelica already has all the means one needs to achieve this, e.g., inheritance and modification. What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

For example, you can have basic user experiments in your library (i.e., examples) and, in a separate regression test library, refine these experiments with different settings (change tolerances, stop times, whatever). In this setup, the regression testing setup is not part of the final library distribution, which is likely how it should be (because you might add a bunch of scripts and other software for continuous integration support, resulting in huge dependencies to make your automatic tests run). You just don't want to ship all of that. And as @henrikt-ma mentions, good practice at the tool vendors is already to have separate settings for this.

I think, testing is a matter of project policies and templates. MSL can come up with a good scheme for its use case and improve its infrastructure without any need for language changes.

casella commented 4 months ago

I respectfully disagree with @henrikt-ma and @christoff-buerger:

Regarding this comment

What has to be done by each individual project is to decide for ITS DOMAIN AND USE-CASES on the structure and templates of modelling all the simulation applications.

it is true that each MA project has its own responsibility, but the whole point of the MA is to provide coordinated standards and libraries. In this case, I clearly see the need for a small language change to support the work by MAP-Lib. BTW, we are talking about introducing one new annotation, which is no big deal. We could actually introduce a vendor annotation __MAP_Lib_verificationExperiment, but this really seems like overkill to me; can't we just have it as a standard Modelica annotation?

@GallLeo, @beutlich, @dietmarw, @AHaumer, @hubertus65, I'd like to hear your opinion.

GallLeo commented 4 months ago

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing. For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.

I had a look at the "heuristics" proposed 10 years ago, when I started creating the first set of reference results: https://raw.githubusercontent.com/wiki/modelica/ModelicaStandardLibrary/MSLRegressionTesting.pdf#page=11
Takeaway: these rules worked quite well for most test cases (at least much better than if we had used tolerance 1e-4 everywhere). Back then, I was not able to use a tight tolerance everywhere, because some models would only simulate with their default tolerance. Are these models to be considered "bad models"? Or should we be able to test any working model?

After the 4.0.0 release, @beutlich documented his adaptation of tolerances in the MSL wiki: https://github.com/modelica/ModelicaStandardLibrary/wiki/Regression-testing-of-Modelica-and-ModelicaTest
Takeaway: finding the "right" strict tolerance for MSL examples still needs manual work.

So, in order to make it transparent, I'm in favor of explicitly specifying verification settings. But then we have to limit the manual work for setting these up as much as possible: if an example/test case is changed, should the verification settings be updated automatically? Example: if you change the stop time of an example from 1000 s to 2000 s, interesting dynamics could be missed if you keep your verification stop time at 1000 s.

Where to specify the explicit verification settings?

@christoff-buerger proposed using extends. We could do that in ModelicaTest, without manually writing a new library: ModelicaTest.Modelica...MyExample extends Modelica...MyExample annotation(experiment(...)); (sketched below).
Benefit: all MSL test cases would reside in ModelicaTest (no need to run two libraries).
Drawback: the CI or the library developer has to generate the extends test case in ModelicaTest. If the test case for a new example is not generated, it will be missed in regression runs. If the experiment setup of an example is changed, it will not be automatically updated in the test case.
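
Spelled out, such an extends-based test case could look roughly as follows (package and model names are placeholders):

```modelica
within ModelicaTest.SomeDomain;
model MyExample "Regression-test variant of the corresponding MSL example"
  extends Modelica.SomeDomain.Examples.MyExample;
  // Tighter settings used only for reference generation and comparison
  annotation(experiment(StopTime = 100, Tolerance = 1e-8, Interval = 0.01));
end MyExample;
```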

@HansOlsson proposed storing the verification setup externally. This might work, as long as you test in one Modelica tool and this Modelica tool cares about the verification setup. I'm unsure how to handle it in multi-tool environments. We already have the burden of updating Resources/Reference/Modelica/.../MyExample/comparisonSignals.txt

Separate files are very hard to keep up to date, therefore storing verification settings in the example/test case seems to be the right place to me. Especially if we think about using signals from the figures annotation as important comparison signals.

I would add arguments to experiment instead of a new verificationExperiment. Reasons:

beutlich commented 4 months ago

I am pretty much aligned with @GallLeo here.

Example models

Using the simulation results of example models as reference results for cross-tool, cross-solver or cross-MSL-version regression testing can be considered a misuse. In strict consequence, there should be no directory in Modelica/Resources/Reference/ at all.

Reference models

Reference models should be taken from ModelicaTest. If example models from Modelica shall also be considered for regression testing, it might be an option to extend from them and adopt the experiment settings. This would also simplify the job of the regression tester, since only one library needs to be taken into account.

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

  1. Specify reference signals outside the model in ModelicaTest/Resources/Reference/ModelicaTest/ and take solver settings from the experiment annotation. This is the status quo (for models in ModelicaTest). It is error-prone in the sense that signal files can be missing or wrong (duplicated or invalid signal names).
  2. Specify both reference signals and solver settings in the model (sketched below). Yes, that could distract MSL users, even if only in ModelicaTest.
  3. Specify both reference signals and solver settings outside the model. The only requirement we have is that it should be a machine- and human-readable text-based file format (say TXT, as now for comparisonSignals.txt, or some more structured YAML with comments and syntax highlighting). The main advantage I see is that I do not need a Modelica parser or editor to obtain and modify these settings; it is all there in a specified file/directory location which can be controlled from the test runner engine and easily format-tested by linters/checkers. Of course, it again is error-prone in the sense of missing files or wrong signals. (You might compare it to the Modelica localization feature, where we also keep the localization outside the library itself in a dedicated directory location.)
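
As a sketch of what option 2 might look like inside a test model (the comparisonSignals entry and all names are hypothetical; only the experiment annotation exists in the current specification):

```modelica
model MyTestCase "Solver settings and reference signals kept together"
  extends ModelicaTest.SomeDomain.MyExample;
  annotation(
    experiment(StopTime = 10, Tolerance = 1e-8, Interval = 0.002),
    // Hypothetical place to list the reference signals next to the solver settings
    TestCase(comparisonSignals = {body.w, body.phi}));
end MyTestCase;
```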

Summary

I am in favour of keeping reference models outside Modelica and only using ModelicaTest for them. I am in favour of not having reference signals and experiment settings distributed across various locations, i.e., options 2 or 3 are my preferences. I even see more advantages in option 3. In that case it is left to specify (if at all)

I still need to be convinced that we need to have this in the specification, or if it is simply up to the library developers of ModelicaTest. (Remember, it is now all in ModelicaTest and no longer in the MSL.)

dietmarw commented 4 months ago

I would also be in favour of having the verification settings separate from the simulation model but referenced from within the simulation model. So basically @beutlich's option 3.

HansOlsson commented 4 months ago

First of all: We need something for MSL, but the standardized result should be easy to adopt by all library developers. With "library developers" I don't mean the few public libraries which already have their own testing setup, but the hundreds of company-internal libraries, where part-time developers (engineers) hear about regression testing for the first time. As soon as they want to give their library to a colleague with a different Modelica tool, they enter the inscrutable world of cross-tool testing. For MSL testing, many implicit rules are used. They will not be adopted by independent library developers.

I agree that many of those rules will not be adopted by all libraries.

Additionally, to me, use cases such as this are one of the reasons we want the possibility to have the information completely outside of the library, so that the user of the library can add the things needed for testing - without changing the upstream library.

AHaumer commented 4 months ago

IMHO @casella is right with his analysis and his idea. I'm not fond of storing the comparisonSignals.txt and separate settings for comparison / regression tests in addition to the model. It is much better to store this information (stop time, interval length, tolerance, what else?) within the model, either in a second annotation or as part of the experiment annotation. If not present, the settings of the experiment annotation are valid (in most cases). Additionally, we could specify the tolerance for the comparison. The model developer decides whether changes for comparison / regression tests are necessary or not, maybe after receiving a message that comparison / regression tests are problematic (which could come from tool vendors). As this annotation would be supported by most tools, it could be adopted by all third-party libraries. Regarding test examples: I think it's a good idea to use all "normal" examples that are intended to demonstrate "normal" usage, and additionally the "weird" examples from ModelicaTest.

maltelenz commented 4 months ago

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

Except for extremely rare cases like the chaotic system, I don't want testing to happen with special settings for test runs. If you do that, you are no longer testing what the user will experience, and models might not even simulate in the end product (with the settings in the experiment annotation). Nobody would notice if the tests run with some special configuration.

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I also want to make people aware of the TestCase annotation we already have, which is a candidate place to add test-related things.
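
For reference, to my knowledge the existing TestCase annotation currently only carries the shouldPass flag:

```modelica
// As specified today: shouldPass = false documents that translation or
// simulation of this model is expected to fail.
annotation(TestCase(shouldPass = false));
```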

HansOlsson commented 4 months ago

One thing that is unclear to me in most of the discussions above is what pieces are talking about settings for reference data creation, and what pieces are talking about settings for test runs.

I'm a big proponent of "test what you ship", in the sense that one should (at least) have tests that do what a user would do. In this case it means running the examples with the settings from the experiment annotation. I can see that there are cases where this is not possible, like the chaotic systems already mentioned. In this case, a shorter stop time for result comparison is fine.

I agree with testing what you ship (as part of testing; you can add more), and even this case involves two different things:

Note that there can be other reasons than chaos for separating testing from running, e.g.:

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.) I don't know why the StopTime is 1.01 s instead of 1 s.
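
For concreteness, the quoted documentation works out as follows (a sketch based only on the numbers above, not the model's actual source):

```modelica
// 3 s with about 50000 output intervals gives Interval = 3 / 50000 = 6e-5 s;
// an experiment annotation matching the documented plotting setup would be:
annotation(experiment(StopTime = 3, Interval = 6e-5));
```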

maltelenz commented 4 months ago

I agree with testing what you ship (as part of testing; you can add more), and even this case involves two different things:

* Running the chaotic model (should work to the actual end-point); we don't want the chaotic model to just crash

* Comparing it with the reference (will be outside the bounds if using original end-point)

Thank you for clarifying this. I agree the test (in a perfect world) in this case should include both these steps.

Modelica.Mechanics.MultiBody.Examples.Loops.EngineV6 (and _analytic) says:

Simulate for 3 s with about 50000 output intervals, and plot the variables engineSpeed_rpm, engineTorque, and filteredEngineTorque. Note, the result file has a size of about 300 Mbyte in this case. The default setting of StopTime = 1.01 s (with the default setting of the tool for the number of output points), in order that (automatic) regression testing does not have to cope with a large result file.

(Note that the experiment annotation doesn't have number of intervals, but Interval=6e-5 corresponds to the above, and even 1e-4 will sort of work as well; but not much higher - since that will under-sample the envelope too much leading to weird effects.)

As a user of the model, I would not want to see the weird under-sampling of the engineTorque either. How am I supposed to know if it is the actual behavior?

The file size issue for testing can be dealt with by the tool only storing the variables you need for the comparison. For the user, if we introduce figures in the model, tools could be smarter there as well and only store (by default) what you need for plotting (and animation).

henrikt-ma commented 4 months ago

I also agree that we should not only specify the solver settings but also the valid reference signals. I see multiple options.

At Wolfram, where we have somewhat extensive use of figures in the models, we have the default policy of simply using all variables appearing in figures as comparison signals (@GallLeo already briefly mentioned this idea above). It has several advantages:

One small limitation of the approach is that there is no way to say that a variable in a figure should be excluded from correctness testing. I'm not aware of any model where we have needed this, but still… As @maltelenz suggested, the TestCase-annotation could be the perfect place for this sort of information. For example:

annotation(TestCase(excludeVariables = {my.first.var, my.arr[2].y}));
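
For context, a rough sketch of how that policy plays together with the figures annotation, assuming the MLS 3.6 Figure/Plot/Curve records inside Documentation (variable names are borrowed from the EngineV6 discussion above; excludeVariables remains hypothetical):

```modelica
annotation(
  Documentation(figures = {
    Figure(title = "Engine torque", identifier = "torque", plots = {
      Plot(curves = {Curve(y = engineTorque), Curve(y = filteredEngineTorque)})})}),
  // All curve variables would become comparison signals, except the excluded one
  TestCase(excludeVariables = {filteredEngineTorque}));
```
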
henrikt-ma commented 4 months ago

I definitely see the use of special settings for reference data creation. One could use that to give a tighter tolerance, resulting in reference data closer to the "perfect" result, increasing the chances of test runs with different tools/platforms/compilers getting close to the reference.

I think it is well understood that there isn't one Tolerance that will work for all models. What I'm afraid is less well understood is that it is very hard to detect this when reference results are generated with the same tolerance which is used when running a test. To avoid this, I'd like something like the following:

annotation(TestCase(experiment(toleranceFactor = 0.01)));

The default for toleranceFactor should be something like 0.1, and I would expect that there would rarely be a reason to override the default; when there is a correctness error due to tolerance settings, one would fix it by adjusting the Tolerance, not by using a toleranceFactor closer to 1.
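
As a worked example of this proposal (toleranceFactor is hypothetical, not part of the current TestCase annotation): with the suggested default of 0.1, a model declaring

```modelica
annotation(experiment(Tolerance = 1e-6),
           TestCase(experiment(toleranceFactor = 0.1)));
```

would have its reference results generated with an effective tolerance of 1e-6 * 0.1 = 1e-7, while CI test runs keep using the experiment's 1e-6.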

mwetter commented 3 weeks ago

I am also in favor of adding a new annotation, or an entry to the experiment annotation, that allows specifying the data needed to produce result files for cross comparison. Our current setup is to simply increase the tolerance by a factor of 10, but such a global policy turns out to be unsatisfactory. In my opinion, it is the modeler's responsibility to add this information. Most of our contributors are not "professional" developers but mainly researchers or students who use the library and occasionally develop contributions, so asking them to put all information inside the .mo file would be preferred.

Therefore, having model-specific specifications for the CI tests would be valuable. These specifications should include

Currently we have (some of) this information spread across different files for about 1700 test models. This is due to historical reasons. It would be good to refactor this, as the current situation makes it hard for new developers to figure out what is to be specified where, and it makes it harder to explain how to port changes across maintenance branches. Having this inside the .mo annotation is my preferred way.