nasa / TrickHLA

TrickHLA: An IEEE 1516 High Level Architecture (HLA) Simulation Interoperability Standard Implementation for Trick Base Simulations

Running for 24hrs- it gets to about 7hrs. Time advancement grants issue? #131

Closed simtheverse closed 11 months ago

simtheverse commented 11 months ago

Hello! I'm interested in simulating a SpaceFOM scenario for approx 24hrs in real time. I seem to get ~7hrs before the execution effectively pauses. I'm sharing the CRC's table for time advancement grants from both executions:

[Two screenshots: CRC time advancement grant tables, one per execution]

In each case, one federate (happens to be last one to join) gets granted to some time ahead of the others, and perhaps when they get out of sync they get stuck. I was curious what kind of debugging do you recommend to get down to the heart of the issue?

Another data point is that the MPR is getting quite a few overruns. I'll be digging in more to see which jobs are overrunning and can post it here when I figure it out.

Thank you for the guidance!

dandexter commented 11 months ago

HLA Time Management Settings:

Recommendation 1: Configure all your federates to be both Time Constrained and Time Regulating. Two of your federates are not Time Constrained, which means they will not receive Timestamp Order (TSO) messages. Messages sent TSO will instead be delivered to those federates in Receive Order (RO). Because RO data is delivered as it arrives, a federate could receive 0, 1, or 2 pieces of data per frame.

For a deterministic and repeatable distributed simulation, it is recommended to configure the federates to be Time Constrained and Time Regulating.

In your input file use the following settings:

    federate.set_time_regulating( True )
    federate.set_time_constrained( True )
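To make the "0, 1, or 2 pieces of data per frame" point concrete, here is a toy model in plain Python. It does not use HLA or TrickHLA at all: the jitter values, the 100 ms frame, and the `updates_per_frame` helper are made up for illustration. It only contrasts delivery keyed by arrival time (RO) with delivery keyed by timestamp (TSO).

```python
from collections import Counter

def updates_per_frame(delivery_times_us, frame_us, n_frames):
    """Count how many updates fall in each frame [k*frame_us, (k+1)*frame_us)."""
    counts = Counter(t // frame_us for t in delivery_times_us)
    return [counts.get(k, 0) for k in range(n_frames)]

FRAME_US = 100_000      # hypothetical 100 ms frame, in microseconds
LOOKAHEAD_US = 100_000  # hypothetical lookahead equal to the frame

# The sender publishes once per frame.
send_times_us = [k * FRAME_US for k in range(5)]

# Receive Order: data shows up whenever the network delivers it.
# These delays are invented to show the effect of jitter.
jitter_us = [20_000, 95_000, 110_000, 10_000, 30_000]
ro_delivery = [s + j for s, j in zip(send_times_us, jitter_us)]
ro_counts = updates_per_frame(ro_delivery, FRAME_US, 6)

# Timestamp Order: data is delivered at its timestamp (send time + lookahead),
# regardless of when it physically arrived.
tso_delivery = [s + LOOKAHEAD_US for s in send_times_us]
tso_counts = updates_per_frame(tso_delivery, FRAME_US, 6)

print("RO  updates per frame:", ro_counts)
print("TSO updates per frame:", tso_counts)
```

In this toy run the RO receiver sees frames with zero, one, and two updates, while the TSO receiver sees exactly one update per frame once the first timestamp is reached.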

Trick Software (i.e. Realtime) Frame Size and Data Exchange Rate:

I suspect you have a 5 millisecond Trick realtime frame that corresponds to the 5 millisecond lookahead time you are using. Getting a Trick simulation to run in realtime without overruns with a 5 millisecond realtime frame will require you to isolate CPUs, lock the Trick sim to the isolated CPU, disable interrupts, tune the OS, and more than likely install the Linux realtime kernel extensions.

Recommendation 2: Unless something is forcing you to use a small Trick software frame time, use something more reasonable like 100 milliseconds. The Trick software frame must be an integer multiple of your lookahead time. Also, the Least Common Time Step will now likely be your software frame time so that you land on that time boundary when transitioning the federation mode to freeze (etc.).

Only the federate with the Pacing role should have the realtime clock enabled. You can override the frame size in the input file or in the realtime.py file:

    exec(open( "Modified_data/trick/realtime.py" ).read())
    trick.exec_set_software_frame( 0.100 )

    federate.set_least_common_time_step( 0.100 )
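The timing relationships in Recommendation 2 can be sanity-checked before a long run. The sketch below is standalone Python, not part of Trick or TrickHLA. `check_timing` and `to_us` are hypothetical helper names, and the checks only mirror the two statements above (frame is an integer multiple of lookahead, LCTS equals the software frame), not whatever validation TrickHLA performs internally. Times are converted to integer microseconds because floating-point remainders on values like 0.1 and 0.005 can be misleading.

```python
def to_us(seconds):
    """Convert seconds to integer microseconds to avoid float remainder issues."""
    return round(seconds * 1_000_000)

def check_timing(software_frame_s, lookahead_s, lcts_s):
    """Return a dict of checks mirroring the recommendations in this thread.

    Hypothetical helper for illustration; not a TrickHLA API.
    """
    frame = to_us(software_frame_s)
    lookahead = to_us(lookahead_s)
    lcts = to_us(lcts_s)
    if frame <= 0 or lookahead <= 0 or lcts <= 0:
        raise ValueError("all times must be positive")
    return {
        # The software frame must be an integer multiple of the lookahead.
        "frame_multiple_of_lookahead": frame % lookahead == 0,
        # The Least Common Time Step is suggested to match the software frame.
        "lcts_equals_frame": lcts == frame,
    }

if __name__ == "__main__":
    # 100 ms frame, 5 ms lookahead, 100 ms LCTS: both checks pass.
    print(check_timing(0.100, 0.005, 0.100))
    # 100 ms frame with a 30 ms lookahead: not an integer multiple.
    print(check_timing(0.100, 0.030, 0.100))
```

The same values passed to `trick.exec_set_software_frame` and `federate.set_least_common_time_step` in the input file could be run through a check like this first.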

Recommendation 3: Use a much larger lookahead time. The following distributed simulation architectures can help:

  1. Use shadow states that are integrated at the dynamics rate with HLA data representing truth state updates. For example, the Chaser federate maintains a dynamics state for the Target that it updates locally at the dynamics rate and will override the local state with the Target HLA data that gets reflected to it at some slower rate.
  2. Because all HLA TSO data will be for a previous time, you will need to compensate for the latency (i.e. lag) in the data. Zack has been working on examples that show this in the jeod_examples branch. Basically, you use forces, moments, and torques to compensate the stale data up to the current scenario time when updating the vehicle position and attitude. The scenario and data exchange rate (i.e. lookahead) will help you determine how much error you can live with, but lag compensation will greatly reduce the error in a lot of situations.
  3. If you don't want to do Lag-Compensation, you can always crank up the data exchange rate to minimize the amount of latency in the data that is determined by the lookahead time.
  4. Ultimately, if you have very stringent position and attitude error requirements, you may have to step up to using Zero-Lookahead data to achieve intra-frame data exchanges. However, this technique will end up serializing the dynamics between the participating federates, and you must lay out exactly when all the exchanges take place; otherwise you get deadlock.

Typically, using shadow state combined with lag-compensation works very well.
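As a rough illustration of items 1 and 2, the sketch below combines a shadow state with a simple lag compensation step. It is not the code in the jeod_examples branch. `State`, `propagate`, and `ShadowTarget` are hypothetical names, the motion is one-dimensional, acceleration is assumed constant over the compensated interval, and attitude and torques are left out entirely.

```python
from dataclasses import dataclass

@dataclass
class State:
    time: float  # scenario time this state is valid for (s)
    pos: float   # 1-D position (m)
    vel: float   # 1-D velocity (m/s)

def propagate(state, accel, to_time):
    """Advance a state to to_time, assuming constant acceleration."""
    dt = to_time - state.time
    return State(
        time=to_time,
        pos=state.pos + state.vel * dt + 0.5 * accel * dt * dt,
        vel=state.vel + accel * dt,
    )

class ShadowTarget:
    """Local copy of a remote vehicle, integrated at the local dynamics rate."""

    def __init__(self, initial):
        self.state = initial

    def step(self, accel, dt):
        """Integrate the shadow state forward by one dynamics step."""
        self.state = propagate(self.state, accel, self.state.time + dt)

    def reflect(self, received, accel):
        """Override the shadow with received (stale) truth data.

        The received state is first propagated from its own timestamp to the
        current local time, which is the lag compensation step.
        """
        self.state = propagate(received, accel, self.state.time)

if __name__ == "__main__":
    shadow = ShadowTarget(State(time=0.0, pos=0.0, vel=2.0))
    for _ in range(8):                 # eight local dynamics steps of 0.125 s
        shadow.step(accel=0.0, dt=0.125)
    # Truth data stamped 0.5 s in the past arrives via HLA.
    shadow.reflect(State(time=0.5, pos=1.25, vel=2.0), accel=4.0)
    print(shadow.state)
```

The design choice here is that the shadow is always integrated locally, and the slower HLA updates only correct it after being brought forward to the local time, so the dynamics never consume raw stale data.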

SpaceFOM Roles:

Just a reminder that within a SpaceFOM-compliant federation, only one federate can hold each of the Master, Pacing, and Root Reference Frame roles. Given that the granted time for the Chaser is ahead of the MPR and Target federates, this seems to imply the Chaser is also a Master federate and all the other federates are late joiners, because all federates are also Time Regulating.

Recommendation 4: Make sure your Master federate is configured to know about all the required federates needed for the execution by using something like this in the MPR federate's input file:

    federate.add_known_fededrate( True, str(federate.federate.name) )
    federate.add_known_fededrate( True, 'Chaser' )
    federate.add_known_fededrate( True, 'Target' )

dandexter commented 11 months ago

I did not test your configuration of some federates not being Time Constrained, but I did do an 8+ hour test run with all federates configured as both Time Constrained and Time Regulating and I did not see any issues. I will try to make a test run with your HLA Time Management configuration when I can.

ezcrues commented 11 months ago

As usual, Dan gives a very thorough explanation with good recommendations. I would also add that you should check whether your local system is running network security checks. We have observed that some network security scans will interfere with the Trick variable server and cause a Trick simulation to 'behave poorly'. In some cases, this will cause a simulation to completely freeze up. For some of our very long runs, we have had to ask our network security folks to temporarily suspend scanning. When possible, we try to run on an isolated network that does not have network scanning.

simtheverse commented 11 months ago

Dan and Zack, thank you very much for this in-depth and thorough guidance for checking my federation. It turns out that there was a weird error in the models of the federate sims, and fixing it has now let me run longer than the target of 24 hours. I am going through the guidance now and checking my setup based on what you have suggested. I don't have shadow states or lag compensation, so I am going to start looking at that to perhaps reduce the transmission rate. Thanks again for the great guidance!

ezcrues commented 11 months ago

Happy to have helped. I have been working up some example classes in the jeod_examples branch of TrickHLA. They have some examples of Lag Compensation. I hope to merge those into the master branch soon. However, I keep running into little bugs here and there. ;-)