HLA Time Management Settings: Recommendation 1: Configure all your federates to be both Time Constrained and Time Regulating. Two of your federates are not Time Constrained, which means they will not receive Timestamp Order (TSO) messages. Instead, those messages are delivered in Receive Order (RO), even though they were sent TSO. Because the data is RO, a federate could receive 0, 1, or 2 pieces of data in any given frame, depending on when each message happens to arrive.
For a deterministic and repeatable distributed simulation, it is recommended to configure the federates to be both Time Constrained and Time Regulating.
In your input file, use the following settings:

    federate.set_time_regulating( True )
    federate.set_time_constrained( True )
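To make the "0, 1, or 2 pieces of data per frame" behavior concrete, here is a toy Python model (not RTI code) of a sender publishing every 5 ms with a 5 ms lookahead to a receiver running 5 ms frames. The latencies are made-up numbers. Under RO, a message is processed in whichever frame it physically arrives. Under TSO, a message stamped t + lookahead is released only when the receiver is granted to that time, so network jitter does not change which frame sees it. The model assumes the RTI holds the grant until the message has arrived, which is what TSO delivery guarantees.

```python
from collections import Counter

FRAME_MS = 5      # assumed receiver frame, equal to the 5 ms lookahead
LOOKAHEAD_MS = 5  # assumed sender lookahead


def per_frame_counts(frame_indices, first, last):
    """Count how many messages are handled in each frame, first..last inclusive."""
    counts = Counter(frame_indices)
    return [counts.get(f, 0) for f in range(first, last + 1)]


def receive_order_frames(send_times_ms, latencies_ms):
    """RO: a message is processed in whatever frame it happens to arrive."""
    return [(t + lat) // FRAME_MS for t, lat in zip(send_times_ms, latencies_ms)]


def timestamp_order_frames(send_times_ms):
    """TSO: a message stamped t + lookahead is released when the receiver is
    granted to that time, regardless of network latency."""
    return [(t + LOOKAHEAD_MS) // FRAME_MS for t in send_times_ms]


send_times = [0, 5, 10, 15, 20, 25]  # sender updates every 5 ms
latencies = [1, 4, 6, 2, 8, 3]       # made-up network jitter in ms

ro = per_frame_counts(receive_order_frames(send_times, latencies), 0, 5)
tso = per_frame_counts(timestamp_order_frames(send_times), 1, 6)
print("RO :", ro)
print("TSO:", tso)
```

Running it prints `RO : [1, 1, 0, 2, 0, 2]` and `TSO: [1, 1, 1, 1, 1, 1]`: the same six updates are spread unevenly across frames under RO, but land exactly one per frame under TSO.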
Trick Software (i.e. Realtime) Frame Size and Data Exchange Rate: I suspect you have a 5 millisecond Trick realtime frame that corresponds to the 5 millisecond lookahead time you are using. Getting a Trick simulation to run in realtime without overruns with a 5 millisecond frame will require you to isolate CPUs, lock the Trick sim to the isolated CPU, disable interrupts, tune the OS, and more than likely install the Linux realtime kernel extensions.
Recommendation 2: Unless something is forcing you to use a small Trick software frame time, you should use something more reasonable, like 100 milliseconds. The Trick software frame must be an integer multiple of your lookahead time. Also, the Least Common Time Step (LCTS) will now likely be your software frame time, so that you land on that time boundary when transitioning the federation to freeze (or any other mode).
Only the federate with the Pacing role should have the realtime clock enabled. You can override the frame size in the input file or in the realtime.py file:

    exec(open( "Modified_data/trick/realtime.py" ).read())
    trick.exec_set_software_frame( 0.100 )
    federate.set_least_common_time_step( 0.100 )
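Here is a small Python sketch of the timing arithmetic behind Recommendation 2, with all times in seconds. The helper names are hypothetical, not TrickHLA API. It uses `fractions.Fraction` so that 0.100 / 0.005 comes out as exactly 20 rather than a floating-point near-miss. The frame-multiple rule is the one stated above; checking the LCTS against the lookahead as well is an assumption added here.

```python
from fractions import Fraction


def _exact(seconds):
    """Convert a decimal time in seconds to an exact fraction (avoids float error)."""
    return Fraction(str(seconds))


def steps_per(value_s, base_s):
    """Return value/base if it is a positive whole number, else None."""
    ratio = _exact(value_s) / _exact(base_s)
    return int(ratio) if ratio > 0 and ratio.denominator == 1 else None


def check_timing(software_frame_s, lookahead_s, lcts_s):
    """Hypothetical sanity check of the frame / lookahead / LCTS relationships."""
    issues = []
    if steps_per(software_frame_s, lookahead_s) is None:
        issues.append("software frame is not an integer multiple of the lookahead")
    # Assumption: the LCTS should also be a whole multiple of the lookahead.
    if steps_per(lcts_s, lookahead_s) is None:
        issues.append("LCTS is not an integer multiple of the lookahead")
    return issues


def next_boundary(sim_time_s, lcts_s):
    """Earliest LCTS boundary strictly after sim_time_s (e.g. for a freeze)."""
    t, step = _exact(sim_time_s), _exact(lcts_s)
    return (t // step + 1) * step
```

For example, `check_timing(0.100, 0.005, 0.100)` returns an empty list, a 0.03 s lookahead fails both checks, and `next_boundary(12.345, 0.100)` gives 12.4 s, the next boundary a freeze could land on.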
Recommendation 3: Use a much larger lookahead time. The following distributed simulation architectures can help:
Typically, using shadow state combined with lag-compensation works very well.
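A minimal sketch of the shadow-state plus lag-compensation idea, assuming a simple constant-velocity model. This is generic dead-reckoning-style propagation written for illustration, not TrickHLA's lag compensation classes; a real federate would propagate with its full dynamics (attitude, gravity, etc.). The receiver keeps a shadow copy of the remote entity's last time-stamped update and propagates it to its own current time, so updates can arrive less often, or with a larger lookahead, without the local view standing still. `ShadowState` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class ShadowState:
    """Local shadow of a remote entity's state as of time_s (hypothetical class)."""
    time_s: float
    position: Vec3  # meters
    velocity: Vec3  # meters/second

    def update(self, time_s: float, position: Vec3, velocity: Vec3) -> None:
        """Replace the shadow with a newly received, time-stamped update."""
        if time_s < self.time_s:
            return  # ignore stale data (possible with Receive Order delivery)
        self.time_s, self.position, self.velocity = time_s, position, velocity

    def compensate(self, now_s: float) -> Vec3:
        """Lag compensation: propagate the last update to now_s at constant velocity."""
        dt = now_s - self.time_s
        return tuple(p + v * dt for p, v in zip(self.position, self.velocity))


shadow = ShadowState(10.0, (0.0, 0.0, 0.0), (1.0, 2.0, 0.0))
print(shadow.compensate(10.5))  # (0.5, 1.0, 0.0)
```

Between updates, each local frame calls `compensate()` with its own time instead of reusing the last received value; a new update simply resets the starting point.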
SpaceFOM Roles: Just a reminder that within a SpaceFOM-compliant federation, there can be only one federate each in the Master, Pacing, and Root Reference Frame roles. Given that the granted time for the Chaser is ahead of the MPR and Target federates, and that all federates are Time Regulating, this seems to imply that the Chaser is also acting as a Master federate and that all the other federates are late joiners.
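One way to sanity-check a configuration against the one-federate-per-role rule is sketched below in Python. The role strings and the dictionary layout are illustrative only; in practice you would fill the dictionary from what each federate's input file actually sets. The function reports every singleton role held by zero federates or by more than one.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Illustrative labels for the SpaceFOM roles that must be unique.
SINGLETON_ROLES = ("Master", "Pacing", "RootReferenceFrame")


def role_conflicts(assignments: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each singleton role not held by exactly one federate to its holders."""
    holders: Dict[str, List[str]] = defaultdict(list)
    for federate, roles in assignments.items():
        for role in roles:
            holders[role].append(federate)
    return {role: sorted(holders[role])
            for role in SINGLETON_ROLES if len(holders[role]) != 1}


mpr = {"Master", "Pacing", "RootReferenceFrame"}
print(role_conflicts({"MPR": mpr, "Chaser": set(), "Target": set()}))
print(role_conflicts({"MPR": mpr, "Chaser": {"Master"}, "Target": set()}))
```

The first call prints `{}`; the second prints `{'Master': ['Chaser', 'MPR']}`, which is the situation suspected above, with both the MPR and the Chaser configured as Master.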
Recommendation 4: Make sure your Master federate is configured to know about all the federates required for the execution, using something like this in the MPR federate's input file:

    federate.add_known_fededrate( True, str(federate.federate.name) )
    federate.add_known_fededrate( True, 'Chaser' )
    federate.add_known_fededrate( True, 'Target' )
I did not test your configuration with some federates not being Time Constrained, but I did do an 8+ hour test run with all federates configured as both Time Constrained and Time Regulating, and I did not see any issues. I will try to make a test run with your HLA Time Management configuration when I can.
As usual, Dan gives a very thorough explanation with good recommendations. I would also add: check whether your local system is running network security scans. We have observed that some network security scans interfere with the Trick variable server and cause a Trick simulation to 'behave poorly'; in some cases, the simulation freezes up completely. For some of our very long runs, we have had to ask our network security folks to temporarily suspend scanning. When possible, we try to run on an isolated network that does not have network scanning.
Dan and Zack, thank you very much for this in-depth and thorough guidance on checking my federation. It turns out that there was a weird error in the models of the federate sims, and fixing it has now let me run longer than the target of 24 hours. I am going through the guidance now and checking my setup against what you have suggested. I don't have shadow states or lag compensation, so I am going to start looking at those to perhaps reduce the transmission rate. Thanks again for the great guidance!
Happy to have helped. I have been working up some example classes in the jeod_examples branch of TrickHLA. They have some examples of Lag Compensation. I hope to merge those into the master branch soon. However, I keep running into little bugs here and there. ;-)
Hello! I'm interested in simulating a SpaceFOM scenario for approximately 24 hours in real time. I seem to get ~7 hours before the execution effectively pauses. I'm sharing the CRC's table of time advancement grants from both executions:
In each case, one federate (which happens to be the last one to join) is granted a time ahead of the others, and perhaps once they get out of sync they get stuck. What kind of debugging would you recommend to get to the heart of the issue?
Another data point: the MPR is getting quite a few overruns. I'll be digging in more to see which jobs are overrunning and will post here when I figure it out.
Thank you for the guidance!