hla.rti1516e.encoding.DecoderException and FOM modules

openlvc / portico

Portico is an open source, cross-platform, fully supported HLA RTI implementation. Designed with modularity and flexibility in mind, Portico is a production-grade RTI for the Simulation and Training Community, so come say hi!

http://www.porticoproject.org

153 stars 81 forks source link

hla.rti1516e.encoding.DecoderException and FOM modules #188

Open bergtwvd opened 8 years ago

bergtwvd commented 8 years ago

We have three federates A, B and C, where

all federates use the RPR-FOM (2.0)
federates B and C use a Sensor FOM module (not based on RPR)

And where

Federate A generates entity states
Federate B translates these entity states to tracks
Federate C displays tracks and entity states on a map

Federate A uses the RPR-FOM in the createFederationExecution and joinFederationExecution Federate B and C use the RPR-FOM and Sensor FOM in the createFederationExecution and joinFederationExecution

We see the following behaviour depending on start-up order of the federates:

If B or C start before A then all works fine
If A starts first we consistently get a DecoderException in federate C for one of the attributes in the track data (see below)

[0m hla.rti1516e.encoding.DecoderException: Only valid values for boolean are 0 and 1, found: 1090903400 [0m at org.portico.impl.hla1516e.types.encoding.HLA1516eBoolean.decode(HLA1516eBoolean.java:121)

The same test with the Pitch RTI runs fine.

There is the suspicion that the FOM merge has influence on this. If the federation execution is created with all possible FOM modules, exchange of data between federates appears to work fine. However, this situation is not desirable and also not according to the 1516 standard.

timpokorny commented 8 years ago

G'day @bergtwvd, that does distinctly sound like a merge issue. What should happen is that each federate does a pre-merge on join as a check, but it is the coordinator of the federation that commits that to the actual FOM when it grants their join request. It then updates all federates with the new object model representation.

I wonder if that update message is getting stuck in the queue and hasn't been processed prior to a call to something that uses the in-memory representation. Those calls are all fulfilled unilaterally by the local federate, so i wonder if the internal representation hasn't been updated (could be an async issue). I will test with a simplified FOM, but can you generate a log set to TRACE level and see what each of the federates is saying about the merge and join calls at the start?

Cheers, Tim

bergtwvd commented 8 years ago

Attached a zip with Portico TRACE level output.

The federates:

shipsim starts first (RPR-FOM2)
sensorsim starts second (RPR-FOM2, Sensor FOM)
kml starts third (RPR-FOM2, Sensor FOM)

Consistent decode exception in kml.

trace.zip

bergtwvd commented 8 years ago

Attached a zip with Portico TRACE level output, with a different start up order and where there is no problem.

The federates:

sensorsim starts first (RPR-FOM2, Sensor FOM)
shipsim starts second (RPR-FOM2)
kml starts third (RPR-FOM2, Sensor FOM)

No decode exceptions. Output by kml is OK.

trace2.zip

timpokorny commented 8 years ago

Received - thank-you very much. This was very useful to trace through. Looking at the generated logs and tracing through the code to see the various steps generating them I think I have broken something when adding the recent WAN support.

The join message sent to clients previously contained the full set of serialized FOM modules but now only contains the name of the joining federate - that's going to make it difficult for them to learn about it! I will start looking into a fix.

timpokorny commented 8 years ago

OK, so an update - I was wrong and have found the FOM merge section, which appears to be working as it should (from what I can tell). The merging happens post-join, but outside the join process, so it is somewhat hidden - not helpful! Regardless, it appears to be happening. I'm having a closer look at the decoder exception now.

timpokorny commented 8 years ago

This is strange. The exception I'm seeing in the logs is one that indicates that there is a buffer underflow:

[32mkml_1        |[0m TRACE [main] portico.lrc: CALLBACK reflectAttributeValues(object=2097153,attributes={1249(4b),1250(4b),1251(12b),1235(4b),1236(19b),1237(8b),1238(4b),1239(4b),1240(27b),1242(12b),1244(8b),1247(24b)}) (RO)
[32mkml_1        |[0m 1::13:33:07.213::kmlserver.FederateAmbassador::reflectAbsoluteTrackAttributeValues::null
[32mkml_1        |[0m hla.rti1516e.encoding.DecoderException: Insufficient space remaining in buffer to decode this value
[32mkml_1        |[0m     at org.portico.impl.hla1516e.types.encoding.HLA1516eFloat64BE.decode(HLA1516eFloat64BE.java:111)
[32mkml_1        |[0m     at org.portico.impl.hla1516e.types.encoding.HLA1516eFixedRecord.decode(HLA1516eFixedRecord.java:147)
[32mkml_1        |[0m     at org.portico.impl.hla1516e.types.encoding.HLA1516eFixedRecord.decode(HLA1516eFixedRecord.java:155)
[32mkml_1        |[0m     at kmlserver.sensorcodec.SensorCodec.decodePosition(SensorCodec.java:348)
[32mkml_1        |[0m     at kmlserver.FederateAmbassador.reflectAbsoluteTrackAttributeValues(FederateAmbassador.java:193)
[32mkml_1        |[0m     at kmlserver.FederateAmbassador.reflectAttributeValues(FederateAmbassador.java:139)
[32mkml_1        |[0m     at kmlserver.FederateAmbassador.reflectAttributeValues(FederateAmbassador.java:101)
[32mkml_1        |[0m     at org.portico.impl.hla1516e.handlers.ReflectAttributesCallbackHandler.process(ReflectAttributesCallbackHandler.java:102)
[32mkml_1        |[0m     at org.portico.utils.messaging.MessageChain.process(MessageChain.java:78)
[32mkml_1        |[0m     at org.portico.utils.messaging.MessageSink.process(MessageSink.java:187)
[32mkml_1        |[0m     at org.portico.lrc.LRC.tickProcess(LRC.java:688)
[32mkml_1        |[0m     at org.portico.lrc.LRC.tick(LRC.java:555)
[32mkml_1        |[0m     at org.portico.impl.hla1516e.Impl1516eHelper.evokeMultiple(Impl1516eHelper.java:188)
[32mkml_1        |[0m     at org.portico.impl.hla1516e.Rti1516eAmbassador.evokeMultipleCallbacks(Rti1516eAmbassador.java:5407)
[32mkml_1        |[0m     at kmlserver.Federate.start(Federate.java:449)
[32mkml_1        |[0m     at kmlserver.KMLServer.start(KMLServer.java:145)
[32mkml_1        |[0m     at kmlserver.Main.main(Main.java:84)

Why it works based on startup order I am finding difficult to picture.

There is no way to pull out data type information from the RTIambassador, so merging should be irrelevant here as there is no way to get information at that level back out.

Do you have any ideas about where the Pitch middleware stuff is pulling the information it needs from in order to construct its decoder? I would have presume this was pre-compiled.

To help try and shed some light here I've updated master with some additional information in those decoder exceptions about how much data was found (vice what was expected). This will hopefully show if there is a problem with the integrity of the data being provided in the reflection. Can I ask you to run again using the latest master? Do you want me to generate a build for you to make that a bit less of a burden?

bergtwvd commented 8 years ago

Tim, thanks for the effort so far. The federate uses the RTI interface directly, and does not have any Pitch Dev Studio generated code. I have attached the sensor codec class that is used for the encoding/decoding of tracks.

I am happy to run things again and gather output. Yes, please, if you can generate a build for this that would be great (my enviornment is not set up for running from the master branch).

Thanks!

SensorCodec.zip

timpokorny commented 8 years ago

No problems. Code received and it all looks fine to me. I added another test based on the exact code and it all passes. It is starting to smell like the data passed in the callback is somehow different.

Current master has the additional logging. I have tried to upload a new set of nightlies but I am away from home until tomorrow evening currently and my active upload speed is... 1 megabit. I can't even begin to describe how broken life is. I can get something uploaded, but it will have to wait until Friday evening by the looks of it :(

bergtwvd commented 8 years ago

This is fine. I will pick up the nightly over the weekend, whenever it is available.

anthonycramp commented 8 years ago

I can build from master later tonight and send you a jar if you want.

On Thursday, 6 October 2016, bergtwvd notifications@github.com wrote:

This is fine. I will pick up the nightly over the weekend, whenever it is available.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/openlvc/portico/issues/188#issuecomment-251886104, or mute the thread https://github.com/notifications/unsubscribe-auth/AAtYrv8u2fsFJ0z4w0BTvW7o2AJwZaqGks5qxKGJgaJpZM4KE_lJ .

bergtwvd commented 8 years ago

Okay too. Just let me know.

bergtwvd commented 8 years ago

Attached the log when running with the new portico jar. Is there a way to confirm that I am indeed using the new jar?

In the GOOD one the start order is: KML - Ship - Sensor In the BAD one the start order is: Ship - KML - Sensor

trace2-GOOD.zip trace2-BAD.zip

timpokorny commented 8 years ago

Sorry for the delay. I walked back into the office after three weeks away - that went as you can predict.

Right, I've now had a chance to review the logs and I can see the same error you were originally highlighting (Boolean parse error). This is also different from the error in the first set of logs, and I think that provides us with the clue.

As a guess, I think you are right about the FOM merging. I suspect that there is a race condition that is causing the federates to have the correct unified hierarchy, but potentially with differing handle values for some of the attributes. This may explain why it is tripping up. It is getting all the attributes, but it is looking at wrong ones to contain certain information and fails to decode.

Currently, all the FOM merging is a peer activity. Rather than a central coordinator sending our a canonical FOM, each federate hears about the join messages and then merges the modules. I am guessing now that the second and third federates are joining closely together and merging is happening in the opposite order.

As a simple test for this, are you able to start the federates one by one with a bit of spacing between them rather than all at once? This would remove any condition and ensure a stable, unified state.

I would be interested to see if that actually fixed the problem. From there, I'd have to code in something that switched it to use the canonical approach rather than have it decentralized.

The only word of caution I have for that in the logging there is one attribute that is a difference size in the logs between the good and bad runs. If the above guess is correct, I would have expected both to report having the same say 11 attributes for an update with the same sizes, but one is different. Might be innocuous (a string that was just 1 char longer in the second run for example), or might be something else.

bergtwvd commented 8 years ago

I have rerun the tests, with 30 secs between sim starts. You can see the start order and sleep times at the start of the log. Unfortunately no difference :-(.

Attached the results.

test3-GOOD.zip test3-BAD.zip

timpokorny commented 7 years ago

@bergtwvd OK - Apologies for the delay, I have been on the road for almost five of the last six weeks :S

Two requests right at the bottom.

On the plane home yesterday I started to look again at this issue. Reviewing the history I think that the most likely explanation at this point is that there is a disconnect/disagreement between federates between about what handle relates to what attribute, perhaps due to the order things were merged in.

From Oct 5 logs I see an InsufficientSpaceException when unpacking an HLA1516eFloat64BE, then from other logs I see a failure to unpack HLAboolean (out-of-range). This appears variable which could be explained by the above.

However, in another log I see a null pointer exception along with the boolean out-of-range error (below) that is coming from the FederateAmbassador. It is possible that this may be related to some unpacking issue. This may be related to the same issue, or it might mean the root cause is something else.

ERROR [main] portico.lrc: FAILURE Exception encountered while processing incoming message: null
java.lang.NullPointerException
    at kmlserver.FederateAmbassador.reflectAbsoluteTrackAttributeValues(FederateAmbassador.java:307)
    at kmlserver.FederateAmbassador.reflectAttributeValues(FederateAmbassador.java:139)
    at kmlserver.FederateAmbassador.reflectAttributeValues(FederateAmbassador.java:101)
    at org.portico.impl.hla1516e.handlers.ReflectAttributesCallbackHandler.process(ReflectAttributesCallbackHandler.java:102)
    at org.portico.utils.messaging.MessageChain.process(MessageChain.java:78)
    at org.portico.utils.messaging.MessageSink.process(MessageSink.java:187)
    at org.portico.lrc.LRC.tickProcess(LRC.java:688)
    at org.portico.lrc.LRC.tick(LRC.java:555)
    at org.portico.impl.hla1516e.Impl1516eHelper.evokeMultiple(Impl1516eHelper.java:188)
    at org.portico.impl.hla1516e.Rti1516eAmbassador.evokeMultipleCallbacks(Rti1516eAmbassador.java:5407)
    at kmlserver.Federate.start(Federate.java:449)
    at kmlserver.KMLServer.start(KMLServer.java:145)
    at kmlserver.Main.main(Main.java:84)

I've fixed a couple of bugs that were preventing some FOM logging from working properly. I can not have the logs include attribute names rather than just handles, which will help in some of the logging areas (especially looking at the size of incoming reflects and which attributes have which numbers of bytes). I've updated it so that if the portico.logWithNames includes attributeClass in its list, it'll print the names rather than the handles. This will help with some more context.

Can I please get you to do the following:

If possible, can you link me up with the FederateAmbassador code so I can look at the null pointer issue?
Can you re-run the test with the following set in the RID file: i. portico.logWithNames=attributeClass ii. portico.fom.print=enabled iii. If possible, start the federates apart a little bit so they have time to log their FOMs without their output running over one another.

Thank-you for your extreme patience!

Cheers, Tim

bergtwvd commented 7 years ago

Tim, I decided to provide you access to the software in order to reproduce and debug the problem. I think this will be much more efficient. I will forward you the information in a separate email. Don't hesitate to mail me for more information.

Thanks, -- Tom

timpokorny commented 7 years ago

H'ok, so I can see the issue. When merging modules there is a chance that different federates will end up with different handles. This is causing the local federate to look for an attribute under a handle different to what it was sent with, resulting in ... chaos, or in this case, the wrong data and decoder exceptions.

As an example, from the bad setup, see the following dumped to the log from the Sensor and KML federates:

SENSOR

[sensor_1   |     -> (object class): Track (handle: 1233)
[sensor_1   |            (attribute): TrackDescription,      handle=1239, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackClassification,   handle=1242, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackStatus,           handle=1236, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackIdentification,   handle=1243, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackTimeOfInitiation, handle=1241, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackQuality,          handle=1238, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TargetObjectSet,       handle=1235, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackGenerator,        handle=1244, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TargetObjectSetValid,  handle=1234, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackTime,             handle=1237, order=TIMESTAMP, transport=RELIABLE, space=null
[sensor_1   |            (attribute): TrackNumber,           handle=1240, order=TIMESTAMP, transport=RELIABLE, space=null

KML

[kml_1      |     -> (object class): Track (handle: 1233)
[kml_1      |            (attribute): TrackDescription,      handle=1238, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackClassification,   handle=1244, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackStatus,           handle=1243, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackIdentification,   handle=1234, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackTimeOfInitiation, handle=1235, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackQuality,          handle=1241, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TargetObjectSet,       handle=1240, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackGenerator,        handle=1237, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TargetObjectSetValid,  handle=1239, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackTime,             handle=1236, order=TIMESTAMP, transport=RELIABLE, space=null
[kml_1      |            (attribute): TrackNumber,           handle=1242, order=TIMESTAMP, transport=RELIABLE, space=null

Module merging is meant to be deterministic, but clearly it is not. Will have to dive into the best way to fix this. Either fix the determinism problem (less invasive) or change the exchange so that merging is only ever done on the coordinator (which has other drawbacks, especially when big FOMs are used, or lots of federates join fast).

I shall return with more.