pramit11 commented 9 years ago

Summary

When running a federation with time regulation enabled on the federates (using the example federate), if one of the federates crashes, the remaining federates are no longer granted time advances. Ideally, when a federate crash is detected, the federation should resign the crashed federate and continue the execution.

The following log is generated:

WARN [VERIFY_SUSPECT.TimerThread,SampFederation,xyzpc-12213] portico.lrc.jgroups: Detected that federate [1] may have crashed, investigating...
INFO [Regular] portico.lrc.jgroups: Federate [crashedFederateName,2] disconnected, synthesizing resign message

At the code level, in FederationListener.java, a resign message is created for each crashed federate, but it is never dispatched:

        // loop through each of the federates that disappeared and fake up a resign
        // action to give to the local LRC
        for( Integer federateHandle : disappeared.keySet() )
        {
            String federateName = disappeared.get( federateHandle );

            // synthesize a resign notification
            ResignFederation resign =
                new ResignFederation( JResignAction.DELETE_OBJECTS_AND_RELEASE_ATTRIBUTES );
            resign.setSourceFederate( federateHandle );
            resign.setFederateName( federateName );
            resign.setFederationName( federationName );
            resign.setImmediateProcessingFlag( true );

            Message resignMessage = new Message( channel.jchannel.getAddress(),
                                                 channel.jchannel.getAddress(),
                                                 MessageHelpers.deflate(resign) );

            logger.info( "Federate ["+federateName+","+federateHandle+
                         "] disconnected, synthesizing resign message" );
        }
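
The loop above builds `resignMessage` and then only logs; nothing consumes the message. As a rough illustration of the missing step, here is a minimal, self-contained sketch that hands each synthesized resign to a local handler. Every name in it (`ResignSketch`, `FakeResign`, `synthesizeAndDispatch`, the `Consumer` callback) is hypothetical and stands in for whatever Portico actually uses; it is not the project's API and not necessarily how the fix was implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: none of these types are Portico or JGroups classes.
class ResignSketch
{
    // Stand-in for a synthesized resign notification (hypothetical type).
    static final class FakeResign
    {
        final int federateHandle;
        final String federateName;

        FakeResign( int federateHandle, String federateName )
        {
            this.federateHandle = federateHandle;
            this.federateName = federateName;
        }

        @Override
        public String toString()
        {
            return "[" + federateName + "," + federateHandle + "]";
        }
    }

    // Builds one resign per disappeared federate and passes each one to the
    // local handler. Returns the number of resigns that were handed over.
    static int synthesizeAndDispatch( Map<Integer,String> disappeared,
                                      Consumer<FakeResign> localHandler )
    {
        int dispatched = 0;
        for( Map.Entry<Integer,String> entry : disappeared.entrySet() )
        {
            FakeResign resign = new FakeResign( entry.getKey(), entry.getValue() );

            // The step that is absent from the loop in the report:
            // actually deliver the synthesized message to the local receiver.
            localHandler.accept( resign );
            dispatched++;
        }
        return dispatched;
    }

    public static void main( String[] args )
    {
        Map<Integer,String> disappeared = new LinkedHashMap<>();
        disappeared.put( 2, "crashedFederateName" );

        List<FakeResign> received = new ArrayList<>();
        int count = synthesizeAndDispatch( disappeared, received::add );
        System.out.println( count + " resign(s) delivered locally: " + received );
    }
}
```

The handler is passed in as a callback so the sketch stays independent of how the real listener reaches the local LRC.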

Environment and Logs

HLA v1.3, 1516e
C++, Java
Linux

michaelrfraser commented 9 years ago

@pramit11 thanks for bringing this to our attention!

@timpokorny if this is just a matter of the resignMessage object not being sent via JGroups, I'm more than happy to take a look at it.

timpokorny commented 9 years ago

@michaelrfraser I'm not 100% sure. The fake resign should be generated and processed locally only (not sent out as it would be in normal circumstances).

I have a funny feeling this is due to the way the suspect JGroups callback is handled. Off the top of my head, I think this method is first called for a lost connection, and then later, once the loss is confirmed, a new View turns up that no longer contains the lost connection (as a sort of confirmation). We may be doing, or not doing, something in the suspect process. So it could be in code, or it could be in the JGroups stack configuration (embedded in the jar, so it'll be in resources). A bit of quick JGroups reading is required if you're up for it.
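
For reference, the confirmation step described here (a new view that no longer contains the lost connection) boils down to a set difference between the previous membership and the current one. The sketch below uses plain Java sets as a stand-in for views; `ViewDiffSketch` and `disappeared` are hypothetical names for illustration, not JGroups or Portico API, and the real listener may track membership differently.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: plain sets stand in for the member lists of two views.
class ViewDiffSketch
{
    // Returns the members present in the previous view but absent from the
    // current one, in their original order. Neither input set is modified.
    static <T> Set<T> disappeared( Set<T> previousMembers, Set<T> currentMembers )
    {
        Set<T> gone = new LinkedHashSet<>( previousMembers );
        gone.removeAll( currentMembers );
        return gone;
    }

    public static void main( String[] args )
    {
        Set<String> before = new LinkedHashSet<>( Arrays.asList("fedA", "fedB", "fedC") );
        Set<String> after = new LinkedHashSet<>( Arrays.asList("fedA", "fedC") );

        // Any member reported here would be a candidate for a synthesized resign.
        System.out.println( "Disappeared: " + disappeared(before, after) );
    }
}
```

If a suspect notification arrives before the new view does, a diff like this only reports the federate once the second event shows up, which may be relevant to the timing question raised above.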

michaelrfraser commented 9 years ago

I've submitted a PR for this issue: https://github.com/openlvc/portico/pull/128

timpokorny commented 9 years ago

PR merged into master

openlvc / portico

Crashed federate doesn't resign from federation #126
