openlcb / OpenLCB_Java

Java implementation of OpenLCB / LCC protocols. This implementation operates inside JMRI, and can also be used inside Android apps.
6 stars 9 forks source link

Datagram buffer state machine problem #128

Open balazsracz opened 6 years ago

balazsracz commented 6 years ago

In certain situations the datagram handler in the openlcb library locks up in a way that it becomes impossible to send datagrams anymore. A good way to trigger this is to have a target node that does not send back ack or nack for the datagram (but presumably it times out). I saw the following on the console:

2018-05-10 17:53:17,877 ptionhandler.UncaughtExceptionHandler ERROR - Uncaught Exception caught by jmri.util.exceptionhandler.UncaughtExceptionHandler [Olcb-Pool-1-Thread-22] java.lang.IllegalStateException: Timer already cancelled. at java.util.Timer.sched(Timer.java:397) at java.util.Timer.schedule(Timer.java:193) at org.openlcb.implementations.DatagramMeteringBuffer$MessageMemo.startTimeout(DatagramMeteringBuffer.java:213) at org.openlcb.implementations.DatagramMeteringBuffer$MessageMemo.forwardDownstream(DatagramMeteringBuffer.java:203) at org.openlcb.implementations.DatagramMeteringBuffer$MessageMemo.sendIt(DatagramMeteringBuffer.java:199) at org.openlcb.implementations.DatagramMeteringBuffer$Consumer.consume(DatagramMeteringBuffer.java:338) at org.openlcb.implementations.DatagramMeteringBuffer$Consumer.run(DatagramMeteringBuffer.java:325) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) 2018-05-10 18:03:15,565 implementations.DatagramService ERROR - Overriding datagram transmit memo. old DatagramServiceTransmitMemo to 06.01.00.00.00.04: 20.00.7F.00.00.00.F8.00.00.00.01 new DatagramServiceTransmitMemo to 06.01.00.00.00.04: 20.40.7F.00.00.00.F8.04 [AWT-EventQueue-0]

pabender commented 6 years ago

@balazsracz

I'm trying to figure out what would cause this.

The only thing that SHOULD cause this is canceling the timer in the datagram metering buffer.

The datagram metering buffer cancels it's timer when dispose is called on the object.

The only place dispose is called on a datagram metering buffer is when dispose is called on an OlcbInterface object. OlcbInterface is the only place where a DatagramMeteringBuffer object is created.

So, I can conclude from that that the error message you refer to above is indicating that something tried to send a message to the OlcbInterface associated with the connection, but that OlcbInterface has been closed. This should not happen during normal operation (JMRI never actually closes the interface during normal operation, except possibly when the application is closing via a shutdown hook).

We can suppress the error by catching it, but that won't fix a threading issue if there really is one.

I'm afraid I need more information to do further debugging. A step by step testing procedure would be a good start.

balazsracz commented 1 year ago

I see this situation quite a lot during development, and I associate it with a node disappearing from the network, then JMRI trying to talk to it using datagrams. Something like this: