openlcb / OpenLCB_Java

Java implementation of OpenLCB / LCC protocols. This implementation operates inside JMRI, and can also be used inside Android apps.
6 stars 9 forks source link

Threading improvements #122

Closed pabender closed 6 years ago

pabender commented 6 years ago

Put most threads into a thread pool and allow those to be terminated.

This at least partly addresses #119. There are still a large number of threads created during JMRI testing that I can't terminate, most of them are timer threads.

@balazsracz If you get a chance, can you try this to make sure I didn't break something.

Once this is merged in, we'll need to make another release.

pabender commented 6 years ago

So I am trying to figure out why this compiles with ant but not maven.

pabender commented 6 years ago

Current status is that org.openlcb.implementations.BitProducerConsumerTest is hanging.

If I increase the number of threads in the thread pool, I can get it to run to completion, BUT the whole reason for introducing the thread pool was to make sure that we had an easy way to terminate threads.

What that means is that the tests here are NOT cleaning up after themselves well enough and they are contaminating the test space.

The only reason this worked previously is that every test created threads that were never cleaned up. That isn't something we can go back to. I'll have to look at this more tomorrow.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-0.9%) to 69.082% when pulling bc6b0c170cb89247060c9306e182bc2f3436923e on pabender:threadcontrol into 5c936212adf7421bbeeba25d10983ea08653fb55 on openlcb:master.

pabender commented 6 years ago

OK, tests are passing now (again)...

balazsracz commented 6 years ago

Hey @pabender thanks for working on this fix. I will have the chance to look at this code and test it in the next two days.

In the meantime, do you happen to have a list of the thread names that we are leaking in the test suite running? I think in one of the past efforts we have added thread names to everything that we ever start. Having the thread names would help us identify if there is a specific problem we could attack. For example I have been suspecting the DatagramMeteringBuffer to be leaking threads whenever a datagram is not responded okay. We could rewrite it to use a hashtable of state machines and not use threads at all.

thanks, Balazs

pabender commented 6 years ago

Sent from my iPad

On Mar 15, 2018, at 9:55 AM, Balazs Racz notifications@github.com wrote: In the meantime, do you happen to have a list of the thread names that we are leaking in the test suite running? I think in one of the past efforts we have added thread names to everything that we ever start. Having the thread names would help us identify if there is a specific problem we could attack. For example I have been suspecting the DatagramMeteringBuffer to be leaking threads whenever a datagram is not responded okay. We could rewrite it to use a hashtable of state machines and not use threads at all.

I have actually replaced most of the individual threads with a thread pool, which limits the total number of threads for each connection. Those are also easier to terminate because you can just shut down the whole thread pool.

Most of what we are leaking right now are timer threads. ( which I named in a PR last month ).

To give you an idea of the improvement, in the last week alone, I pulled the number of threads in the JMRI test suite down from over 2000 to about 900, with the code here in place.

Paul

pabender commented 6 years ago

@balazsracz the timers that are left running during the tests are: OpenLCB Memory Configuration Service Retry Timer OpenLCB NIDaAlgorithm Timer OpenLCB Mimic Node Store Timer

Paul

pabender commented 6 years ago

Sent from my iPad

On Mar 25, 2018, at 11:13 AM, Balazs Racz notifications@github.com wrote:

@balazsracz commented on this pull request.

In src/org/openlcb/OlcbInterface.java:

@@ -271,8 +282,23 @@ public void registerStartNotification(ConnectionListener c) { */ public void dispose(){ // shut down the thread pool

  • if(threadPool != null) {
  • threadPool.shutdownNow();
  • if(threadPool != null && !(threadPool.isShutdown())) {
  • // modified from the javadoc for ExecutorService
  • threadPool.shutdown(); // Disable new tasks from being submitted
  • try {
  • // Wait a while for existing tasks to terminate
  • if (!threadPool.awaitTermination(10, TimeUnit.SECONDS)) { I think this delay is causing problems. Here are timings from 'ant run' that runs the tests: Before your PR: BUILD SUCCESSFUL Total time: 8 seconds

After your PR: BUILD SUCCESSFUL Total time: 3 minutes 8 seconds

The problem is that as far as I can see shutdown() does not actively interrupt the tasks, and there are tasks that are waiting for interruptedexception in order to terminate. Therefore we will deterministically take 10 seconds to exit.

Maybe we should try shutdownNow() straight away. It is meaningless to wait until we have actually sent out some notification that threads are requested to terminate.

This is the recommendation procedure from oracle’s documentation. It gives the threads in the pool a chance to shutdown on their own before we force them to shutdown.

I realize this code creates a significant increases in the time to run these tests, but it also significantly increase the reliability of the shutdown process. In testing, I will take reliability over speed every time. The previous iteration of this just used shutdownNow, and left a large number of threads running.

In src/org/openlcb/implementations/DatagramMeteringBuffer.java:

@@ -267,8 +280,23 @@ public void handleDatagramRejected(DatagramRejectedMessage msg, Connection sende */ public void dispose(){ // shut down the thread pool

  • if(threadPool != null) {
  • threadPool.shutdownNow();
  • if(threadPool != null && !(threadPool.isShutdown())) { We need to add a check here that we are only shutting down threadpools that we created ourselves. We should not shut down a threadpool that was passed in.

The production code, we should only ever have one threadpool, and it should be shutdown before we get to this code. That is why the first thing we do is check to see if the threadpool is shut down.

This is primarily in place for unit testing, where we may test the class independently.

balazsracz commented 5 years ago

@pabender I have a problem with this PR. I am now unable to run the tests as part of the development process, because every single test function takes 20 seconds to run. Looking at the thread dump it is sitting in the await shutdown command. I am used to pressing a single button in my editor and running all the tests with it; that used to take about 10 seconds or so. Now it takes more than 10 minutes (and counting).

Please propose some way to ensure that we are not blocking the progress of tests with this shutdown method. This is not a reliability question, because the test has already passed or failed by the time it gets to the shutdown.

pabender commented 5 years ago

@balazsracz

While I understand that you are frustrated at the extended time this takes to run tests, the problem really is that the threads were never cleaned up properly in the first place, so your “running all, tests in 10 seconds” was a false sense of the amount of time the tests took ( and for the record, I have never been able to match anything approaching that ). On my development machine, which is older and slower, it takes about 10 minutes to compile and run the full test suite. The changes from this PR did not significantly impact that time.

It MIGHT be possible to setup the connection resources ( and therefore threads ) once per test file, at least for some files. That would take some time to implement, and it is a VERY low priority to for me.

I think the bottom line is that we need to find a way to reduce the number of threads we are using. This is how we are going to be more friendly to applications that use the OpenLCB library. JMRI issue https://github.com/JMRI/JMRI/issues/5665 is caused by this.

balazsracz commented 5 years ago

I don't quite understand how being nice to an application in thread count has anything to do with delaying shutdown in tests. An application almost never calls shutdown, therefore it is meaningless to it what the shutdown behavior is.

-- Sent from my mobile device

pabender commented 5 years ago

Sent from my iPhone

On Sep 26, 2018, at 7:12 PM, Balazs Racz notifications@github.com wrote:

I don't quite understand how being nice to an application in thread count has anything to do with delaying shutdown in tests.

Being nice to applications is not just about being nice at runtime. We also have to be nice to those applications when tests of those applications are running, even if the repeatedly create and destroy instances of objects in the library.

Remember that the whole reason I started down this path is that the thread count had gotten out of control in the JMRI tests because the threads produced by the OpenLCB library were not properly terminated and there was no mechanism to do that. It was bad enough in the JMRI tests that the OpenLCB threads, practically by themselves, were exhausting the available heap space.

The fact that our tests actually clean up after themselves is important to show the mechanism works. In other words it is part of the process of testing the library.

An application almost never calls shutdown, therefore it is meaningless to it what the shutdown behavior is.

At runtime yes, but not during testing.