openhab / org.openhab.binding.zigbee

openHAB binding for ZigBee
Eclipse Public License 2.0

Extremely slow execution of commands #392

Closed cdjackson closed 5 years ago

cdjackson commented 5 years ago

In heavily loaded systems, the Thing.scheduler can add huge delays to command execution. We probably need to look at removing this and simply using a thread.

With the following code (slightly cut down to primarily just show the debug) -:

    @Override
    public void handleCommand(final ChannelUID channelUID, final Command command) {
        logger.debug("{}: Command for channel {} --> {} [{}]", nodeIeeeAddress, channelUID, command,
                command.getClass().getSimpleName());

        Runnable commandHandler = new Runnable() {
            @Override
            public void run() {
                logger.debug("{}: Sending command to channel {} --> {}", nodeIeeeAddress, channelUID, command);
                try {
                    if (command == RefreshType.REFRESH) {
                        handler.handleRefresh();
                    } else {
                        handler.handleCommand(command);
                    }
                } catch (Exception e) {
                    logger.debug("{}: Exception sending command to channel {}", nodeIeeeAddress, channelUID, e);
                }
            }
        };
        try {
            // Thing.scheduler is shared by all bindings - on a heavily loaded
            // system this call is where the long delay is introduced
            scheduler.schedule(commandHandler, 0, TimeUnit.MILLISECONDS);
            logger.debug("{}: Scheduled sending command to channel {} --> {}", nodeIeeeAddress, channelUID, command);
        } catch (Exception e) {
            logger.debug("{}: Exception sending command to channel {}", nodeIeeeAddress, channelUID, e);
        }
    }

We have the following -:

16-Feb-2019 13:51:11.573 [DEBUG] [.openhab.binding.zigbee.handler.ZigBeeThingHandler] - 001788010017CA08: Command for channel zigbee:device:19b2af56:001788010017ca08:001788010017CA08_11_color --> ON [OnOffType]
16-Feb-2019 13:51:11.574 [DEBUG] [.openhab.binding.zigbee.handler.ZigBeeThingHandler] - 001788010017CA08: Scheduled sending command to channel zigbee:device:19b2af56:001788010017ca08:001788010017CA08_11_color --> ON
16-Feb-2019 13:55:27.384 [DEBUG] [.openhab.binding.zigbee.handler.ZigBeeThingHandler] - 001788010017CA08: Sending command to channel zigbee:device:19b2af56:001788010017ca08:001788010017CA08_11_color --> ON

That's 4.5 minutes to run the command - not especially acceptable ;)

This is running on an RPi3 with 19 bindings running - CPU usage averages below 20% and RAM is at 50%. (Note: it's not my system, but one belonging to someone I've been helping to debug.)

hsudbrock commented 5 years ago

Do you have an idea why the Thing.scheduler adds such huge delays?

cdjackson commented 5 years ago

No - none at all, but it was clear that the delays are enormous.

The user was adamant that their system was not very heavily loaded, but as this scheduler is used by all bindings, and he has 19, I'm assuming that something somewhere is heavily loaded.

My thought at the moment is that a thread should probably be used for these time-critical needs, or we could use a local scheduler (which might be better).
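
For concreteness, the local-scheduler idea might look something like this (a sketch only - the field name and pool size are illustrative, not a proposed implementation) -:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Binding-local scheduler, independent of the Thing.scheduler that is
    // shared by all bindings; pool size of 2 is illustrative only
    private final ScheduledExecutorService localScheduler =
            Executors.newScheduledThreadPool(2);

    // ...then in handleCommand():
    localScheduler.schedule(commandHandler, 0, TimeUnit.MILLISECONDS);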

cdjackson commented 5 years ago

@hsudbrock in #393 we use the following -:

    private ExecutorService commandScheduler = ThreadPoolManager.getPool("zigbee-thinghandler-commands");

Do you know what actually gets created if the scheduler doesn't exist?

I've been chasing problems with commands for the past few nights and I'm wondering if the problem is here. I see the commands come into the binding, but they are then slow to be passed to the transaction manager (I'm still trying to confirm that's the case, but it's my suspicion).

All the commands (let's say 30 commands from a group!) arrive at the same time in the log, but they don't all get queued in the transaction manager at that time - some get queued immediately, and others only get queued later.

We have three queues and schedulers; I've now increased all the ones in the ZSS library to 30 threads, and it made no difference, so I'm thinking the problem might still be in the binding.

hsudbrock commented 5 years ago

This creates a QueueingThreadPoolExecutor (link) with a default pool size of 5 (unless configured otherwise; the configuration can differ in each setup, as it is configurable via the OSGi config admin).

According to the Javadoc of QueueingThreadPoolExecutor, new tasks are queued until a thread gets free; the threads adding the commands to the transaction manager queue should finish fast, right? Then this does not explain what you observe, as a new thread should be available immediately after one command was queued.

To check your assumption that this thread pool is the culprit, you could change to ThreadPoolManager.getScheduledPool("zigbee-thinghandler-commands") and see if the issue goes away (as this would be just like what was done before via the BaseThingHandler, just with a ZigBee-binding-specific pool). WDYT?

cdjackson commented 5 years ago

Thanks Henning.

Then this does not explain what you observe, as a new thread should be available immediately after one command was queued.

Well, I don't think that's correct, and I think this is the source of the problem. While the queue manager will queue quickly and return, the converter will block waiting for the response to the transaction, so the thread will not complete until the response is received.

Eg - the color converter -:

        if (on) {
            // get() blocks until the ZigBee transaction response arrives -
            // potentially many seconds with some dongles
            clusterOnOff.onCommand().get();
        } else {
            clusterOnOff.offCommand().get();
        }

So this thread will not complete until the transaction response is received - that might be a number of seconds later (up to 30 with Ember dongles!). So after 5 commands are queued, the next command thread will not run until the first has completed.

I think this needs to be resolved somehow. I'm not sure it's 100% simple either. We could "simply" state that all converters need to return immediately - this might be fine when sending a simple On/Off command, as we can just not worry about the .get(), but it may be more difficult in converters that need to send multiple commands (the color converter is an example, as you can't set all HSB components in one command). We could again just fire the commands down into the stack, have the converter return immediately, and hope that there are no ordering issues etc, but that doesn't really sound nice either.
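
To illustrate the two styles for the On/Off case (a sketch, using the clusterOnOff calls quoted above) -:

    // Blocking style: the pool thread is held until the transaction completes
    clusterOnOff.onCommand().get();

    // Fire-and-forget style: return immediately and ignore the Future
    clusterOnOff.onCommand();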

So... I think that the queueing here is probably the root of the problem if it's only allowing 5 threads to execute - the question is how to resolve it. The problem I'm trying to solve is people have groups with 30 lights - I can increase the Ember queueing to allow 30 unicasts to be outstanding at once, and I can increase the transaction manager to allow 30 frames to be sent to the dongle, but none of this helps if the binding doesn't feed the system in the first place, then it all hangs at source.

WDYT?

1. Do we simply increase the number of threads? This will help, but can never guarantee it will solve all issues (do we cater for people with 100 lights?).
2. Do we change the converters and state that they need to return immediately? I'm not super keen on this as a blanket statement, although where possible, we probably should design converters to do this as it will at least help the issue.
3. ???

hsudbrock commented 5 years ago

Good point, Chris. Let me also think about this, I will comment here again.

hsudbrock commented 5 years ago

Just for the record here, results from a first test:

I tested with (a) ESH's ThreadPoolManager.getPool(...), (b) ESH's ThreadPoolManager.getScheduledPool(...), and (c) Executors.newCachedThreadPool(), where (b) is the variant that was implemented before the change in #393.

(a) and (b) both have the issue that 5 longer-running tasks will block the pool, preventing further tasks from executing until a longer-running task has finished. With (c), one new thread was created for each longer-running task, so there was no blocking as with (a) and (b), but many threads were created, which might be an issue for not-so-powerful hardware.
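
For reference, the three variants as code (a sketch; the pool name is the one used in #393):

    // (a) shared queueing pool - bounded at 5 threads by default
    ExecutorService a = ThreadPoolManager.getPool("zigbee-thinghandler-commands");

    // (b) shared scheduled pool - the pre-#393 approach
    ScheduledExecutorService b = ThreadPoolManager.getScheduledPool("zigbee-thinghandler-commands");

    // (c) unbounded cached pool - one thread per concurrent task; idle
    // threads are reclaimed after 60 seconds
    ExecutorService c = Executors.newCachedThreadPool();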

cdjackson commented 5 years ago

Thanks Henning - I guess this is no surprise :)

As I think we discussed previously, even with (c), while many threads are created, they are re-used and the pool will shrink again after 60 seconds. You could alternatively use a relatively large fixed-size pool (e.g. 30) - that would at least eliminate the uncertainty.

FYI I plan to provide a test binding to the user who is experiencing the problem so I'll report back...

hsudbrock commented 5 years ago

One question on the example for the color converter you mention above:

        if (on) {
            clusterOnOff.onCommand().get();
        } else {
            clusterOnOff.offCommand().get();
        }

The code does nothing with the result of the blocking get() call; what is the reason to call get() instead of returning immediately without blocking? (As is done, e.g., in the same converter in changeBrightness when calling clusterOnOff.offCommand().)

cdjackson commented 5 years ago

The code does nothing with the result of the blocking get() call

I agree - that's what I said earlier: we could avoid this delay in some cases, for example in the OnOff commands, but I'm not sure that we can do this in all cases, and it is not something that can be guaranteed (I think) -:

Do we change the converters and state that they need to return immediately? I'm not super keen on this as a blanket statement, although where possible, we probably should design converters to do this as it will at least help the issue.

I think there will be cases where order is important, and we need to send multiple commands in a converter, so I don't think we can rely on this?

Or do you disagree and you are saying that we should state that all converters must return immediately?

cdjackson commented 5 years ago

So the color converter is also an example of where the handleCommand cannot return immediately, and also probably needs the .get().

    public void handleCommand(final Command command) {
        try {
            if (command instanceof HSBType) {
                HSBType color = (HSBType) command;
                PercentType brightness = color.getBrightness();

                // changeBrightness blocks on get(), so the device has processed
                // the brightness change before the delay and color change below
                changeBrightness(brightness);

                if (delayedColorChange && brightness.intValue() != lastHSB.getBrightness().intValue()) {
                    Thread.sleep(1100);
                }

                if (supportsHue) {
                    changeColorHueSaturation(color);
                } else {
                    changeColorXY(color);
                }
            } else if (command instanceof PercentType) {
                changeBrightness((PercentType) command);
            } else if (command instanceof OnOffType) {
                changeOnOff((OnOffType) command);
            }
        } catch (InterruptedException | ExecutionException e) {
            logger.warn("{}: Exception processing command", endpoint.getIeeeAddress(), e);
        }
    }

In the above, changeBrightness uses a blocking .get(). Without it, the sleep() that follows would not really wait for the device as needed here, and the subsequent color command would then fail. So this is a good example of my earlier point: I don't think we can state that converters MUST return immediately. Where possible they should, but I don't think we can enforce it.

It's also worth adding that this is why there is a .get() in the earlier code: the changeBrightness method can also call that code, and to ensure correct ordering it needs to wait for the On before sending the Color.

hsudbrock commented 5 years ago

I think there will be cases where order is important, and we need to send multiple commands in a converter.

I agree, there will be cases where multiple commands need to be sent in the correct order.

Or do you disagree and you are saying that we should state that all converters must return immediately?

I think that it would be nice if the converters returned immediately. Considering the example with multiple calls in the color converter, I see only two ways this could be done, but I think that neither is an option for now:

  1. One way to make converters return immediately would be creating a thread in the converter's handleCommand - but that just shifts the thread-pooling issue to the converter, so I don't consider this a good option at all.
  2. The second way would be possible if the ZSS library returned CompletableFutures instead of Futures. With CompletableFutures, the converter could simply chain multiple ordered calls, which would make it possible to return quickly. In addition, this would avoid having to block a thread waiting for the transaction to complete. I find this idea somewhat appealing because it avoids blocking. But it would be a bigger change in the ZSS library (which is not really straightforward, as one would have to think of some way to tell the ZSS library which executor to use for completing the futures). Maybe something to think about in the future - a rough sketch follows.
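
A rough sketch of the chaining idea (hypothetical: it assumes the ZSS cluster commands returned CompletableFuture, which they do not today; clusterColorControl, moveToColorCommand, and its parameters are illustrative):

    // Hypothetical: if onCommand() and moveToColorCommand() returned
    // CompletableFuture, ordered commands could be chained without
    // blocking a pool thread
    clusterOnOff.onCommand()
            .thenCompose(onResult -> clusterColorControl.moveToColorCommand(x, y, transitionTime))
            .exceptionally(e -> {
                logger.warn("Command chain failed", e);
                return null;
            });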

WDYT?

  1. Do we simply increase the number of threads? This will help, but can never guarantee it will solve all issues (do we cater for people with 100 lights?).
  2. Do we change the converters and state that they need to return immediately? I'm not super keen on this as a blanket statement, although where possible, we probably should design converters to do this as it will at least help the issue.
  3. ???

As mentioned above, I don't see how to easily go with solution 2. I think, however, that it would be worthwhile to remove those blocking calls in the converters that are actually not needed right now.

I don't see any better option than solution 1 that we can establish quickly. It would probably be useful to make the thread pool size configurable, using as default a sane value for "normal-size" setups (maybe 15?). We could document in the README how users can configure the thread pool size in non-standard cases like "user with 100 lights". The configuration could be done similar to how users can already tune the size of the currently used thread pool, using a configuration like in the openHAB demo setup here.
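
For example (assuming the standard ESH thread pool configuration mechanism and the pool name used in #393), a single line in services.cfg would then set the pool size:

    org.eclipse.smarthome.threadpool:zigbee-thinghandler-commands=30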

cdjackson commented 5 years ago

One way to make converters return immediately would be creating a thread in the converter's handleCommand - but that just shifts the thread-pooling issue to the converter, so I don't consider this a good option at all.

This is what we have now, isn't it? The handleCommand method already runs in its own thread, so there is no point in spawning another thread and returning in the original thread - that results in the same situation.

The second way would be possible if the ZSS library would return CompletableFutures instead

This is Java 8 though, right? If so, it's not possible at the moment.

I think, however, that it would be worthwhile to remove those blocking calls in the converters that are actually not needed right now.

I agree - if there are any, but so far I've not found any that can be removed.

I don't see any better option than solution 1 that we can establish quickly.

I don't see how that can help - ultimately you're just starting more threads. Or are you suggesting not to use a thread pool? If so, I'm not sure it's a good idea - but if that is what you're proposing, then why not just increase the current pool size, or use a boundless pool?

It would probably be useful to make the thread pool size configurable

Now I'm confused. So you want to increase the thread pool size, and also spawn another thread in the converter? I really don't understand your reasoning here so I'm probably missing something.

To explain my point above - i.e. why I think it's pointless to spawn a new thread in the converter - this is what we have:

    ZigBeeThingHandler.handleCommand {
        new Runnable {
            Converter.handleCommand() {
            }
        }
    }

I therefore don't see any point in adding another thread inside Converter.handleCommand, but maybe I'm missing your point somewhere?

hsudbrock commented 5 years ago

Could it be that there is a misunderstanding? With "solution 1." and "solution 2." I was referring to "1." and "2." from the quotation from your comment (the quotation starting with 'WDYT?') - and not to "1." and "2." from the text above that quotation. Sorry if that was not clear.

cdjackson commented 5 years ago

Ah - sorry - I thought you were referring to the 1 and 2 in your message.

This setting probably needs to be applied reasonably consistently in different places - in the dongle and transaction manager as well as the binding - so it would be nice if we could specify it as a config parameter?

hsudbrock commented 5 years ago

I agree that it would be good to have a config parameter for the thread pool size.

hsudbrock commented 5 years ago

Just one comment on "boundless pool" vs. "bounded pool with configurable size": I think that an unbounded pool could be an issue for really slow hardware (like the ones I have to deal with ;) ). But maybe we could have some "special value" for the config parameter that translates to an unbounded pool size (e.g., when setting the config parameter to -1).
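
A sketch of that special-value handling (poolSize stands for the proposed, hypothetical config parameter):

    // Hypothetical: poolSize comes from the binding configuration;
    // -1 selects an unbounded cached pool, any other value a fixed bound
    ExecutorService commandExecutor = (poolSize == -1)
            ? Executors.newCachedThreadPool()
            : Executors.newFixedThreadPool(poolSize);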

cdjackson commented 5 years ago

I don’t mind to be honest - I’m happy setting a bound if the user can control it. It’s probably safer that way - I only suggested the boundless idea when I thought you were suggesting to spawn a thread in the converter.


hsudbrock commented 5 years ago

Ok, sounds good

cdjackson commented 5 years ago

This is closed by #393

openhab-bot commented 1 year ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/hue-binding-for-api-v2-3-4-0-0-4-0-0-0/146799/15

openhab-bot commented 1 year ago

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/hue-binding-for-api-v2-3-4-0-0-4-0-0-0/146799/20