Closed TheTrueRandom closed 1 year ago
@lolodomo Do you think we need an alive event here, too?
@J-N-K there is already an alive event from what I can see from the logs. However I don't see how it could solve any of those problems, as in all 3 cases the connection is successful and open.
Correct. Maybe 1 minute is too long and the connection dies before? But I don't understand why the subscription is not found. I'll try to check that.
For the client the connection is never closed (so I guess the server keeps it open), in my provided script (or also with curl) the connection stays open and will never be terminated by the server.
I just tried with your script above (slightly modified):
import got from 'got';
const res = await got.post(`http://localhost:8080/rest/sitemaps/events/subscribe`).json();
const location = res.context.headers.Location;
// choosing any time >= 2000 here will reproduce this issue reliably.
await new Promise(resolve => setTimeout(resolve, 4000));
const stream = await got.stream(`${location}?sitemap=default&pageid=default`);
for await (const data of stream) {
console.log(data.toString());
}
and the result is:
18:33:02.120 | DEBUG | qtp77466690-791 | est.sitemap.internal.SitemapResource | 179 | Client from IP 127.0.0.1 requested new subscription => got id 8b6e620a-e218-45a4-9bce-80632030e668.
18:33:02.282 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateEvent | 172 | Item 'Local_computer_Verfugbar' shall update to 5156 MiB
18:33:02.284 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateUpdatedEvent | 172 | Item 'Local_computer_Verfugbar' updated to 5406457856 B
18:33:06.179 | DEBUG | qtp77466690-379 | est.sitemap.internal.SitemapResource | 179 | Client from IP 127.0.0.1 requested sitemap event stream for subscription 8b6e620a-e218-45a4-9bce-80632030e668.
18:33:12.291 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateEvent | 172 | Item 'Local_computer_Verfugbar' shall update to 5051 MiB
18:33:12.296 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateUpdatedEvent | 172 | Item 'Local_computer_Verfugbar' updated to 5296357376 B
18:33:12.315 | DEBUG | OH-items-2 | est.sitemap.internal.SitemapResource | 179 | Sent sitemap event for widget 0000 to subscription 8b6e620a-e218-45a4-9bce-80632030e668.
18:33:22.295 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateEvent | 172 | Item 'Local_computer_Verfugbar' shall update to 5125 MiB
18:33:22.298 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateUpdatedEvent | 172 | Item 'Local_computer_Verfugbar' updated to 5373952000 B
18:33:22.299 | DEBUG | OH-items-1 | est.sitemap.internal.SitemapResource | 179 | Sent sitemap event for widget 0000 to subscription 8b6e620a-e218-45a4-9bce-80632030e668.
18:33:32.302 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateEvent | 172 | Item 'Local_computer_Verfugbar' shall update to 5094 MiB
18:33:32.304 | INFO | OH-eventexecutor-1 | openhab.event.ItemStateUpdatedEvent | 172 | Item 'Local_computer_Verfugbar' updated to 5341446144 B
18:33:32.305 | DEBUG | OH-items-2 | est.sitemap.internal.SitemapResource | 179 | Sent sitemap event for widget 0000 to subscription 8b6e620a-e218-45a4-9bce-80632030e668.
1
and
event: event
data: {"widgetId":"0000","label":"Verfügbar [5051 MiB]","visibility":true,"item":{"state":"5296357376 B","stateDescription":{"pattern":"%.0f MiB","readOnly":true,"options":[]},"type":"Number:DataAmount","name":"Local_computer_Verfugbar","label":"Verfügbar","category":"","tags":["Point"],"groupNames":[]},"descriptionChanged":false,"sitemapName":"default","pageId":"default"}
event: event
data: {"widgetId":"0000","label":"Verfügbar [5125 MiB]","visibility":true,"item":{"state":"5373952000 B","stateDescription":{"pattern":"%.0f MiB","readOnly":true,"options":[]},"type":"Number:DataAmount","name":"Local_computer_Verfugbar","label":"Verfügbar","category":"","tags":["Point"],"groupNames":[]},"descriptionChanged":false,"sitemapName":"default","pageId":"default"}
So the question is why does it occur for me and not for you. Also I noticed the script does not 100% reproduce it for me and it is definitely depending on the pageId. I think it is related to how many updates/changing items a page has. For my "electricity" page where i show multiple items which update frequently (PV, battery, grid power, device power, ..., each 1-2 times per second) the script can reproduce it about 70% of the time, and better with a longer sleep time. For other pages it is more difficult to reproduce.
Maybe after all it is related to load? I think also on the android app during the day the sitemaps are not refreshing way more often than at night, where way less items update per second.
Also would be interesting if issue 2 and 3 from the bug description also do not occur for you as they occur for me.
Do you know if there is a good guide to attach a debugger to my openhabian? If yes I can continue debugging if I find something.
My development system is started with start_debug.sh
, I don't know if that can be configured in openhabian.
You can then use remote debugging (in IntelliJ IDEA: "Remote JVM Debug", debugger mode "Attach to remote JVM", command line "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005"). I believe there is a similar setting in Eclipse, but I don't know.
Make sure that you checkout the openhab-core repository and use the 3.4.x branch, so you have the correct sources.
Will try that thanks, also I wonder if your comment from #2891 is related
It has been reported in the past that high-traffic applications like HABApp and systems with a large number of subscribers are affected by performance limits in the REST/SSE implementation.
Especially during the day I would consider my openhab as high-traffic. Actually I think it could be about 50 item updates per second (a request to /rest/events
gives about 100 events/s). Besides this reported issue I did not encounter any problems though.
Also I noticed issue 2 and 3 do not occur at night, issue 1 still occurs. Right now I can only imagine it's somehow related to the lower rate of item updates at night.
@J-N-K I enabled trace logs for org.openhab.core.io.rest
and I noticed something interesting.
This log line should print the subscription count. When I have multiple active subscriptions I still see many of the following logs:
2023-05-20 21:19:13.911 [TRACE] [.openhab.core.io.rest.SseBroadcaster] - broadcast to potential 0 sinks
2023-05-20 21:19:13.913 [TRACE] [.openhab.core.io.rest.SseBroadcaster] - broadcast to potential 0 sinks
Nearly for all item updates I see that log line with 0 sinks. But then, in the middle of the 1000 logs (TRACE logs quite some), I can see the following:
2023-05-20 21:19:14.376 [TRACE] [.openhab.core.io.rest.SseBroadcaster] - broadcast to potential 0 sinks
2023-05-20 21:19:14.413 [TRACE] [.openhab.core.io.rest.SseBroadcaster] - broadcast to potential 6 sinks
2023-05-20 21:19:14.414 [DEBUG] [est.sitemap.internal.SitemapResource] - Sent sitemap event for widget 08 to subscription 77fb77d2-5a97-4e71-af94-a9192395da7f.
How is this possible? the number of sinks is varying without the sinks changing. This aligns with the issue 2 and 3 I described in the description. Any ideas on that?
Edit: Nvm I think it is unrelated. Just now (at around 21:30 CEST) magically all sitemaps/subscriptions work unconditionally (as every day around that time). When checking the trace logs then they are similar. I think this is because there are multiple SseBroadcaster instances.
placing a breakpoint here and here conditionally for one item which is updated every 1s with the openhab rest api shows the following, so somewhere in openhab (core) the events get lost.
Sun May 21 13:34:11 CEST 2023 rest received -6213
Sun May 21 13:34:12 CEST 2023 rest received -6225.1
Sun May 21 13:34:13 CEST 2023 rest received -6250.5
Sun May 21 13:34:16 CEST 2023 rest received -6209
Sun May 21 13:34:18 CEST 2023 rest received -6219.8
Sun May 21 13:34:19 CEST 2023 rest received -6180.8
Sun May 21 13:34:20 CEST 2023 rest received -6170.900000000001
Sun May 21 13:34:20 CEST 2023 rest received -6191.1
Sun May 21 13:34:21 CEST 2023 rest received -6184.900000000001
Sun May 21 13:34:21 CEST 2023 rest received -6171.3
Sun May 21 13:34:23 CEST 2023 rest received -6254.6
Sun May 21 13:34:25 CEST 2023 rest received -966.9000000000001
Sun May 21 13:34:27 CEST 2023 rest received -6186.8
Sun May 21 13:34:29 CEST 2023 rest received -6191.6
Sun May 21 13:34:29 CEST 2023 rest received -968.5
Sun May 21 13:34:29 CEST 2023 sitemap received Grid [-6187 W]
Sun May 21 13:34:30 CEST 2023 rest received -6190.400000000001
Sun May 21 13:34:32 CEST 2023 rest received -1020.1
Sun May 21 13:34:32 CEST 2023 rest received -988.2
Sun May 21 13:34:33 CEST 2023 rest received -951.2
@J-N-K I'm quite sure I found the issue. This threadPoolExecutor right now has a task queue size of 984912 tasks for me.
I haven't figured out yet why it is so slow and the tasks pile up but I could imagine it's due to the same threadpool contains tasks for sitemap handling and also persistence handling.
Possible fixes I can think of right now:
ItemStatePredictedEvent
. Sitemap event sending is fast and cheap from what I can tell so far, there is no need to queue it to the item changed task executorThere was an issue with persistence handling. This has been fixed in OH4
@J-N-K checked with openhab4, it seems a bit faster but doesn't solve the issue, as in my last post described the queue still grows faster then it is processed. after 30min runtime it is at 15000. And also the CPU usage is incredible low. So it is probably still bottlenecked with i/o (as described in my previous post).
We should at very least use a different thread pool for sitemap updates as for persistence handling.
Do you know which task is the bottleneck? IIRC the one-thread solution was introduced to prevent issues with concurrent event processing in consumers.
Unfortunately I don't. The sitemap subscriptions seem to be fast though. And also as described earlier, the CPU load of openhab is very low (40% on htop with a 4 core system, so 400% would be available). Network traffic is for sure also not the bottleneck, as I can do (non sitemap related) subscriptions as many as I want (SSE or websocket) which are all supplied with all events without any delay or htop spikes. I did not dive into the persistence logic, however I could imagine it is related as I am persisting many items (never noticed anything wrong there though)
Sitemap subscriptions definitely need some fix (some suggestions mentioned earlier). They are completely unusable for openhab systems with many (30+) item updates per second. I'm surprised the issue was not noticed earlier. 30-50 updates per second does not sound much for me and also besides sitemap subscriptions openhab handles it with ease.
I suppose you could try disabling persistence temporarily (unless you've already tried that). That would rule persistence in or out as the culprit.
@mhilbush disabling the persistence will 99% mitigate the issue. Right now my setup is at the edge of what this single executor thread pool can manage. It's exactly 30 item changes per second. If I increase it to 35 item changes per second the queue will grow. Disabling the persistence will therefore not give us any hints, if it "solves" the issue, we still need a fix, as persistence is a wanted feature. And if it doesn't solve it we need the fix even more.
Interesting would be the ratio (of runtime) for persistence tasks and sitemap subscription tasks. To get good numbers I think the only good way to do that would be load testing. Nevertheless I would suggest to at least split the executors, so one will do persistence and another one sitemap tasks. With that separated I could also measure again for my setup where are the limits for each of those executors then (if load tests are too much effort)
Can you let us know why you have 30 item changes per second? That sounds awfully much
Can you let us know why you have 30 item changes per second? That sounds awfully much
Clearly! What kind of home automation system could require such amount of refreshed data every second?
30-50 updates per second does not sound much for me and also besides sitemap subscriptions openhab handles it with ease.
That is probably an order of magnitude more than I've experienced personally and I've seen in logs reported to the forum.
I'm not saying improving OH's performance here isn't a good idea, but at least in our experience that's well outside the expected number of changes for a home automation system.
Nevertheless I would suggest to at least split the executors, so one will do persistence and another one sitemap tasks.
I think first understanding why they are one thread in the first place needs to be understood. simply splitting them may solve this specific problem but reintroduce a problem for everyone else. Both need to be accommodated. I don't know the right keywords to find the PR where the one thread was implemented.
@rkoshak the code was introduced January 2019 (that's more than 4 years) with a single commit and never touched since then. I would assume it's forgotten with the introduction of OH3 ui.
@J-N-K @lolodomo why not? If you are interested I can show you my home automation system with openhab. I love it, it's blazing fast, I have multiple tablets in my home with openhab sitemaps open, showing information about photovoltaic, electricity usage of devices, charge rates of car(s), charge rates of batteries, water flow rates, pool control like filter pressure and water flow and much more. Seeing those values update in real time is amazing, it's quite a difference to seeing the PV output 1 time every 10s compared to see 1-2 times a second the output of every PV string. And that's not everything, There's much more on my mind to add.
As I said earlier, I'm running it on a raspbery pi, and it's not even close to struggle. CPU load is low and besides the sitemap subscriptions everything works perfectly and fast. And still I'm quite sure it can also handle that with ease, another point to proof that is that I can add as many subscriptions as I want and it's not affecting the issue, it's either all subscriptions (sitemaps) suffering at the same time or none. And even when the issue is occurring, load on htop is not high.
I'm preparing a test bundle that has a thread per subscriber. This should still ensure the order of the events when they arrive at the subscriber but make the subscribers independent of each other. Would you be willing to test that?
absolutely, thank you! Personally I'm struggling to compile openhab myself, otherwise I would have already tried a bit :S actually I was wondering if the order even matters, because even if the queue was stacked with 1 million tasks, if randomly a task would update just the items I have opened on a sitemap it would still get the latest value, I never saw something updating with the wrong value strangely (not 100% sure)
Edit: Like in this example, https://github.com/openhab/openhab-core/issues/3620#issuecomment-1556156449 here the queue was stacked with 1million tasks, though when the sitemap received an update it was the latest value.
@rkoshak the code was introduced January 2019 (that's more than 4 years) with a single commit and never touched since then.
That issue was part of moving all the Eclipse Smart Home stuff back into OH. I think we would need to root around the PRs for ESH to figure how why that change was made.
I would assume it's forgotten with the introduction of OH3 ui.
We don't know that it was UI driven in the first place. That's my main concern. Maybe it is "forgotten" by MainUI but something in persistence or elsewhere depends on it. I just don't want to break things blindly just because the original PR is old. I looks like @J-N-K has you covered so it's all good.
@rkoshak I also don't know. With my "expertise" looking through the code for a few days now, I would have used this method for sitemap subscription handling, just the same way it is used for websockets, which is really fast. And also this method is already used exactly for this usecase, just only for ItemStatePredictedEvent
and not for ItemStateChangedEvent
which I would be really interested in why.
@TheTrueRandom
On the KARAF console:
update org.openhab.core https://janessa.me/esh/org.openhab.core-4.0.0-SNAPSHOT.jar
then stop openHAB and re-start. This only works on OH4, latest SNAPSHOTs, otherwise there will be incompatibilities.
Ok thanks I will try. How to revert back to the original version?
Installing an arbitrary add-on should refresh the bundle to the original version. A restart is probably needed.
Tested it, I don't really see a difference. I think I applied it correctly, because on the UI it reads openhab 4.0.0
instead of openhab 4.0.0.M2
. Still 30 updates/s is the magical threshold. Going up to 40 or 50 results in the same issue. Inspecting the jar however I don't really see a change, still the same executor with 1 worker thread, placing a breakpoint shows the same behavior, the task queue is slowly increasing in size. load on htop is still not high (below 50%).
Also going back to the original version doesn't work. I tried as you suggested and also with the openhabian config (and a restart afterwards), still reads openhab 4.0.0
on the UI. How to go back to openhab 4.0.0.M2
?
If you see only one thread, then it is not working. Can you do
list -s -u org.openhab.core
You can hog back to the original version by installing an add-on.
START LEVEL 100 , List Threshold: 50
ID │ State │ Lvl │ Version │ Symbolic name │ Update location
────┼────────┼─────┼────────────────────┼──────────────────┼──────────────────────────────────────────────────────────────────────────────────────────
150 │ Active │ 80 │ 4.0.0.202305231607 │ org.openhab.core │ https://janessa.me/esh//org.openhab.core-4.0.0-SNAPSHOT.jar
I installed an add-on, with restart it did not go back to the original version. Could you share the source code? When decompiling it, QueueingThreadPoolExecutor.class still has only one queueThread and GenericItem.class.notifyListeners still has the same logic to invoke it.
The change is in the EventHandler
. It creates an executor per subscriber, so each subscriber itself is single-threaded.
Not sure how that would help for the issue. The sitemap subscription does not use the receive
method, only for irrelevant updates like mentioned here: https://github.com/openhab/openhab-core/issues/3620#issuecomment-1557739335
I have done some more refactoring for the sitemap resource. Pleas ADDITIONALLY
update org.openhab.core.io.rest.sitemap https://janessa.me/esh/org.openhab.core.io.rest.sitemap-4.0.0-SNAPSHOT.jar
cleaning cache should revert all changes
@J-N-K thanks! again didn't find the change you did, but whatever it is, that's definitely a huge improvement. Testing with slightly more than 50 updates/s, everything works perfectly. Sitemaps responses are fast and instant, nothing is delayed. CPU load (of my raspberry) is still low, slightly higher (obviously) at around 50%. Maybe you can push the code or open a PR, I'm quite interested how you did it.
I noticed some units are changed on the sitemap though, percentages now show e.g. 3700% instead of 37% and liters show 12000l instead of 12l, but I assume it's something fixed in 4.0 M2 (as with this version it works) while your changes are based on 4.0 ?
Sitemap Event Subscription is incredible unreliable. There are many scenarios where events are not sent or subscriptions stall. This occurs and can be reproduced within the browser (basic UI) and the android app (sitemaps).
Expected Behavior
All changes to items should be reflected in real time for the sitemap which is currently opened.
Current Behavior
There are many issues where this is not the case:
Subscription is stalled completely after creation (e.g. android app refresh swipe down, or browser reload). I managed to reproduce it reliably with this script, though I'm not 100% sure that this is what is happening when it occurs on the android app.
This will create the following log output (grepping for this subscription id):
Also very important to notice is that the connection is kept open! It is not closed as I think was desired here. This tricks the client to think the connection will send events but it never will. You can also easily reproduce it with curl and manually copy the subscription id from request 1 to request 2, as this will always take longer than 2 seconds.
Often if the subscription is successfully subscribed by the client, updates are sent not fast enough. I have a sitemap which updates items multiple times a second, but often I can discover this on the logs:
Here somehow only a few events are sent within minutes, while in reality multiple events should be sent per second. In this state, whenever I refresh the page/app manually it will update to the latest correct values which clearly indicates the current values were outdated.
I thought that openhab might be overloaded as indeed I have a setup with 10-30 item updates per second, but I don't think it is as CPU load is <= 50% and also everything loads incredible fast when refreshing manually and also all other services (like persistence) don't encounter any problems.
autoupdate="false"
are never (event-based) updated on the UI when changed on the UI. To reproduce, create such an item, include it to a sitemap asSetpoint
, write a script/rule which listens for commands of this item and update the state of the item. Now the logs look like this when changing it with the setpoint element:but the sitemap event stream never contains this change even for a working subscription (where point 1 and 2 did not occur).
Possible Solution
For me it looks like the sitemap event subscription could need a rework, maybe replacing it entirely with websockets as openhab supports this by now already.
Steps to Reproduce (for Bugs)
provided in the description
Context
I'm using sitemaps to show the state of my home. Sitemaps are awesome and in my opinion far superior to the OH3 UI (e.g. they are blazing fast, no cumbersome login setups required on android, native app navigation works, video streams work through openhab cloud, way easier to setup, etc.). I think it's a pity they don't see a lot of love since OH3 UI.
Your Environment