1.2.0 Update rendered implementation unusable

theOrakleAtBurns commented 11 months ago

I love the innovation and the additions, but I have 40 stations and 10 programs...

I'll let you do the math, but this created more entries than Hass could handle with your current threading.

EdLeckert commented 11 months ago

Well, that's an impressive system you have there.

Were you able to roll back to the previous version while we figure this out?

Also, may I ask what features of the integration you mostly use?

theOrakle commented 11 months ago

I used the wrong account... but same me...

I was able to roll back via HACS->redownload (select 1.1.7 for version). Cleaning up all the offline entities created took a bit more work, including deleting several 10K lines from .storage/core.entity_registry.

At a high level, I created a switch for each zone (for manual zone runs), let OS run the schedules, and just observe the states for visibility, overriding and disabling if it is currently raining...

EdLeckert commented 11 months ago

Sorry you had to deal with that. I've had HA upgrades blow up my system and it's no fun.

May I ask what you're using for your HA server?

Also, can you tell me a bit more about your experience with the update...briefly, what worked and what didn't?

theOrakle commented 11 months ago

No prob at all man - so hass runs on an ESX cluster of NUCs (i7, 32 GB) connected to an all-flash Synology array via 10 G iSCSI. The db is Postgres on another VM on the same cluster. Each node has 2 cores & 8 GB.

On the upgrade front:

It was the only custom component that was upgraded
Watched the logs after restart & it complained that a value needed to be 0-23
Navigated to the integration via hass just fine
Once I clicked on the device, it couldn’t render the 630 entities in under 2 minutes, and scrolling was unresponsive in Safari, Chrome, or the companion app

theOrakle commented 11 months ago

Note - I have more entities on other integrations with more devices.

theOrakle commented 11 months ago

Forgot to add… ran top on both hass and the db while I was waiting for the integration to render one of the times and saw no I/O or CPU wait

theOrakle commented 11 months ago

I have another environment (a Yellow with a 6 Gb/s NVMe) that I can use if you want me to test something for you

EdLeckert commented 11 months ago

Thanks for the offer. I might take you up on that.

So to summarize, the only performance-related issue you encountered was the loading of the device page for the OpenSprinkler integration. The rest of the system was working normally? And there was only one device for the integration?

theOrakle commented 11 months ago

You are correct sir.

EdLeckert commented 11 months ago

OK, I set up an environment to simulate your system as best as I can with only 8 stations and 1 program in my one controller. I modified the integration to create entities for 10 copies of the one program and 5 copies of the 8 stations, for a total matching your system. This resulted in the creation of 630 entities for the 1 device in the integration. I'm running VirtualBox on a laptop, Chrome on same laptop, wired connection to my OS controller. Far from an ideal comparison, but gives me an idea of the impact.

What I observed when using V 1.1.17 was that the device page loaded in under a second. Using V 1.2.1 it takes about 4 seconds. There is no interaction with the OS controller due to the page load (I reduced the frequency of polling from 5 seconds to 5 minutes for the test, and there was no traffic.) The page is simply loading from the HA entities.

So I'm not sure what was causing your browsers to hang. Perhaps some entities did not have a valid state yet? Maybe the 5 second poll interval is too fast for a system of your size with the added entities?

If you still want to test some ideas and can afford to create and destroy installations easily, there are a couple of things you could try using the latest release on your test system. It will still create the 630 entities if you don't comment out the lines below, so be warned.

Of the 630 entities, 400 are the station durations, one for each station for each program. If you go to opensprinkler/number.py you can comment out the following lines with a leading "#" and they won't get created. Then start HA and configure the integration. Mine took about 1 second to load the device page after this change.

    for _, program in controller.programs.items():
        for _, station in controller.stations.items():
            entities.append(
                ProgramDurationNumber(entry, name, program, station, coordinator)
            )

If you make this change after the entities are already created, they won't get deleted, but they won't be loaded into the device page.
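As a rough check of the numbers above, here is a minimal sketch of the entity arithmetic. The function name `duration_entity_count` is made up for illustration; the station, program, and total counts are the ones reported in this thread.

```python
def duration_entity_count(stations: int, programs: int) -> int:
    """One duration entity per (program, station) pair, mirroring the nested loop."""
    return stations * programs


STATIONS = 40          # reported in this issue
PROGRAMS = 10          # reported in this issue
TOTAL_ENTITIES = 630   # entities seen on the single device

durations = duration_entity_count(STATIONS, PROGRAMS)
remaining = TOTAL_ENTITIES - durations

print(f"duration entities: {durations}")
print(f"all other entities: {remaining}")
```

With the loop commented out, only the remaining entities would be created on a fresh install, which is consistent with the much faster device page load.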

If all goes well, you could try uncommenting the lines and lengthening the polling interval (i.e., polling less often). In opensprinkler/const.py change DEFAULT_SCAN_INTERVAL to something like 30 and restart HA.

If you have trouble with the device page at any point, you could check Developer Tools > States and see if the entities look reasonable.
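To put the interval change in perspective, here is a back-of-the-envelope sketch of how many poll cycles per hour each setting produces. The helper `polls_per_hour` is hypothetical, and it assumes the interval is in seconds with exactly one poll cycle per interval; how many HTTP requests each cycle sends to the controller is not covered here.

```python
def polls_per_hour(scan_interval_seconds: float, instances: int = 1) -> float:
    """Poll cycles per hour hitting one controller from `instances` HA installs."""
    if scan_interval_seconds <= 0:
        raise ValueError("scan interval must be positive")
    return instances * 3600 / scan_interval_seconds


current = polls_per_hour(5)                    # the 5 second interval
two_instances = polls_per_hour(5, instances=2) # test and production both polling
relaxed = polls_per_hour(30)                   # the suggested value of 30

print(f"5 s, one instance:  {current:.0f} polls/hour")
print(f"5 s, two instances: {two_instances:.0f} polls/hour")
print(f"30 s, one instance: {relaxed:.0f} polls/hour")
```

Going from 5 to 30 cuts the number of poll cycles by a factor of six, while running two instances at once doubles it.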

Oh, and I would definitely disable your production integration while doing this, so only one instance is hitting your OS controller at a time.

Finally, as to your Z-Wave example above: 1373 entities spread over 78 devices is an average of fewer than 18 entities per device, so you wouldn't be loading many entities on each device page.
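The comparison can be checked with a quick calculation. This is only a sketch using the figures quoted in this thread; the variable names are illustrative.

```python
# Figures quoted in this thread
zwave_entities = 1373
zwave_devices = 78
opensprinkler_entities_on_one_device = 630

# Average number of entities a single Z-Wave device page has to load
avg_per_zwave_device = zwave_entities / zwave_devices

# How much larger the single OpenSprinkler device page is by comparison
ratio = opensprinkler_entities_on_one_device / avg_per_zwave_device

print(f"average entities per Z-Wave device: {avg_per_zwave_device:.1f}")
print(f"OpenSprinkler device page is about {ratio:.0f}x larger")
```

So a larger total entity count elsewhere does not imply a comparably heavy device page.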

Let me know if you have any questions or if you're not able to do this. Thanks!

theOrakle commented 11 months ago

I'll try all of this over the weekend...

For now, I disabled new entities... and all is good.

I'm also seeing the following periodically:

2023-09-22 23:39:27.101 INFO (MainThread) [backoff] Backing off _request_http(...) for 0.1s (pyopensprinkler.OpenSprinklerConnectionError: Cannot connect to controller)
2023-09-22 23:39:32.127 INFO (MainThread) [backoff] Backing off _request_http(...) for 0.6s (pyopensprinkler.OpenSprinklerConnectionError: Cannot connect to controller)

It honestly feels like I am overrunning my OS 3... It's on a wired 1 Gb network (though the NIC only supports 10/100)

I know that my setup is much larger than anyone else's, so don't burn midnight oil on it. I originally opened this to make sure you did the math of what that did at scale, but this looks like another fine-tuning exercise for my ridiculous hobby.

theOrakle commented 11 months ago

You nailed it...

"Oh, and I would definitely disable your production integration while doing this, so only one instance is hitting your OS controller at a time."

After the update, OS 3 hardware cannot keep up with the requests from 2 hass systems.

EdLeckert commented 11 months ago

I suspect you're going to find that one instance polling every 5 seconds is too much for a system your size. But that's easily fixed.

EdLeckert commented 11 months ago

"Watched the logs after restart & it complained that a value needed to be 0-23"

See #254.

vinteo / hass-opensprinkler #248