Some issues only appear after Longhorn has been running for a long time (for example, high CPU/RAM usage caused by a connection leak or by too many failed backups). @ejweber and I think we could catch these problems earlier, before the release, if we had a long-running Longhorn setup.
Test setup:
Create a long-lived cluster
Deploy Longhorn master-head
Deploy a workload with a liveness probe that detects when there is an issue with its Longhorn volume. The workload should also generate some I/O load
Deploy Rancher Monitoring to track events such as high CPU or RAM usage of Longhorn pods
We can add more items to monitor if needed. Any ideas, @longhorn/dev?
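To make the liveness probe idea more concrete, here is a minimal sketch of a check the workload could run against its volume. It writes random data under the mount path, syncs it, reads it back, and compares checksums, so each run also produces a little I/O. The names `check_volume` and `main` and the default mount path `/data` are assumptions for illustration only, not something Longhorn provides.

```python
import hashlib
import os
import tempfile


def check_volume(mount_path: str, size_bytes: int = 1024 * 1024) -> bool:
    """Write random data under mount_path, read it back, and compare checksums.

    Returns False on any I/O error or checksum mismatch.
    Note: the read may be served from the page cache, so this mainly
    detects a read-only, detached, or otherwise failing filesystem.
    """
    payload = os.urandom(size_bytes)
    expected = hashlib.sha256(payload).hexdigest()
    try:
        fd, path = tempfile.mkstemp(dir=mount_path, prefix="probe-")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # push the data down to the volume
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        finally:
            os.remove(path)  # do not let probe files accumulate
    except OSError:
        return False
    return actual == expected


def main(argv: list) -> int:
    """Return 0 if the volume looks healthy, 1 otherwise (exec probe convention)."""
    # "/data" is a hypothetical default mount path for the Longhorn volume.
    mount_path = argv[1] if len(argv) > 1 else "/data"
    return 0 if check_volume(mount_path) else 1
```

An exec-style liveness probe could run this with `sys.exit(main(sys.argv))`, so a non-zero exit code makes Kubernetes restart the container and the failure shows up as an event.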
Upgrade strategy
Perform regular Kubernetes version upgrades (maybe once per month?)
Perform OS upgrades when needed
Perform Longhorn upgrade:
Frequency: maybe update Longhorn master-head once per week to pick up newer fixes?
Engine: enable automatic engine upgrade
Instance manager: scale the workload down and back up to move the engine/replica processes to the newer instance-manager version. This removes the old instance manager version
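As a rough illustration of the cadence above, the sketch below works out which upgrades are due on a given day. The intervals (weekly for Longhorn, every 30 days for Kubernetes) are just the tentative values from this proposal, and `INTERVALS` and `due_upgrades` are hypothetical names. OS upgrades are left out because they happen only when needed.

```python
from datetime import date, timedelta

# Tentative intervals from the proposal; adjust once the cadence is agreed.
INTERVALS = {
    "longhorn": timedelta(weeks=1),
    "kubernetes": timedelta(days=30),
}


def due_upgrades(last_done: dict, today: date) -> list:
    """Return the components whose last upgrade is at least one interval old.

    A component with no recorded upgrade date is always reported as due.
    """
    due = []
    for component, interval in INTERVALS.items():
        last = last_done.get(component)
        if last is None or today - last >= interval:
            due.append(component)
    return due
```

A scheduled job on the long-running cluster could call `due_upgrades` once a day and trigger the matching upgrade pipeline for each returned component.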
What's the test to develop? Please describe