longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[TEST] Test long-running Longhorn installation #6367

Open PhanLe1010 opened 1 year ago

PhanLe1010 commented 1 year ago

What's the test to develop? Please describe

Some issues only appear after Longhorn has been running for a long time, for example high CPU/RAM usage caused by a connection leak, or too many failed backups. @ejweber and I think we could catch these problems before a release if we had a long-running Longhorn setup.

Test setup:

  1. Create a long-lived cluster
  2. Deploy Longhorn master-head
  3. Deploy a workload with a liveness probe that detects when there is an issue with its Longhorn volume. The workload should also generate some I/O load
  4. Deploy Rancher Monitoring to track events such as high CPU or RAM usage in Longhorn pods
  5. Also monitor Longhorn metrics to track the number of backup errors: https://longhorn.io/docs/1.5.1/monitoring/metrics/
  6. We can add more topics to monitor if needed. Any ideas, @longhorn/dev?
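
To make steps 3 and 5 concrete, here is a minimal Python sketch of the two checks, not an agreed implementation. `volume_io_ok` is a hypothetical helper a liveness probe could call against the volume's mount path, and `count_backup_errors` is a hypothetical helper that scans Prometheus text output for errored backups. The metric name `longhorn_backup_state` and the error value `4` are assumptions that should be verified against the metrics page linked above.

```python
import hashlib
import os

# Assumptions: metric name and error value; verify both against the Longhorn metrics docs.
BACKUP_STATE_METRIC = "longhorn_backup_state"
BACKUP_ERROR_VALUE = 4.0


def volume_io_ok(mount_dir: str, size: int = 1 << 20) -> bool:
    """Write random data under mount_dir, fsync it, read it back, and compare digests.

    The read may be served from the page cache, so this mainly detects
    read-only, detached, or hung volumes rather than silent corruption.
    """
    payload = os.urandom(size)
    path = os.path.join(mount_dir, ".longhorn-io-probe")  # hypothetical file name
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            data = f.read()
        os.remove(path)
    except OSError:
        return False
    return hashlib.sha256(data).digest() == hashlib.sha256(payload).digest()


def count_backup_errors(metrics_text: str) -> int:
    """Count samples of BACKUP_STATE_METRIC whose value equals BACKUP_ERROR_VALUE.

    Assumes Prometheus text format lines without a trailing timestamp.
    """
    errors = 0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] != BACKUP_STATE_METRIC:
            continue
        try:
            if float(value) == BACKUP_ERROR_VALUE:
                errors += 1
        except ValueError:
            continue  # ignore lines whose value is not numeric
    return errors
```

A probe script could exit non-zero when `volume_io_ok("/data")` returns `False` (the mount path is illustrative), and an alert could fire when `count_backup_errors` grows between scrapes.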

Upgrade strategy:

  1. Perform regular Kubernetes version upgrades (maybe once per month?)
  2. Perform OS upgrades when needed
  3. Perform Longhorn upgrade:
    1. Frequency: update to the latest Longhorn master-head to pick up newer fixes, maybe once per week?
    2. Engine: enable auto-engine upgrade
    3. Instance manager: scale the workload down and back up to move engine/replica processes to the newer instance-manager version. This will remove the old instance-manager version
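
The cadences above are still open questions, but a small scheduling check could look like the following Python sketch. `CADENCE` and `due_upgrades` are hypothetical names, and the 30-day and 7-day intervals simply encode the tentative "once per month" and "once per week" suggestions rather than settled values.

```python
from datetime import date, timedelta

# Assumed intervals, taken from the tentative cadences proposed in this issue.
CADENCE = {
    "kubernetes": timedelta(days=30),  # "maybe once per month?"
    "longhorn": timedelta(days=7),     # "maybe once per week?"
}


def due_upgrades(last_done: dict[str, date], today: date) -> list[str]:
    """Return the components whose last upgrade is at least one interval old.

    Components with no recorded upgrade are always reported as due.
    """
    return sorted(
        name
        for name, interval in CADENCE.items()
        if name not in last_done or today - last_done[name] >= interval
    )
```

For example, `due_upgrades({"kubernetes": date(2023, 7, 1), "longhorn": date(2023, 7, 20)}, date(2023, 8, 1))` reports both components as due. OS upgrades are left out because they happen "when needed" rather than on a fixed interval.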

innobead commented 1 year ago

@khushboo-rancher Please work with @yangchiu.