Please add the by-design impact of short disruptions of ROKS worker nodes to the troubleshooting guide

ka7harada commented 6 months ago

Request: Please add a disclaimer about short disruptions of ROKS to the docs.

Urgency: As soon as possible. Customers have raised many support cases in the last 2 months asking for the cause of brief disruptions on worker nodes.

Reason: 1. The impact on worker nodes that customers must accept as by-design behavior is not described in the docs, yet support cases ask customers to accept it.

VPC/VSI has a written disclaimer in its docs for a similar situation, under "Possible impacts to virtual server instances during maintenance operations":

"When nondisruptive live migration occurs, the virtual server experiences a brief pause of around 10 seconds, and in some cases up to 30 seconds. You are not notified in advance of nondisruptive migration. The virtual server instance is not restarted as part of this process."

The guidance is inconsistent among ACS, the docs, and each service. To resolve this, the fix needs to be made soon. If needed, please confirm the wording in the "Where to change" > "Expected" section with the IKS dev team.

Where to change

Location: IBM Cloud Docs > Red Hat OpenShift on IBM Cloud > Troubleshooting worker nodes in Critical or NotReady state

Current

Expected: Update the note in the “Important” section as below.

Important: Check the IBM Cloud health and status dashboard for any notifications or maintenance updates that might be relevant to your worker nodes. These notifications or updates might help determine the cause of the worker node failures.

When nondisruptive maintenance occurs, a worker node experiences a brief pause of up to around 60 seconds, in the same way as virtual servers. To mitigate the impact, you can increase high availability by distributing your app setup across multiple worker nodes and clusters.

ka7harada commented 6 months ago

@derekpoindexter For the record, let me also post the case numbers here.

CS3920827 is the specific case of a worker node NotReady issue caused by maintenance in the nondisruptive category.

CS3918206 is the case in which the customer faced an unexpectedly long worker node NotReady issue caused by live migration, which also falls under nondisruptive maintenance performed without notification.

These 2 cases were raised by a customer who faced worker node NotReady issues more than 10 times in April and May 2024.

kKronstainBrown commented 6 months ago

No changes are going to be made to the documentation at this time. An issue has been opened with development to investigate the reasons behind the behavior.

ka7harada commented 5 months ago

@kKronstainBrown Both cases are closed. For CS3920827 in particular, can you work with Dev again to add the disclaimer to the IBM Cloud docs? IKS users frequently hit worker node (WN) NotReady issues during master upgrade maintenance in the nondisruptive category.

kKronstainBrown commented 4 months ago

Responded in Slack that no changes are going to be made to the documentation about this issue at this time. Development also reported no update.

ibm-cloud-docs / containers

Please add the by-design impact of short disruptions of ROKS worker nodes to the troubleshooting guide #2682