Etcd occasionally keeps restarting

HuangQAQ commented 8 months ago

[provide a description of the issue]:

After installing Fedora CoreOS on three servers and deploying OpenShift 4.13 with all three configured as master nodes, we see occasional "etcdrequestslow" warnings. The duration of etcd requests intermittently spikes to 8-13 seconds. Our servers are directly connected to both the internal network and the internet, so we suspect that fluctuations on the internal network are making communication between the nodes unstable. Is this a plausible scenario? We also considered slow disk performance, but the disk metrics, particularly the p99 values, do not show a consistent problem. Even if the disks were occasionally slow, that would not explain why the problem appears intermittently and resolves on its own. We therefore lean towards a network-related cause.
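One caveat about reading p99 values: a percentile computed over a long period can stay low even when a short burst of very slow requests occurred. The sketch below illustrates this with synthetic latencies (not data from this cluster); the `percentile` helper and the sample values are assumptions made up for the illustration.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rounding of the rank.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


# Synthetic latencies in seconds: 1000 normal requests at 5 ms,
# then a burst of 5 requests at 10 s, then 15 more normal requests.
latencies = [0.005] * 1000 + [10.0] * 5 + [0.005] * 15

whole_period_p99 = percentile(latencies, 99)       # p99 over all 1020 samples
burst_window_p99 = percentile(latencies[-20:], 99)  # p99 over the last 20 samples
```

Here the whole-period p99 is 0.005 s, while the p99 of the 20-sample window that contains the burst is 10.0 s. A dashboard that shows a healthy p99 over hours therefore does not rule out short disk stalls; checking the same metric over short windows around the time of each warning may be more informative.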

Typically, when encountering the "etcdrequestslow" warning, a temporary solution is to restart the three servers, which restores normal operation for a period. However, after a few days, the "etcdrequestslow" issue resurfaces. When requests time out, etcd becomes unresponsive, leading to a cascading effect on the entire OpenShift environment. The etcd version in use is 3.5.9.

Version

[provide output of the openshift version or oc version command]: OpenShift version: 4.13; etcd version: 3.5.9

How should I address this intermittent issue?

HuangQAQ commented 8 months ago

@deads2k, @soltysh, @dgoodwin, @bparees

dgoodwin commented 8 months ago

Hi @HuangQAQ, apologies, but this repo is now primarily just a home for the e2e tests we run against the product, not somewhere we can offer this level of support. You will likely want to open a support case in the Customer Portal.

I can say that, in the context of this repo and the multitude of e2e jobs we run and monitor, as well as conversations with the etcd team, etcdrequestslow usually boils down to disk issues of some kind.
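For readers who want to check the disk-versus-network question themselves, here is a minimal sketch, assuming the cluster's Prometheus scrapes etcd's standard histograms. It builds PromQL p99 queries for the WAL fsync, backend commit, and peer round-trip metrics and flags values above a limit. The disk limits follow etcd's published guidance (p99 WAL fsync under 10 ms, p99 backend commit under 25 ms); the peer round-trip limit, the 5-minute window, and the helper names `p99_query` and `over_limit` are assumptions chosen for illustration.

```python
from typing import Dict, List

# Histogram metrics exposed by etcd.
ETCD_HISTOGRAMS: Dict[str, str] = {
    "wal_fsync": "etcd_disk_wal_fsync_duration_seconds_bucket",
    "backend_commit": "etcd_disk_backend_commit_duration_seconds_bucket",
    "peer_rtt": "etcd_network_peer_round_trip_time_seconds_bucket",
}

# p99 limits in seconds. The two disk limits follow etcd's guidance;
# the peer_rtt limit is an assumed value, not an official threshold.
P99_LIMITS: Dict[str, float] = {
    "wal_fsync": 0.010,
    "backend_commit": 0.025,
    "peer_rtt": 0.050,
}


def p99_query(metric: str, window: str = "5m") -> str:
    """Build a PromQL expression for the per-instance p99 of a histogram."""
    return (
        "histogram_quantile(0.99, "
        f"sum by (instance, le) (rate({metric}[{window}])))"
    )


def over_limit(p99_seconds: Dict[str, float]) -> List[str]:
    """Return the names of the checks whose p99 exceeds its limit."""
    return sorted(
        name for name, value in p99_seconds.items()
        if value > P99_LIMITS[name]
    )


queries = {name: p99_query(metric) for name, metric in ETCD_HISTOGRAMS.items()}
```

Each string in `queries` can be pasted into the Prometheus console. Passing the observed values to `over_limit`, for example `over_limit({"wal_fsync": 0.2, "backend_commit": 0.01, "peer_rtt": 0.002})`, returns `["wal_fsync"]`, which would point at the disk rather than the network.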

/close

openshift-ci[bot] commented 8 months ago

@dgoodwin: Closing this issue.

In response to [this comment](https://github.com/openshift/origin/issues/28640#issuecomment-1983374321).

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

HuangQAQ commented 8 months ago

Thanks!

