Closed HuangQAQ closed 8 months ago
@deads2k,@soltysh,@dgoodwin,@bparees
Hi @HuangQAQ , apologies but this repo is primarily now just a home for e2e tests we run against the product and not somewhere we can offer this level of support. You likely will want to open a support case in the Customer Portal.
I can say that in the context of this repo and the multitude of e2e jobs we run and monitor, as well as conversations with the etcd team, etcdrequestslow usually boils down to disk issues of some kind.
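For anyone who wants to check the disk side themselves: etcd exposes latency histograms such as `etcd_disk_wal_fsync_duration_seconds`, and the etcd documentation suggests the 99th percentile of WAL fsync should stay under 10 ms. Below is a minimal sketch of how a p99 can be estimated from cumulative histogram buckets, similar in spirit to Prometheus `histogram_quantile`. The bucket counts are invented for illustration and are not taken from this cluster, and the function is a simplified approximation rather than the exact Prometheus implementation.

```python
import math

# Commonly cited etcd guidance: p99 WAL fsync latency should stay under 10 ms.
WAL_FSYNC_P99_LIMIT_SECONDS = 0.010

# Hypothetical cumulative buckets: (upper bound in seconds, samples <= bound).
# The counts are made up for this example.
SAMPLE_BUCKETS = [
    (0.001, 500), (0.002, 800), (0.004, 950), (0.008, 985),
    (0.016, 995), (0.032, 1000), (math.inf, 1000),
]


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative histogram buckets.

    Finds the first bucket whose cumulative count reaches the target rank
    and interpolates linearly inside that bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper edge to interpolate towards.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


p99 = histogram_quantile(0.99, SAMPLE_BUCKETS)
print(f"estimated p99 fsync: {p99 * 1000:.1f} ms, "
      f"over limit: {p99 > WAL_FSYNC_P99_LIMIT_SECONDS}")
```

With real data, the bucket counts would come from the cluster's monitoring stack instead of the hard-coded sample.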
/close
@dgoodwin: Closing this issue.
Thanks!
Sent from a mobile device
------------------ Original message ------------------ From: "openshift/origin" @.>; Sent: Thursday, March 7, 2024, 8:07 PM @.>; @.**@.>; Subject: Re: [openshift/origin] Etcd occasionally keeps restarting (Issue #28640)
[provide a description of the issue]:
After deploying Fedora CoreOS on three servers and setting up OpenShift 4.13 with all three configured as master nodes, we noticed occasional "etcdrequestslow" warnings. The duration of etcd requests intermittently spikes to 8-13 seconds. Since our servers are directly connected to both the internal network and the internet, we suspect that fluctuations in the internal network might be making the servers' internal connectivity unstable. Is this a plausible scenario? We also considered slow disk performance, but the disk metrics, particularly the p99 indicator, do not show a consistent problem. Even if the disks occasionally slowed down, that would not seem to explain why the problem is intermittent or why it resolves on its own. We therefore lean towards suspecting a network-related issue.
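A hedged aside on the p99 reasoning above: a percentile computed over a long window can look healthy even when short stalls occur, because a brief burst of slow operations may account for well under 1% of all samples. The sketch below uses synthetic latencies (not measurements from this cluster) to compare an overall p99 with a per-window maximum; the helper names `percentile` and `windowed_max` are made up for this example.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least a
    fraction q of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def windowed_max(values, window):
    """Maximum of each consecutive, non-overlapping window of samples."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]


# Synthetic data: 10,000 operations at 5 ms, with one burst of 50
# operations that each take 10 s (0.5% of all samples).
latencies = [0.005] * 6000 + [10.0] * 50 + [0.005] * 3950

overall_p99 = percentile(latencies, 0.99)
per_window_max = windowed_max(latencies, 1000)

print(f"overall p99: {overall_p99 * 1000:.0f} ms")
print(f"worst window max: {max(per_window_max):.0f} s")
```

In this made-up series the overall p99 stays at the baseline while one window contains multi-second stalls, so looking at maxima or at short time windows around each incident may be more revealing than a single long-range p99.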
Typically, when encountering the "etcdrequestslow" warning, a temporary solution is to restart the three servers, which restores normal operation for a period. However, after a few days, the "etcdrequestslow" issue resurfaces. When requests time out, etcd becomes unresponsive, leading to a cascading effect on the entire OpenShift environment. The etcd version in use is 3.5.9.
Version
[provide output of the openshift version or oc version command]:
openshift version: 4.13
etcd version: 3.5.9
How should I address this intermittent issue?