nerc-project / operations

Issues related to the operation of the NERC OpenShift environment


Follow Up #473 Timeouts between pods: Adding 3 Nodes to the infra-cluster to follow RH support recommendation #596

Closed schwesig closed 6 days ago

schwesig commented 1 month ago

Solution for Timeouts Between Observability and Loki Pods (#473) by Adding 3 Nodes

Following the discussion in today's "NERC: HU/BU Weekly Team Meeting," we agreed to add 3 more nodes to the infra cluster to work on the timeouts between Observability and Loki pods in issue #473.

Other solutions, such as updating OpenShift, are riskier or not feasible on an urgent timeline.

[x] Add and configure 3 new nodes in the infra cluster.
[x] Update cluster settings to use new nodes.

To close this issue, the remaining items will be moved to their own follow-up issue.

Assignees: @jtriley @larsks

CC: @schwesig @computate

Support Case: https://access.redhat.com/support/cases/#/case/03764352/discussion?commentId=a0a6R00000WI5X4QAL

Quote from the support case: "First thing first the configuration that you are using is not recommended for the production environment. For HA it is always recommended to have a separate master nodes"

schwesig commented 1 month ago

https://access.redhat.com/support/cases/#/case/03764352/discussion?commentId=a0a6R00000WI5X4QAL

Quote from the support case: "First thing first the configuration that you are using is not recommended for the production environment. For HA it is always recommended to have a separate master nodes"

schwesig commented 2 weeks ago

@jtriley @larsks, is there any news about this?

jtriley commented 2 weeks ago

@schwesig I've identified 3 nodes to repurpose as worker nodes for the infra cluster. I've updated the discovery ISO on nerc-bootstrap for nerc-infra and updated DHCP. I'm now waiting on the request to networking to flip the ports to the infra VLANs; once that is done I can add those hosts to the cluster. I'll update this issue when that happens.

schwesig commented 2 weeks ago

Thank you, @jtriley 🥰

jtriley commented 2 weeks ago

@schwesig I've added 3x worker nodes to the infra cluster and also removed the worker role from the control plane hosts using:

$ oc --as=system:admin patch scheduler cluster --type merge -p '{"spec":{"mastersSchedulable":false}}'

(see https://access.redhat.com/solutions/4564851)
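To confirm that the patch took effect, the same field can be read back from the scheduler resource. This is a minimal sketch, not something that was run in this thread; the function name `check_masters_unschedulable` is made up for illustration, and it assumes a logged-in `oc` with permission to read the cluster scheduler config.

```shell
# Read spec.mastersSchedulable back from the cluster scheduler config.
# An empty result (field unset) is treated the same as "false".
check_masters_unschedulable() {
  state="$(oc get scheduler cluster -o jsonpath='{.spec.mastersSchedulable}')" || return 2
  case "$state" in
    true) echo "control plane nodes are still schedulable"; return 1 ;;
    *)    echo "control plane nodes are not schedulable"; return 0 ;;
  esac
}
```

Calling `check_masters_unschedulable` returns 0 when the control plane is no longer schedulable, 1 when it still is, and 2 if the `oc` call itself fails.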

This is the current state of the cluster nodes now:

$ oc get nodes
NAME    STATUS   ROLES    AGE     VERSION
ctl-0   Ready    master   511d    v1.26.7+c7ee51f
ctl-1   Ready    master   511d    v1.26.7+c7ee51f
ctl-2   Ready    master   511d    v1.26.7+c7ee51f
wrk-0   Ready    worker   2m41s   v1.26.7+c7ee51f
wrk-1   Ready    worker   2m40s   v1.26.7+c7ee51f
wrk-2   Ready    worker   2m40s   v1.26.7+c7ee51f
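
For repeated checks, the node table above can be condensed into a per-role count plus a list of nodes that are not `Ready`. A rough sketch, assuming the default `oc get nodes` column order shown above (NAME, STATUS, ROLES, ...); `summarize_nodes` is a hypothetical helper name, not an existing tool.

```shell
# Summarize `oc get nodes` output read from stdin: count nodes per role
# and list every node whose STATUS is not exactly "Ready"
# (so a cordoned "Ready,SchedulingDisabled" node is listed too).
summarize_nodes() {
  awk 'NR > 1 && NF >= 3 {
         roles[$3]++
         if ($2 != "Ready") notready = notready " " $1
       }
       END {
         for (r in roles) printf "%s=%d\n", r, roles[r]
         if (notready != "") printf "not_ready=%s\n", substr(notready, 2)
       }' | sort
}

# Usage against a live cluster (requires a logged-in oc):
#   oc get nodes | summarize_nodes
```

The `not_ready` line is only printed when at least one node is in a state other than `Ready`.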

jtriley commented 2 weeks ago

For good measure, I did a rolling reboot of the 3 control plane nodes, in case non-control-plane workloads were still running on those hosts from before the patch command above.

jtriley commented 2 weeks ago

Note: the monitoring operator is currently unhappy because the new hosts are missing a required udev rule that configures their networking, which prevents access to NESE storage. https://github.com/OCP-on-NERC/nerc-ocp-config/pull/453 should resolve this.

schwesig commented 6 days ago

Currently in the monitoring state. All needed nodes are available, and memcache currently works fine.

observability-thanos-store-shard-0-1 and observability-thanos-store-shard-0-2 are good

observability-thanos-store-shard-0-0 logs the following:

level=info ts=2024-06-24T13:46:37.338578312Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=893.994626ms duration_ms=893 cached=2128 returned=694 partial=0

level=warn ts=2024-06-24T13:46:44.853880072Z caller=bucket.go:637 msg="loading block failed" elapsed=7.514862006s id=01HQ6GM... err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GM.../index: Get \"https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index. Retry again."
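
While monitoring, it may help to count these warnings and see which blocks they affect. A minimal sketch that filters Thanos store logs from stdin; both function names are made up for illustration, and the `sed` pattern assumes the field order seen in the warning above (`msg=... elapsed=... id=...`).

```shell
# Count "loading block failed" warnings in logs read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_block_load_failures() {
  grep -c 'level=warn.*msg="loading block failed"' || true
}

# Print the distinct block IDs mentioned in those warnings.
failed_block_ids() {
  sed -n 's/.*msg="loading block failed" elapsed=[^ ]* id=\([^ ]*\).*/\1/p' | sort -u
}

# Usage against a live pod (NAMESPACE is a placeholder, not taken from this issue):
#   oc logs observability-thanos-store-shard-0-0 -n "$NAMESPACE" | count_block_load_failures
```

A count that stays flat over time would suggest the failure was transient; a growing count or a stable set of block IDs points at specific blocks that keep failing to load.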

schwesig commented 6 days ago

Thanks, @jtriley, for taking care of the central part. I am removing you from the assignment; this is now on my side.

schwesig commented 6 days ago

Closing this issue and moving the monitoring part to https://github.com/nerc-project/operations/issues/618