Closed · @schwesig closed this 6 days ago
@jtriley @larsks, is there any news about this?
@schwesig I've identified 3x nodes to pull in as worker nodes for the infra cluster. I've updated the discovery ISO on nerc-bootstrap for nerc-infra and updated DHCP. At this point I'm waiting on the request to networking to flip the ports to the infra VLANs, at which point I can add those hosts to the cluster. I'll update this issue once that happens.
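For reference, once the new hosts boot the discovery ISO and join the cluster, any pending kubelet CSRs may need manual approval before the nodes go Ready. A minimal sketch; the exact flow depends on how the assisted-installer setup handles day-2 hosts:
$ oc get csr | grep Pending
$ oc adm certificate approve <csr-name>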
Thank you, @jtriley 🥰
@schwesig I've added 3x worker nodes to the infra cluster and also removed the worker role from the control plane hosts using:
$ oc --as=system:admin patch scheduler cluster --type merge -p '{"spec":{"mastersSchedulable":false}}'
(see https://access.redhat.com/solutions/4564851)
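A quick way to verify the patch took effect (a sketch; the Taints line below is the expected result, not captured output):
$ oc get scheduler cluster -o jsonpath='{.spec.mastersSchedulable}'
false
$ oc describe node ctl-0 | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule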
This is the current state of the cluster nodes now:
$ oc get nodes
NAME    STATUS   ROLES    AGE     VERSION
ctl-0   Ready    master   511d    v1.26.7+c7ee51f
ctl-1   Ready    master   511d    v1.26.7+c7ee51f
ctl-2   Ready    master   511d    v1.26.7+c7ee51f
wrk-0   Ready    worker   2m41s   v1.26.7+c7ee51f
wrk-1   Ready    worker   2m40s   v1.26.7+c7ee51f
wrk-2   Ready    worker   2m40s   v1.26.7+c7ee51f
As a precaution, I did a rolling reboot of the 3x control plane nodes in case any non-control-plane workloads were still on those hosts from before running the aforementioned patch command.
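For the record, a typical per-node rolling-reboot sequence looks like the sketch below. This may not be the exact procedure used here; the --delete-emptydir-data flag and ssh access via the core user are assumptions:
$ oc adm cordon ctl-0
$ oc adm drain ctl-0 --ignore-daemonsets --delete-emptydir-data
$ ssh core@ctl-0 sudo systemctl reboot
# wait until the node reports Ready again, then:
$ oc adm uncordon ctl-0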
Just a note: the monitoring operator is currently unhappy because we're missing a required udev rule to properly configure networking on these new hosts, which is preventing access to NESE storage. See https://github.com/OCP-on-NERC/nerc-ocp-config/pull/453, which should resolve this issue.
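For context, udev rules are usually shipped to OpenShift hosts via a MachineConfig. The sketch below shows only the general pattern: the name, path, and rule contents are hypothetical, and the real change lives in the linked nerc-ocp-config PR (applied through GitOps rather than a direct oc apply). Note that the Machine Config Operator will roll-reboot the affected pool when such a change lands.
$ cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-nese-udev                # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/udev/rules.d/99-nese-net.rules   # hypothetical path
          mode: 420                                   # 0644 in decimal, as Ignition expects
          contents:
            # actual rule contents are in the PR linked above
            source: data:text/plain;charset=utf-8;base64,...
EOF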
Currently in the monitoring phase: all needed nodes are available, and memcached is working fine.
observability-thanos-store-shard-0-1 and observability-thanos-store-shard-0-2 are healthy.
observability-thanos-store-shard-0-0 produces:
level=info ts=2024-06-24T13:46:37.338578312Z caller=fetcher.go:478 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=893.994626ms duration_ms=893 cached=2128 returned=694 partial=0
level=warn ts=2024-06-24T13:46:44.853880072Z caller=bucket.go:637 msg="loading block failed" elapsed=7.514862006s id=01HQ6GM... err="create index header reader: write index header: new index reader: get TOC from object storage of 01HQ6GM.../index: Get \"https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index\": Connection closed by foreign host https://s3.openshift-storage.svc/observability-a6581571-.../01HQ6GM.../index. Retry again."
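In case it helps with the follow-up: the error suggests the store pod is being cut off from the ODF S3 endpoint mid-request. A few commands to narrow it down, assuming the default RHACM namespace open-cluster-management-observability; deleting the pod just forces the StatefulSet to recreate it and retry the block load:
$ oc -n open-cluster-management-observability get pods | grep thanos-store
$ oc -n open-cluster-management-observability logs observability-thanos-store-shard-0-0 --tail=50 | grep 'loading block failed'
$ oc -n open-cluster-management-observability delete pod observability-thanos-store-shard-0-0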
Thanks, @jtriley, for taking care of the central part. I am removing you from the assignment; this is now on my side.
Closing this issue and moving the observation part to https://github.com/nerc-project/operations/issues/618.
Solution for Timeouts Between Observability and Loki Pods (#473) by Adding 3 Nodes
Following the discussion in today's "NERC: HU/BU Weekly Team Meeting," we agreed to add 3 more nodes to the infra cluster to address the timeouts between Observability and Loki pods in issue #473.
Other solutions, like upgrading OpenShift, are riskier or not feasible on an urgent timeline.
To close this issue, the remaining work will be moved to its own follow-up issue: #618.
Assignees: @jtriley @larsks
CC: @schwesig @computate
Support Case: https://access.redhat.com/support/cases/#/case/03764352/discussion?commentId=a0a6R00000WI5X4QAL
First things first: the configuration you are using is not recommended for a production environment. For HA, it is always recommended to have separate master nodes.