lsst-uk / somerville-operations

User issue reporting and tracking for the Somerville Cloud

magnum etcd error #174

Closed GregBlow closed 6 days ago

GregBlow commented 3 months ago

Amanda Ibsen 11 days ago Heya, I'm trying to create a cluster, but it fails with the following error: status_reason | Failed to get discovery url from 'https://discovery.etcd.io/new?size=1'. Does anybody know what the issue is? It was working fine for me earlier today.

Greg Blow 6 days ago Are you still seeing problems? (Was it transient?)

Amanda Ibsen 4 days ago Hi, sorry, forgot to reply to this because I was able to use a cluster that already existed at the time I had the issue. I've just tried to create another one to test whether it was a transient issue, but I get the same error :disappointed:

Amanda Ibsen 4 days ago so, after doing some investigating, apparently kubernetes uses etcd to generate discovery urls

Amanda Ibsen 4 days ago the setup of etcd is handled by magnum

Amanda Ibsen 4 days ago and apparently the default setup is to use https://discovery.etcd.io/

Amanda Ibsen 4 days ago that service is up, but if I do curl -I https://discovery.etcd.io I get a 301

Amanda Ibsen 4 days ago If I do curl -L -I https://discovery.etcd.io, I get 301 and then 200

Amanda Ibsen 4 days ago and it redirects to https://etcd.io/docs/v3.5/dev-internal/discovery_protocol

Amanda Ibsen 4 days ago so my guess (but I don't know anything about this) is that discovery etcd was moved to that other url and for whatever reason, the redirection doesn't happen at the time of creation of a cluster

Amanda Ibsen 4 days ago I don't know if this can be fixed by updating the magnum configuration (which I don't think I can do) or if it can be specified in the heat template (which I could try)

Amanda Ibsen 4 days ago This is not urgent, as the production RSP is up and running, but it will become urgent if, for whatever reason (I may be the reason), it goes down and another cluster needs to be created to redeploy.
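For reference, one user-side workaround along the lines Amanda suggests above would be to pre-generate the discovery URL on a host with working outbound HTTPS and pass it to Magnum explicitly, rather than relying on Magnum to fetch one. This is only a sketch: it assumes the installed python-magnumclient exposes the --discovery-url option, and it reuses the stv-template name from the script further down; cluster name and counts are placeholders.

#!/bin/bash
# Sketch only: fetch a discovery URL from a host that can reach discovery.etcd.io,
# then hand it to Magnum so the cluster does not have to fetch one itself.
# Template/cluster names and counts are illustrative placeholders.

size=1   # should match --master-count
discovery_url=$(curl --silent -H "Accept: text/plain" "https://discovery.etcd.io/new?size=${size}")
echo "Using discovery URL: ${discovery_url}"

openstack coe cluster create --cluster-template stv-template \
                             --master-count "${size}" \
                             --node-count 3 \
                             --discovery-url "${discovery_url}" \
                             test-cluster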

GregBlow commented 3 months ago

Greg Blow 9 minutes ago hmm, thanks for reporting. I don't have an answer as of right now, will need to investigate.

Greg Blow 5 minutes ago google has a single hit for that URL ~9 years ago https://serverfault.com/questions/723620/etcd2-fails-to-start-on-my-coreos-node


Greg Blow 3 minutes ago behaviour using curl is still as reported in that thread for me though ( curl --silent -H "Accept: text/plain" https://discovery.etcd.io/new?size=1 returns a url with a uid)

Greg Blow 1 minute ago since that url seems to be working (plugging it into a web browser doesn't work, but I don't think it is meant to) I'm inclined to believe this is most likely network related.

Greg Blow 1 minute ago Is your seed host (where the cluster is being launched from) your PC? What happens if you curl --silent -H "Accept: text/plain" https://discovery.etcd.io/new?size=1 from there?
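One additional check that might help localise the failure (a sketch, not from the thread; the instance address and login user are placeholders): run the same request from the workstation and from an instance inside the Somerville project, since the master nodes fetch the discovery URL from the project network, not from the workstation.

# From the workstation / seed host
curl --silent -H "Accept: text/plain" "https://discovery.etcd.io/new?size=1"

# From an instance inside the project network (placeholder host and user),
# to confirm outbound HTTPS works where the master nodes actually run
ssh ubuntu@test-vm-floating-ip \
  'curl --silent --max-time 10 -H "Accept: text/plain" "https://discovery.etcd.io/new?size=1" || echo "outbound HTTPS failed"'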

GregBlow commented 3 months ago

problem may be related to RSP floating IP count. Project was at capacity.
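Since floating IP capacity was the suspected factor here, usage can be compared against quota with the standard OpenStack CLI; a sketch, with the project name as a placeholder:

# Quota for the project, including floating IPs (project name is a placeholder)
openstack quota show somerville-project | grep -i floating

# Floating IPs currently allocated to the project
openstack floating ip list --project somerville-project -f value -c ID | wc -l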

GregBlow commented 3 months ago

no improvement seen

GregBlow commented 2 months ago

Have been able to deploy a working cluster using:

#!/bin/bash

# Define the cluster name as a parameter
cluster_name="$1"

# Define the rest of the parameters
cluster_template="stv-template"
master_count=1
node_count=3
docker_volume_size=200
labels="admission_control_list=\"NodeRestriction,NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,PersistentVolumeClaimResize,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,RuntimeClass\""
merge_labels=true   # informational only; the --merge-labels flag is passed explicitly below
keypair="greg-perfsonar"

# Create the cluster using the parameters
openstack coe cluster create --cluster-template "$cluster_template" \
                             --master-count "$master_count" \
                             --node-count "$node_count" \
                             --docker-volume-size "$docker_volume_size" \
                             --labels "$labels" \
                             --merge-labels \
                             --keypair "$keypair" \
                             "$cluster_name"

Need to determine whether the original context still sees the problem and, if so, what the determining factor is.

GregBlow commented 2 months ago

It appears that at some point the etcd issue was mitigated, at which point deploying a Heat-based cluster (https://somerville.ed.ac.uk/project/stacks/stack/8ce826ea-cf31-47e7-b798-b6a131c11d65/) would complete successfully. However, since then new attempts to create Heat-based clusters fail at creation of the ResourceGroup kube_minions.

[attached screenshot]
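To dig out the underlying error behind a failed nested ResourceGroup like kube_minions, the Heat CLI can walk the stack tree; a sketch using the parent stack name that appears in the log below:

# Summarise every failure in the stack tree, including nested stacks
openstack stack failures list --long gb-test-2-da5477gujtt4

# Or list resources recursively and pick out the failed ones
openstack stack resource list -n 5 gb-test-2-da5477gujtt4 | grep -i failed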

Following appears in heat engine log:

2024-07-19 12:35:50.520 16 INFO heat.engine.resource [None req-bb6a1093-8bc1-4812-a40b-6a9788cbd2da - - - - - -] CREATE: ResourceGroup "kube_minions" [9e837132-f39e-408e-8e54-58f864c62bfa] Stack "gb-test-2-da5477gujtt4" [27abac1d-9604-4653-be36-915f8a811fc1]
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource Traceback (most recent call last):
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 922, in _action_recorder
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     yield
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 1034, in _do_action
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     yield from self.action_handler_task(action, args=handler_args)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 984, in action_handler_task
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     done = check(handler_data)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 429, in check_create_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     return super(ResourceGroup, self).check_create_complete()
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/stack_resource.py", line 408, in check_create_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     return self._check_status_complete(self.CREATE)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource   File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/stack_resource.py", line 450, in _check_status_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource     raise exception.ResourceFailure(status_reason, self,
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource heat.common.exception.ResourceFailure: Error: resources.kube_minions.resources[1].resources.node_config_deployment: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource
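The traceback only records that node_config_deployment exited with a non-zero status; the script's captured output can usually be retrieved from Heat's software deployment records. A sketch; the deployment ID is a placeholder taken from the list output:

# List software deployments and note the ID of the failed node_config_deployment
openstack software deployment list

# Show captured stdout/stderr and the exit status for that run (placeholder ID)
openstack software deployment output show <deployment-id> --all --long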
GregBlow commented 6 days ago

Mitigated by the move to CAPI Magnum. Will not fix.