Closed GregBlow closed 1 month ago
Greg Blow 9 minutes ago hmm, thanks for reporting. I don't have an answer as of right now, will need to investigate.
Greg Blow 5 minutes ago Google has a single hit for that URL, from ~9 years ago: https://serverfault.com/questions/723620/etcd2-fails-to-start-on-my-coreos-node
Server Fault: etcd2 fails to start on my CoreOS node. I am trying to start etcd2 in my CoreOS node. I have this in my cloud-config: coreos: etcd2: discovery: https://discovery.etcd.io/new?size=1 advertise-client-urls: http://127.0.0.1:237/...
Greg Blow 3 minutes ago Behaviour using curl is still as reported in that thread for me, though (curl --silent -H "Accept: text/plain" https://discovery.etcd.io/new?size=1 returns a URL with a UID).
Greg Blow 1 minute ago Since that URL seems to be working (plugging it into a web browser doesn't work, but I don't think it is meant to), I'm inclined to believe this is most likely network related.
Greg Blow 1 minute ago Is your seed host (where the cluster is being launched from) your PC? What happens if you curl --silent -H "Accept: text/plain" https://discovery.etcd.io/new?size=1 from there?
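A minimal sketch of the check suggested above, intended to be run on the seed host. It assumes the service replies with a single URL of the form https://discovery.etcd.io/<hex token>. That token format, the 10-second timeout and both function names are assumptions for illustration and are not taken from the thread or from etcd documentation.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the argument looks like a discovery URL
# carrying a token. The hex-token pattern is an assumption about the reply.
is_discovery_url() {
    printf '%s\n' "$1" | grep -Eq '^https://discovery\.etcd\.io/[0-9a-f]+$'
}

# Hypothetical helper: request a new single-node discovery URL (the same curl
# call as in the thread, with a timeout added) and validate the reply.
check_discovery() {
    local reply
    if ! reply="$(curl --silent --max-time 10 -H "Accept: text/plain" \
            "https://discovery.etcd.io/new?size=1")"; then
        echo "request failed (network, DNS or proxy problem on this host?)" >&2
        return 1
    fi
    if is_discovery_url "$reply"; then
        echo "ok: $reply"
    else
        echo "unexpected reply: $reply" >&2
        return 1
    fi
}
```

Sourcing the file and running check_discovery on both the seed host and a machine where the curl call is known to work should show whether the two behave differently.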
Problem may be related to the RSP floating IP count: the project was at capacity.
No improvement seen.
Have been able to deploy a working cluster using:
#!/bin/bash
# Define the cluster name as a parameter; abort with a usage message if it is missing
cluster_name="${1:?usage: $0 <cluster-name>}"
# Define the rest of the parameters
cluster_template="stv-template"
master_count=1
node_count=3
docker_volume_size=200
labels="admission_control_list=\"NodeRestriction,NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,PersistentVolumeClaimResize,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,RuntimeClass\""
# Labels are merged on top of the template's labels (--merge-labels is always passed below)
keypair="greg-perfsonar"
# Create the cluster using the parameters
openstack coe cluster create --cluster-template "$cluster_template" \
--master-count "$master_count" \
--node-count "$node_count" \
--docker-volume-size "$docker_volume_size" \
--labels "$labels" \
--merge-labels \
--keypair "$keypair" \
"$cluster_name"
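Whether a cluster created this way actually came up can be checked by polling its status. The following is a sketch only: it assumes the usual Magnum status strings (CREATE_IN_PROGRESS, CREATE_COMPLETE, CREATE_FAILED) and relies on the openstack client's standard -f value -c status output options. The function names and the 30-second interval are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: map a Magnum cluster status string to a verdict.
classify_status() {
    case "$1" in
        *_COMPLETE)    echo "complete" ;;
        *_FAILED)      echo "failed" ;;
        *_IN_PROGRESS) echo "pending" ;;
        *)             echo "unknown" ;;
    esac
}

# Hypothetical helper: poll until the cluster leaves the in-progress state.
# Returns 0 on completion, 1 on failure, 2 if the status cannot be interpreted.
wait_for_cluster() {
    local name="$1" state
    while true; do
        state="$(openstack coe cluster show "$name" -f value -c status)" || return 2
        case "$(classify_status "$state")" in
            complete) echo "$name: $state"; return 0 ;;
            failed)   echo "$name: $state" >&2; return 1 ;;
            pending)  sleep 30 ;;
            *)        echo "$name: unexpected status '$state'" >&2; return 2 ;;
        esac
    done
}
```

After the create command returns, wait_for_cluster "$cluster_name" would block until the cluster either completes or fails.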
Need to determine whether the original context still sees problems and, if so, what the determining factor is.
It appears that at some point the etcd issue was mitigated, after which deploying a Heat-based cluster (https://somerville.ed.ac.uk/project/stacks/stack/8ce826ea-cf31-47e7-b798-b6a131c11d65/) would complete successfully. However, since then, new attempts to create Heat-based clusters fail at creation of the ResourceGroup kube_minions.
The following appears in the Heat engine log:
2024-07-19 12:35:50.520 16 INFO heat.engine.resource [None req-bb6a1093-8bc1-4812-a40b-6a9788cbd2da - - - - - -] CREATE: ResourceGroup "kube_minions" [9e837132-f39e-408e-8e54-58f864c62bfa] Stack "gb-test-2-da5477gujtt4" [27abac1d-9604-4653-be36-915f8a811fc1]
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource Traceback (most recent call last):
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 922, in _action_recorder
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource yield
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 1034, in _do_action
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource yield from self.action_handler_task(action, args=handler_args)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resource.py", line 984, in action_handler_task
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource done = check(handler_data)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 429, in check_create_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource return super(ResourceGroup, self).check_create_complete()
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/stack_resource.py", line 408, in check_create_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource return self._check_status_complete(self.CREATE)
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource File "/var/lib/kolla/venv/lib64/python3.9/site-packages/heat/engine/resources/stack_resource.py", line 450, in _check_status_complete
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource raise exception.ResourceFailure(status_reason, self,
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource heat.common.exception.ResourceFailure: Error: resources.kube_minions.resources[1].resources.node_config_deployment: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2024-07-19 12:35:50.520 16 ERROR heat.engine.resource
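The line that matters in the traceback above is the final ResourceFailure, which names the nested resource that failed. Below is a sketch for pulling that path out of a Heat engine log. The function name is hypothetical, and the pattern only assumes the "ResourceFailure: Error: <path>: <reason>" shape seen in this log.

```shell
#!/bin/bash
# Hypothetical helper: read Heat engine log lines on stdin and print the
# resource path from every ResourceFailure line. Other lines are ignored.
failed_resources() {
    sed -n 's/.*ResourceFailure: Error: \(resources\.[^:]*\): .*/\1/p'
}
```

Fed the log above on stdin, this prints resources.kube_minions.resources[1].resources.node_config_deployment. If the Heat client plugin is installed, openstack stack failures list --long gb-test-2-da5477gujtt4 should report the same failure together with the deployment's own output, which is where the reason for the non-zero exit code is most likely to be found.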
Mitigated by the move to CAPI Magnum. Will not fix.