hetznercloud / hcloud-go

A Go library for the Hetzner Cloud API
https://pkg.go.dev/github.com/hetznercloud/hcloud-go/v2/hcloud
MIT License

Availability zone endpoint does not work #192

Closed choffmeister closed 2 years ago

choffmeister commented 2 years ago

Tested it on several different machines in the Hetzner Cloud, but http://169.254.169.254/hetzner/v1/metadata/availability-zone (dispatched here) always returns an HTTP 404:

curl -i http://169.254.169.254/hetzner/v1/metadata/availability-zone
HTTP/1.1 404 Not Found
Server: fasthttp
Date: Fri, 18 Feb 2022 18:45:54 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 27

availability-zone not found

This is a problem, since this call is used in the csi-driver, which then always gets back the response body "availability-zone not found" and parses from it that the availability zone is called "availability" instead of, for example, "nbg1".

Note: Other endpoints like http://169.254.169.254/hetzner/v1/metadata or http://169.254.169.254/hetzner/v1/metadata/public-ipv4 work just fine.
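A defensive caller could detect that failure body explicitly instead of parsing it as a zone name. A minimal sketch (the helper name below is mine, not part of the csi-driver):

```shell
# is_missing_az: treat the known 404 body (or an empty response) as
# "no availability zone", rather than letting it be parsed as a zone name.
is_missing_az() {
  case "$1" in
    "availability-zone not found"|"") return 0 ;;  # known failure body
    *) return 1 ;;                                 # plausible real zone
  esac
}

# Demonstrate with the exact body from the 404 response above:
if is_missing_az "availability-zone not found"; then
  echo "metadata service did not provide an availability zone"
fi
```

In real use the argument would come from the curl call shown above.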

LKaemmerling commented 2 years ago

Hey @choffmeister,

the endpoint only works on newer servers (and servers that are not created from a snapshot). Therefore it works as expected :)

choffmeister commented 2 years ago

@LKaemmerling The server is new (created yesterday) but created from a snapshot (we are using Talos for K8s).

That is a bummer. Will this change? Because if it stays that way, it would mean that we always have to run a forked csi-driver. That is not too big of a problem, but I just wonder why it is this way. It also means that if someone recovers from a snapshot, things will not work like they did before the recovery.

Edit: Just found out that Talos can be installed without starting from a snapshot (though starting from one is faster). So thanks for pointing me in the right direction. I would still be very interested to know why servers created from snapshots cannot get the availability zone (I guess they live somewhere just like a freshly created server :smile:)

LKaemmerling commented 2 years ago

@choffmeister because we need to keep it backwards compatible. We do not know whether the snapshot the server is created from already has the new cloud-init datasource; we only know this for our own system images.
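For anyone checking whether a particular snapshot image is affected: cloud-init reports the datasource it booted with (via the cloud-id command, or in /run/cloud-init/status.json), and cloud-init's Hetzner datasource is named "hetzner" / "DataSourceHetzner". A rough sketch under that assumption; treating any other datasource as "legacy" is my guess, not documented behavior:

```shell
# has_hetzner_datasource: succeed if a cloud-init datasource string names
# the Hetzner datasource. The input would normally come from running
# `cloud-id` (lowercase name) or from /run/cloud-init/status.json
# ("DataSourceHetzner...") on the server itself.
has_hetzner_datasource() {
  case "$1" in
    hetzner|DataSourceHetzner*) return 0 ;;
    *) return 1 ;;
  esac
}

has_hetzner_datasource "hetzner" && echo "image booted with the Hetzner datasource"
has_hetzner_datasource "nocloud" || echo "legacy image: newer metadata fields may be missing"
```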

choffmeister commented 2 years ago

That makes a lot of sense. Thanks for sharing! Will find a way to bootstrap our k8s nodes without using snapshots.

sergelogvinov commented 2 years ago

@choffmeister because we need to keep it backwards compatible. We do not know whether the snapshot the server is created from already has the new cloud-init datasource; we only know this for our own system images.

In my opinion, this already breaks backwards compatibility, because before this change, servers created from snapshots did have access to the metadata server.

Many cloud providers add an option to switch the metadata server on/off at create time.

LKaemmerling commented 2 years ago

The metadata service is accessible, just some fields are missing for older servers.

choffmeister commented 2 years ago

Though I wonder: would it really break anything if the new endpoints (my understanding is that /region and such are completely new endpoints) were visible to old servers? It should not be a problem, or am I missing something?

sergelogvinov commented 2 years ago

I am a little bit confused. As far as I know, Hetzner Cloud does not have its own Kubernetes-as-a-service solution, but it has very good CCM/CSI plugins.

And now those plugins only work with a few Hetzner OS images. And you cannot make pre-built images based on Hetzner images either.

This looks like vendor lock-in. Very sad news. Very sad decision...

choffmeister commented 2 years ago

@sergelogvinov The plugins still work fine if the installation process is adjusted. But it is indeed more complicated now (especially if you have many servers to bootstrap), as you always have to start from a known Hetzner base image and then do a live in-place installation. For example, this works out fine for what we use (Talos):

log "Creating server in rescue mode..."
# Create the server powered off so nothing boots from the base image yet
hcloud server create --name ${NODE_NAME} \
  --image debian-11 \
  --type ${SERVER_TYPE} \
  --ssh-key ${SSH_KEY} \
  --location ${HCLOUD_LOCATION} \
  --user-data-from-file ${NODE_CONFIG} \
  --start-after-create=false
# Enable rescue mode and boot into it
hcloud server enable-rescue ${NODE_NAME} --ssh-key ${SSH_KEY}
hcloud server poweron ${NODE_NAME}
# From the rescue system, write the Talos image directly to the disk
cat << EOF | hcloud server ssh ${NODE_NAME}
cd /tmp
wget -O /tmp/talos.raw.xz https://github.com/talos-systems/talos/releases/download/v0.14.2/hcloud-amd64.raw.xz
xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync
EOF
# Reboot from the freshly written disk
hcloud server shutdown ${NODE_NAME}
hcloud server poweron ${NODE_NAME}

But @LKaemmerling, one thing I would still like to understand, if you could be so kind:

omBratteng commented 2 years ago

@LKaemmerling the CSI driver has the hcloud API token. Instead of querying the availability-zone metadata endpoint, would it be possible for the CSI driver to query the instance-id endpoint to get the instance ID, and then make an API call to Get a Server, which includes information about the datacenter:

{
    "...",
    "datacenter": {
        "id": 3,
        "name": "hel1-dc2",
        "description": "Helsinki 1 DC 2",
        "location": {
            "id": 3,
            "name": "hel1",
            "description": "Helsinki DC Park 1",
            "country": "FI",
            "city": "Helsinki",
            "latitude": 60.169855,
            "longitude": 24.938379,
            "network_zone": "eu-central"
        },
        "...",
    },
    "...",
}

I think that would be backwards compatible?
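A sketch of that fallback in shell. HCLOUD_TOKEN and the helper function are my own names; the endpoint paths are the public metadata and Hetzner Cloud API ones, and a real implementation should use a proper JSON parser rather than sed:

```shell
# location_from_server_json: pull the location "name" (e.g. "hel1") out of
# a "Get a Server" response body. Illustration only; use a real JSON
# parser in production code.
location_from_server_json() {
  printf '%s' "$1" | tr -d '\n' \
    | sed -n 's/.*"location":[^{]*{[^}]*"name": *"\([^"]*\)".*/\1/p'
}

# Only talk to the network when a token is configured.
if [ -n "${HCLOUD_TOKEN:-}" ]; then
  INSTANCE_ID=$(curl -fsS --connect-timeout 2 \
    http://169.254.169.254/hetzner/v1/metadata/instance-id)
  SERVER_JSON=$(curl -fsS \
    -H "Authorization: Bearer ${HCLOUD_TOKEN}" \
    "https://api.hetzner.cloud/v1/servers/${INSTANCE_ID}")
  location_from_server_json "${SERVER_JSON}"
fi
```

Compared to reading the metadata endpoint this costs one extra API request per node start, but it would work regardless of the image the server was created from.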

omBratteng commented 2 years ago

And I see the CSI driver already gets the instance-id, just before getting the availability-zone: https://github.com/hetznercloud/csi-driver/blob/main/cmd/node/main.go#L27-L31