[Bug]: Server creation from the same snapshot sometimes takes abnormally long

frederic-arr commented 3 months ago

What happened?

When creating several servers from snapshots, the provider sometimes stays in the "Still creating..." state for an additional minute, even though the server has already been ready for a while.

# Omitted
hcloud_server.arm_small[2]: Creation complete after 1m9s [id=50992996]
hcloud_server.arm_small[0]: Creation complete after 1m9s [id=50992998]
hcloud_server.arm_small[1]: Creation complete after 1m10s [id=50992999]
hcloud_server.arm_small[4]: Still creating... [1m10s elapsed]
hcloud_server.arm_small[3]: Still creating... [1m10s elapsed]
# The servers are ready soon after this message
# Omitted
hcloud_server.arm_small[4]: Creation complete after 2m11s [id=50992997]
hcloud_server.arm_small[3]: Creation complete after 2m11s [id=50993000]

What did you expect to happen?

The creation should be marked as complete as soon as possible (without exceeding the API rate limit).

Please provide a minimal working example

terraform {
  required_providers {
    hcloud = {
      source  = "hetznercloud/hcloud"
      version = "1.48.0"
    }
  }
}

variable "hcloud_token" {
  type      = string
  sensitive = true
}

variable "image_id_arm" {
  type = string
}

provider "hcloud" {
  token = var.hcloud_token
}

resource "hcloud_server" "arm_small" {
  name        = "arm-small-${count.index}"
  image       = var.image_id_arm
  server_type = "cax11"
  location    = "nbg1"
  public_net {
    ipv4_enabled = false
    ipv6_enabled = true
  }

  count = 5
}

var.image_id_arm is the image ID of my snapshot. The snapshot was made in NBG1 and is 350 MB. It was built with the following Packer file:

# hcloud.pkr.hcl

packer {
  required_plugins {
    hcloud = {
      source  = "github.com/hetznercloud/hcloud"
      version = "~> 1"
    }
  }
}

variable "talos_version" {
  type    = string
  default = "v1.7.5"
}

variable "arch" {
  type    = string
  default = "arm64"
}

variable "server_type" {
  type    = string
  default = "cax11"
}

variable "server_location" {
  type    = string
  default = "nbg1"
}

locals {
  image = "https://github.com/siderolabs/talos/releases/download/${var.talos_version}/hcloud-${var.arch}.raw.xz"
}

source "hcloud" "talos" {
  rescue       = "linux64"
  image        = "debian-12"
  location     = var.server_location
  server_type  = var.server_type
  ssh_username = "root"

  snapshot_name   = "talos system disk - ${var.arch} - ${var.talos_version}"
  snapshot_labels = {
    type    = "infra",
    os      = "talos",
    version = var.talos_version,
    arch    = var.arch,
  }
}

build {
  sources = ["source.hcloud.talos"]

  provisioner "shell" {
    inline = [
      "apt-get install -y wget",
      "wget -O /tmp/talos.raw.xz ${local.image}",
      "xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync",
    ]
  }
}

apricote commented 3 months ago

The API client currently uses an exponential backoff while waiting for actions. The version in the provider does not cap the interval, so the gap between checks keeps growing and can reach a point where there are minutes between each check.

We have added an alternative "truncated" backoff in a recent version of hcloud-go: https://github.com/hetznercloud/hcloud-go/pull/459

Unfortunately, the hcloud-go version used in the Terraform provider is stuck on the v1 branch (see #877).

For now, you can switch to a constant backoff by modifying your provider configuration:

provider "hcloud" {
  token = var.hcloud_token

  poll_function = "constant"
  poll_interval = "5s" # default is 500ms, which might cause rate-limit issues with the constant polling
}

github-actions[bot] commented 6 days ago

This issue has been marked as stale because it has not had recent activity. The bot will close the issue if no further action occurs.
