oxidecomputer / terraform-provider-oxide

Oxide Terraform provider
Mozilla Public License 2.0
20 stars 3 forks source link

Disk attached to an instance should honor the order of disks passed from yaml #338

Closed askfongjojo closed 1 month ago

askfongjojo commented 2 months ago

Preliminary checks

What was the expected behaviour

In a customer-reported issue, we saw that the disk defined in the disk_attachments array for an instance end up in un-deterministic slot numbers:

resource "oxide_instance" "mysql" {
  count            = 8

  project_id       = data.oxide_project.default.id
  name             = "mysql-${count.index}"
  host_name        = "mysql-${count.index}"
  description      = "MySQL Instance"

  ncpus            = 4
  memory           = 17179869184
  ssh_public_keys  = [
    data.oxide_ssh_key.default.id,
  ]
  disk_attachments = [
    oxide_disk.mysql_boot[count.index].id,
    oxide_disk.mysql_data[count.index].id,
  ]
root@[fd00:1122:3344:116::3]:32221/omicron> select instance.name, attach_instance_id, disk.time_created, slot, disk.id, disk.name from disk, instance where disk.attach_instance_id = instance.id and instance.name like 'mysql%' order by instance.name, slot;
   name   |          attach_instance_id          |         time_created          | slot |                  id                  |     name
----------+--------------------------------------+-------------------------------+------+--------------------------------------+---------------
  mysql-0 | 9cbad5a2-ec1f-4657-a5e7-b1b503851fdd | 2024-08-21 22:37:22.544718+00 |    0 | 5c49ad6a-c061-4293-8682-88e72f82d690 | mysql-boot-0
  mysql-0 | 9cbad5a2-ec1f-4657-a5e7-b1b503851fdd | 2024-08-21 22:37:27.773181+00 |    1 | 7e6bf05c-7124-46a8-b752-5a08d0cfb278 | mysql-data-0
  mysql-1 | 74bf845c-a424-4875-b786-0d5ae2068c36 | 2024-08-21 22:37:28.586991+00 |    0 | 16f490a9-4f5b-4e76-8f2f-570fe733002a | mysql-data-1
  mysql-1 | 74bf845c-a424-4875-b786-0d5ae2068c36 | 2024-08-21 22:37:25.562642+00 |    1 | 52dafbc7-f494-4181-b071-98bce1dce955 | mysql-boot-1
  mysql-2 | 9f14cb01-73d4-4587-baca-deccdc220b27 | 2024-08-21 22:37:28.582404+00 |    0 | edbbae16-292d-4f15-8893-3662054e87b3 | mysql-data-2
  mysql-2 | 9f14cb01-73d4-4587-baca-deccdc220b27 | 2024-08-21 22:37:26.669946+00 |    1 | fc24f595-7b94-4f08-871c-e8c91404ca3e | mysql-boot-2
  mysql-3 | 8c3f908d-bce1-49bd-af78-63b796a5b71f | 2024-08-21 22:37:22.54628+00  |    0 | 0f4691a1-ddb0-4898-a3c7-b09614ecffe1 | mysql-boot-3
  mysql-3 | 8c3f908d-bce1-49bd-af78-63b796a5b71f | 2024-08-21 22:37:22.537804+00 |    1 | 686d5ce5-ab1b-4542-8812-2abda8181075 | mysql-data-3
  mysql-4 | 9c148ff0-801f-4ae1-abee-445cc0b16db0 | 2024-08-21 22:37:22.543865+00 |    0 | 4777f980-9be8-4376-a6da-7731b2a9f5a9 | mysql-boot-4
  mysql-4 | 9c148ff0-801f-4ae1-abee-445cc0b16db0 | 2024-08-21 22:37:22.54681+00  |    1 | 648847b3-41a3-4a6d-88da-e0c65b3ea8e8 | mysql-data-4
  mysql-5 | 06cdb670-a358-4881-bd70-6306a02636fb | 2024-08-21 22:37:22.552336+00 |    0 | 4e763163-5899-45c8-8994-b7628a97c469 | mysql-boot-5
  mysql-5 | 06cdb670-a358-4881-bd70-6306a02636fb | 2024-08-21 22:37:22.511421+00 |    1 | 7131af02-f631-4d1a-86f1-65f4961cbce2 | mysql-data-5
  mysql-6 | c5d678d4-3e4e-428c-8455-d9cfa5aa1e95 | 2024-08-21 22:37:22.545522+00 |    0 | 1d9b5c66-d6f4-41bb-8cbf-eaf4c6a3959f | mysql-boot-6
  mysql-6 | c5d678d4-3e4e-428c-8455-d9cfa5aa1e95 | 2024-08-21 22:37:22.525998+00 |    1 | ea7dae53-2b14-44d0-bf8a-e85ebb88f80e | mysql-data-6
  mysql-7 | 75239f31-f136-4751-8c95-e7259fb86172 | 2024-08-21 22:37:22.541243+00 |    0 | 5159fa8f-1678-457e-8c7a-63becaa5872b | mysql-data-7
  mysql-7 | 75239f31-f136-4751-8c95-e7259fb86172 | 2024-08-21 22:37:23.812351+00 |    1 | 5c4b8dca-a4f0-4e04-b98c-c3e75690c468 | mysql-boot-7

The first disk in the array (the boot disk) is supposed to be assigned to slot 0. But as seen above, for the 8 instances created with the TF provider, they appear to assign the slot based on the disk uuid sort order which is consistent with the customer's observations.

What is the current behaviour and what actions did you take to get there

TF provider needs to honor the order in which the disks are passed in the array when passing them to the API.

Provider version

0.3.0

Terraform version

1.4.4

Operating system

MacOS

Anything else you would like to add?

No response

askfongjojo commented 2 months ago

In my case, my instances still booted fine as the bootloader was somehow able to locate the boot disk in spite of the swapped slot numbers. I wasn't able to totally replicate the customer issue but until RFD 470 is implemented, the only way to allow the system to try the intended boot disk first is by using the disk name order.

cc @twinfees

karencfv commented 2 months ago

Sadly, this is not something that can easily be fixed behind the scenes.

The main issue here is that after Terraform creates the resources it performs a read to verify that the state of the resources matches the tf plan (the HCL configuration that was passed). This works pretty easily for most resources, but due to the shape of the instance related endpoints we had to get a bit more creative with this resource and ended up cobbling up several endpoints to create an instance resource.

This all means that after the instance and corresponding disks are created I have to call two (or more if there is a network interface) endpoints to retrieve information about the resources. The GET instance endpoint does not retrieve information about the disks. So I need to call the GET instance disks endpoint to retrieve information about the associated disks. This list of disks is NOT in the same order as when the instance was created, so there is no way of knowing which disk was the boot disk from that list.

The way I see it there are two ways of fixing this:

I'm not sure which of the two is a preferable outcome for our customers. WDYT @askfongjojo @twinfees?

karencfv commented 2 months ago

Also possibly related https://github.com/oxidecomputer/omicron/issues/5112

askfongjojo commented 2 months ago
  • Make up a "boot_disk" attribute on terraform and perform some behind the scenes manipulation of the retrieved data to make it match the state of the resource. This approach has the benefit of being a faster fix, but has a very big drawback of deviating a lot from the endpoints themselves, become a potentially brittle resource, and most likely cause the customer's TF configurations to break in a future release.

Given the boot order work, the second approach you mentioned probably makes more sense. I was hoping that we could fix the client to work around the issue (pretty much like how we did it in the CLI) but the long-term fix of having an explicitly designed boot disk will be more worthwhile.

karencfv commented 1 month ago

Updating this issue. After a chat at the Product Huddle we've decided to wait for the fix in the API for this one

karencfv commented 1 month ago

Related https://github.com/oxidecomputer/omicron/pull/6585