kkm000 opened this issue 2 years ago
@rileykarson, thank you so much for putting this into the dev timeline, that's super! I think I can write this change myself; it's quite a simple addition. I have a question about the right way to contribute, tho. I tried to absorb the contrib guides in this repo, in the `terraform-provider-google-beta` repo, and in Magic Modules. The principal distinction is the one between autogenerated and handwritten resources. The attached_disk resource is handwritten, and will probably stay that way for a while, since there is no CRUD counterpart in the GCE API. Also, both the GCE API and the proposed change are identical between compute/v1 and compute/beta.
In my understanding, for such a case the correct sequence of steps is:
Did I get the process right?
Rather than making the changes here first, I'd recommend making them against https://github.com/GoogleCloudPlatform/magic-modules/ from the outset. We're able to upstream changes to that repo, but it's less work overall if they start there in the first place.
@rileykarson, I admit I don't grok how PRs go through the CI test pipelines here when they originate in the MM repo (they have to be committed there before getting here, where the tests are run), but I'll do as you prefer, naturally! Come to think of it, something has to fly over my head anyway, otherwise the Modules wouldn't be called Magic in the first place... :))
This repo only runs unit tests as a presubmit; there are ~100 unit tests and ~1800 acceptance tests in the repo. In the MM repo we run both the unit tests and the acceptance tests against the provider code that will be generated from your changes, using a mixture of a few different systems. The downside is that it's low visibility for contributors, unfortunately (it'd be too easy to exfiltrate credentials otherwise), so you're a little more reliant on your reviewer to share results than would be ideal.
To the service team:
To put this ticket succinctly, as explained in its summary: the request is to add the fields `boot = {true|false}` and `auto_delete = {true|false}` to `google_compute_attached_disk`.
FR Summary
Add arguments `boot = {true|false}` and `auto_delete = {true|false}` to the `google_compute_attached_disk` resource. Document them as mindbogglingly advanced, but please, please add them!

I'm in dire straits due to an unannounced change in either the GCE API or the (then-alpha) handling of boot disks by the GCP Deployment Manager (GDM).
Problem Exposition
I must lay out my case very clearly for it to be convincing. Sorry, but this has got to be a long and sad story.
It is entirely possible that I do not know how to Terraform. It is also possible that Terraform is not the right tool for the task. But I have previously used Google DM, back when it was still in Alpha, with more or less success; now that GCP is chopping feature after feature off it, I'm forced to search for a replacement. Terraform is very flexible, in the sense that you can figure out a sequence of infra changes outside of Terraform, import them into the state, and then run a plan that is not going to kill and re-create unnecessary stuff. This does not sound like praise, but other IaC software is much more rigid. So this is praise: while TF is not the Holy Grail of IaC, it is as close to (if you're an optimist), or the least far from (otherwise), what you would ideally want. After looking at multiple options, I want to stick with it.
Raise your hand if you truly believe in your heart that a module culminating in a `provisioner "local-exec" { command = "gcloud ..." }` should require 16 (sixteen, one six, do you copy, Houston?) intricately interdependent `null_resource`s. Most of that complexity probably comes from limitations imposed for a good reason. But this source looks like its author fought against TF and finally won the battle, despite solemnly burying a hundred of their most precious mates: the remaining hours of their only earthly life. My code is going in the same direction. I'm crying for mercy: unlike in a real battle, reinforcements will never come. Every heroically fallen hour is gone, and that is that. And what I'm implementing is exactly a variant of the `google_compute_attached_disk` resource with a `--boot` flag via `gcloud` local-exec calls; a minimal sketch of the core step is below.
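For concreteness, this is roughly the shape it takes (a sketch only: the instance name `control`, the disk `control_boot`, the device name and the zone are placeholders, not my real module, and the instance must already be TERMINATED for a boot-disk attach to be accepted):

```hcl
# Minimal sketch; names and zone are placeholders. The instance must be
# stopped (TERMINATED) before a boot disk can be attached.
resource "null_resource" "attach_boot_disk" {
  # Re-run the attachment whenever the boot disk is replaced.
  triggers = {
    disk_id = google_compute_disk.control_boot.id
  }

  provisioner "local-exec" {
    command = <<-EOT
      gcloud compute instances attach-disk control \
        --disk ${google_compute_disk.control_boot.name} \
        --device-name sandal \
        --boot \
        --zone us-east1-c
    EOT
  }
}
```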
New or Affected Resource(s)

`google_compute_attached_disk`
I am maintaining a computational cluster. The cluster has 3 types of nodes: login nodes, which users log on to, and two other types that I'm lumping together for the purpose of this question/FR. A login node's boot disk is supposed to be maintained as if it were a hardware machine, i.e. with the distro's package manager. The other nodes' boot disks are never maintained, but instead simply replaced from a new image, monthly or whenever changes are needed. In my architecture, boot disks contain only the system and NVIDIA drivers, while all cluster management and problem-domain computation software lives on a different disk, one per cluster, mounted READ_ONLY to every node (including a formidable flock of preemptible nodes spawned by the cluster controller).
Thus, the OS software domain is separated from the problem software domain by design. A cluster may be created for a different HPC workload with a different problem-domain software disk shared by all nodes (akin to the `modules` software traditionally used in a hardware cluster), except that in the cloud and with preemptible workers, it's more natural to deploy a separate cluster per job type: jobs do not have to wait for days for hardware resources to become available and then configure their nodes with the right software, since there is an unlimited supply of virtual hardware, which can be configured in the right VM shape for a job.

Previously, I simply detached and killed boot disks by running GDM with a variable like `--properties=with_boot_disks=false`, then recreated and attached them with `--properties=with_boot_disks=true` (GDM Jinja snippet abbreviated).
This is the part of GDM that broke and prompted me to search for a replacement. Note, however, that the two-phase process is idempotent with respect to partial failures: if the disk failed to detach, it would detach on a subsequent successful run; if it had already been detached and killed successfully, the first phase, with !BOOT, was a no-op.
The breakage is possibly related to a Compute API change. In the new approach, which I traced through audit logs, the GCP API simply sends a POST request for the VM, and with !BOOT the request does not contain a boot disk. This results in a cryptic API error, which may be an API bug (or a fool-proofing restriction, to which I'd object that the users would come up with a better fool), to the effect that the resulting VM configuration is inconsistent w.r.t. the guestOsFeatures UEFI_COMPATIBLE and VIRTIO_SCSI_MULTIQUEUE. This does not make much sense, as these are properties only of the boot disk, in turn inherited from the image it is created from, and are not properties of an instance. But pragmatically, it is what it is, and the ball for working around the newly introduced bogosity is in my court. I am going to file a bug, but I cannot wait for its resolution; that is assuming it is a bug and not a five-ton feature dropped straight down on me without any warning.
To summarize: I want to replace an instance's boot disk with a freshly imaged one without recreating the instance.
It is not possible with the terraform-provider-google. If I place an `image =` into the `boot_disk` block, the instance has to be recreated when I change its image source. If I create the boot disk as a separate resource and refer to it in the `boot_disk` block of the instance, the instance is still recreated; a sketch of both dead ends is below.
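A minimal sketch of the two dead ends, with placeholder names and values:

```hcl
# Sketch only; names, machine type and image are placeholders.
variable "boot_image" {
  type    = string
  default = "projects/debian-cloud/global/images/family/debian-12"
}

resource "google_compute_instance" "control" {
  name         = "control"
  machine_type = "e2-medium"
  zone         = "us-east1-c"

  boot_disk {
    # Variant 1: image given inline; changing var.boot_image replaces the
    # whole instance.
    initialize_params {
      image = var.boot_image
    }

    # Variant 2 (instead of initialize_params): refer to a separately managed
    # disk; replacing that disk still replaces the instance.
    # source = google_compute_disk.control_boot.id
  }

  network_interface {
    network = "default"
  }
}
```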
Recreating an instance may not be a big deal, as these are expendable, but here I run into a different issue. Before reimaging, I must be absolutely sure that changes made outside of TF are refreshed back into the state. For example, the cluster controller metadata contains 3 essential items (the image to create preemptible nodes from, the problem-SW disk, and, most important, the cluster controller configuration file, as a tarred-gzipped-base64-encoded item that must be preserved across the disk replacement). Users sometimes change them outside of TF. Besides, the cluster owner must be able to do the reimaging. TF is designed more as a tool for an IT person than for an IT-inane number cruncher, so I have to write super-careful scripts. In this case, analysing the proposed -refresh-only plan would be too complex. Yes, I can read it in as JSON, but I cannot write a reliable analyser that would aye or nay the refresh while handling all corner cases. Users mangle their setups, so there could be changes that must be remembered in the TF state, and changes that must be reverted based on the state. These must be untangled, and I do not want to write a program that deals with this concoction of tightballed spaghetti in superglue sauce.

Potential Terraform Configuration
What looks to naive me like a simple change would have almost solved my problem in almost one line: a new argument of the `google_compute_attached_disk` resource, `boot = true` (the "almost" part is request 2 below, for the `auto_delete` argument). Yes, it's an advanced usage, but the whole `google_compute_attached_disk` resource is documented as advanced, and its documentation is full of yellow warning boxes. Put more warnings to the user that this is an advanced property of an advanced resource, but pray, please, please, please add it! It's already in the GCE API.

Add two arguments to the `google_compute_attached_disk` resource:

1. `boot = {true|false}`. Optional; the default is, naturally, `false`.
2. `auto_delete = {true|false}`. Optional. Probably allowed even if `boot` is not provided (GCE allows `auto_delete` on non-boot disks). The default for the special case of `boot = true` is up to your judgment. GCE's own default is `false` in all cases, and that seems uncontroversial for non-boot disk attachment. One possibility is to deviate from GCE and default this to `true` if and only if `boot = true`, but that may run contrary to the provider's design philosophy, so this is only a weak suggestion. The argument for `false` is that it is the GCE API default in all cases. The argument for `true` in this special case is that, whichever way an instance acquires its boot disk, be it via the `boot_disk` block or via the `google_compute_attached_disk` resource with `boot = true`, the attached disk property `auto_delete` would consistently default to `true`.

A modified example from the Example Usage section of the documentation
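Roughly the following (a sketch only; the disk name, image and zone are placeholders, the `control` instance is the one sketched above, extended with `lifecycle { ignore_changes = [attached_disk] }` as the resource's documentation already recommends, and `boot`/`auto_delete` are the proposed new arguments):

```hcl
# Sketch of the proposed usage; boot and auto_delete are the requested
# additions and do not exist in the provider today.
resource "google_compute_disk" "control_boot" {
  name  = "yes-control-boot"
  image = "projects/debian-cloud/global/images/family/debian-12"  # placeholder
  zone  = "us-east1-c"
}

resource "google_compute_attached_disk" "control_boot" {
  disk        = google_compute_disk.control_boot.id
  instance    = google_compute_instance.control.id
  device_name = "sandal"  # "boot" appears to be reserved, see below

  boot        = true      # proposed argument 1
  auto_delete = true      # proposed argument 2
}
```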
It seems to me that a plan with `-replace google_compute_disk.control_boot` should make all the magic happen. Actually, I even tested that, and it is so tantalizingly close to working (as long as the boot disk is not named "boot"; I had to rename it "sandal" for the test), except that the provider reattaches the newly reimaged replacement disk with `boot = false`, quite naturally; the debug log showed a POST request to the `https://compute.googleapis.com/compute/.../attachDisk` endpoint with a payload of just two arguments, `{ "deviceName": "sandal", "source": "zones/us-east1-c/disks/yes-control-boot" }`. Detaching and manually reattaching the disk with gcloud as a `--boot` disk worked just fine, and TF was not even surprised (nothing but the minor discrepancies, which have tracking issues here, and which I usually clean up by refreshing).

Additional Notes
The GCP API does support detaching and attaching the boot disk
...provided the instance is TERMINATED.
The `=>>` mark indicates command output, for easier reading. Note that the boot disk is always lifted up to index 0, so detaching and reattaching a boot disk does not shuffle the disk order.
A fool-proofing help from the GCE API: the name `boot` is reserved

Apparently, `device_name = "boot"` is special, although I don't recall seeing it documented anywhere. If `device_name` is "boot" but `boot` is not `true` (false or omitted), the GCP API operation is accepted, but completes with an error:

References
Looking; will add.
b/322601800