puppetlabs / puppetlabs-pecdm

Puppet Bolt driven fusion of puppetlabs/peadm and Terraform.
Apache License 2.0
14 stars · 18 forks

Unable to run provision again on aws #93

Closed jessereynolds closed 1 year ago

jessereynolds commented 1 year ago

Describe the Bug

On a previously built pecdm deployment, after pulling down the latest commits to main, I'm unable to re-run the provision plan. (I'm not sure whether I'm supposed to be able to re-run a provision, or if I've missed some steps after pulling down the latest commits.)

jesse@Control-Surface puppetlabs-pecdm % bolt module install --no-resolve
Installing project modules

  → Syncing modules from /Users/jesse/src/puppet/tf/puppetlabs-pecdm/Puppetfile
    to /Users/jesse/src/puppet/tf/puppetlabs-pecdm/.modules

  → Generating type references

Metadata for task 'node_manager::update_classes' contains unknown keys: summary. This could be a typo in the task metadata or might result in incorrect behavior. [ID: unknown_task_metadata_keys]
Successfully synced modules from /Users/jesse/src/puppet/tf/puppetlabs-pecdm/Puppetfile to /Users/jesse/src/puppet/tf/puppetlabs-pecdm/.modules
jesse@Control-Surface puppetlabs-pecdm % bolt plan run pecdm::provision --params @params.json
Starting: plan pecdm::provision
Input Puppet Enterprise console password now or accept default. [puppetlabs]:
Starting: plan pecdm::subplans::provision
Starting: task terraform::initialize on localhost
Finished: task terraform::initialize with 0 failures in 0.18 sec
Starting infrastructure provisioning for a standard deployment of Puppet Enterprise
Starting: plan terraform::apply
Starting: task terraform::apply on localhost
Finished: task terraform::apply with 0 failures in 14.66 sec
Starting: task terraform::refresh on localhost
Finished: task terraform::refresh with 1 failure in 0.2 sec
Finished: plan terraform::apply in 14.86 sec
Finished: plan pecdm::subplans::provision in 16.1 sec
Finished: plan pecdm::provision in 37.2 sec
Failed on localhost:
  The task failed with exit code 1 and no stdout, but stderr contained:
  /tmp/2f905ef8-91de-4759-bee1-8f3bc00a5e0c/terraform/tasks/refresh.rb:31:in `<main>': uninitialized constant TerraformOutput (NameError)
Failed on 1 target: localhost
Ran on 1 target
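The `uninitialized constant TerraformOutput (NameError)` above means refresh.rb referenced a constant that was never defined or loaded before that line ran. A minimal sketch (hypothetical, not the actual pecdm code) of that failure mode:

```ruby
# Minimal sketch (hypothetical, not the actual pecdm code) of the failure
# mode in refresh.rb: referencing a constant that was never defined or
# required raises NameError at the point of use.

# Returns the NameError message produced by touching an undefined constant,
# which here stands in for a helper class whose file was never loaded
# (e.g. a missing require_relative in the shipped module version).
def touch_undefined_constant
  TerraformOutput.new
rescue NameError => e
  e.message
end

puts touch_undefined_constant  # => uninitialized constant TerraformOutput
```

This is why the error only appears at task run time: Ruby resolves constants lazily, so a missing require is invisible until the first line that uses the constant executes.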

Expected Behavior

Running the provision plan again should verify that the deployed infrastructure matches the desired state and correct anything that is incorrect or not yet set up.

Steps to Reproduce

Build a standalone PE server on AWS with pecdm, with parameters similar to those below, using a version of pecdm from 15 Nov 2022:

{
    "project"              : "jr-ape",
    "version"              : "2021.7.1",
    "architecture"         : "standard",
    "cluster_profile"      : "development",
    "provider"             : "aws",
    "cloud_region"         : "ap-southeast-2",
    "firewall_allow"       : [ "10.0.0.1/32" ],
    "ssh_pub_key_file"     : "/Users/jesse/.ssh/id_rsa.pub",
    "extra_terraform_vars" : {
        "tags" : {
            "lifetime"   : "2d",
            "department" : "PS",
            "created_by" : "jesse.reynolds@puddle.example"
        }
    }
}

Update to f17ada9bdb47b0155f024e7389106b4ae1f967d2

Stop and start the PE instance (it receives a new IP address and internal DNS name).

Update the bolt project modules as above.

Run the provision plan as above.

Environment

Additional Context

I could be doing multiple daft things based on incorrect assumptions.

I noticed that I could not use pecdm::destroy prior to updating pecdm as it gave the following error:

jesse@Control-Surface puppetlabs-pecdm % bolt plan run pecdm::destroy provider=aws
Starting: plan pecdm::destroy
Starting: plan pecdm::subplans::destroy
Destroying Puppet Enterprise deployment on aws
Starting: task terraform::initialize on localhost
Finished: task terraform::initialize with 0 failures in 0.17 sec
Starting: plan terraform::destroy
Starting: task terraform::destroy on localhost
Finished: task terraform::destroy with 1 failure in 6.99 sec
Finished: plan terraform::destroy in 7.0 sec
Finished: plan pecdm::subplans::destroy in 8.01 sec
Finished: plan pecdm::destroy in 8.02 sec
Failed on localhost:

  Error: Error in function call

    on modules/networking/main.tf line 7, in locals:
     7:   vpc_id        = try(aws_vpc.pe[0].id, data.aws_vpc.existing[0].id)
      ├────────────────
      │ while calling try(expressions...)
      │ aws_vpc.pe is empty tuple
      │ data.aws_vpc.existing is empty tuple

  Call to function "try" failed: no expression succeeded:
  - Invalid index (at modules/networking/main.tf:7,33-36)
    The given key does not identify an element in this collection value: the collection has no elements.
  - Invalid index (at modules/networking/main.tf:7,62-65)
    The given key does not identify an element in this collection value: the collection has no elements.

  At least one expression must produce a successful result.
Failed on 1 target: localhost
Ran on 1 target
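Terraform's `try()` returns the result of the first expression that evaluates without error; here both `aws_vpc.pe[0]` and `data.aws_vpc.existing[0]` index empty tuples, so every candidate fails and `try()` itself errors out. A rough Ruby analogue of that evaluation order (the function and variable names are mine, for illustration only):

```ruby
# Rough Ruby analogue of Terraform's try(): evaluate candidate lambdas in
# order and return the first result that does not raise; fail if none succeed.
def tf_try(*candidates)
  errors = []
  candidates.each do |c|
    begin
      return c.call
    rescue StandardError => e
      errors << e.message
    end
  end
  raise "no expression succeeded: #{errors.join('; ')}"
end

pe_vpcs       = []  # stands in for the empty aws_vpc.pe tuple
existing_vpcs = []  # stands in for the empty data.aws_vpc.existing tuple

begin
  tf_try(-> { pe_vpcs.fetch(0) }, -> { existing_vpcs.fetch(0) })
rescue RuntimeError => e
  puts e.message  # both index lookups fail, mirroring the destroy error
end
```

In other words, the `try()` in modules/networking/main.tf only fails when the state contains neither a managed nor a data-sourced VPC, which is consistent with the state having been emptied or never refreshed.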

I've tried the following commands and they work as expected (no errors):

(cd .terraform/aws_pe_arch && terraform state list)
(cd .terraform/aws_pe_arch && terraform refresh)

But after this, the provision gives a different error, and now I suspect I've taken multiple bad paths:

jesse@Control-Surface puppetlabs-pecdm % bolt plan run pecdm::provision --params @params.json
Starting: plan pecdm::provision
Input Puppet Enterprise console password now or accept default. [puppetlabs]:
Starting: plan pecdm::subplans::provision
Starting: task terraform::initialize on localhost
Finished: task terraform::initialize with 0 failures in 0.18 sec
Starting infrastructure provisioning for a standard deployment of Puppet Enterprise
Starting: plan terraform::apply
Starting: task terraform::apply on localhost
Finished: task terraform::apply with 1 failure in 34.38 sec
Finished: plan terraform::apply in 34.39 sec
Finished: plan pecdm::subplans::provision in 35.51 sec
Finished: plan pecdm::provision in 37.02 sec
Failed on localhost:

  Error: Error import KeyPair: InvalidKeyPair.Duplicate: The keypair 'pe_adm_14e8e3' already exists.
    status code: 400, request id: 1583fd38-b90a-4b26-a810-85c8979f40c8

    with module.instances.aws_key_pair.pe_adm,
    on modules/instances/main.tf line 54, in resource "aws_key_pair" "pe_adm":
    54: resource "aws_key_pair" "pe_adm" {

Failed on 1 target: localhost
Ran on 1 target
nigelkersten commented 1 year ago

Jesse, we're going to have @ody have a look at this when he's back from PTO next week

ody commented 1 year ago

@jessereynolds Sorry about this. This was my fault. I introduced a mistake in the puppetlabs-terraform module, updated pecdm's Puppetfile to that ref, then noticed the mistake and fixed the puppetlabs-terraform module, but forgot to update pecdm's Puppetfile again. This change will fix the issue: https://github.com/puppetlabs/puppetlabs-pecdm/commit/cc5575db9e64e9fdb27f740199233275eb52e7de
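The fix described above amounts to re-pinning puppetlabs-terraform in pecdm's Puppetfile to a ref that contains the corrected code. A sketch of what such a pin looks like (the ref value is a placeholder, not the actual SHA):

```ruby
# Hypothetical Puppetfile excerpt: a module pinned to a specific git ref.
# '<fixed-commit-sha>' is a placeholder, not the real commit.
mod 'puppetlabs-terraform',
  git: 'https://github.com/puppetlabs/puppetlabs-terraform.git',
  ref: '<fixed-commit-sha>'
```

After updating the pin, re-running `bolt module install --no-resolve` (as in the original report) syncs the corrected module into `.modules`.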

jessereynolds commented 1 year ago

No worries @ody ! Thank you for the explanation.

So a re-provision is something that I should expect to work?

ody commented 1 year ago

@jessereynolds PEADM is not idempotent, so if it fails at some point during the deployment process, re-running it can either fail or finish "successfully" with an unknown set of misconfigurations. If you use PECDM to destroy the infrastructure and then provision again, PEADM will finish correctly.

PECDM does not currently check the status of PE on nodes, so it is not capable of determining that a previous PEADM deployment attempt failed and automatically re-provisioning the infrastructure. It's a feasible and interesting idea, but also a dangerous one: I could imagine scenarios where people accidentally destroy environments before they were prepared to do so.