hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Vault interpolation in vault stanza (original failure handling fixed) #1956

Open lethalpaga opened 7 years ago

lethalpaga commented 7 years ago

Nomad version

Nomad v0.5.0-rc1 ('a8c8199e413d387021a15d7a1400c8b8372124d6+CHANGES')

Issue

Hi all,

I tried to use interpolation in the vault policies list and ran into two issues. First, when I ran the job, the task got stuck in the received state: the server log shows that the interpolation isn't performed (it would be nice to have), so Nomad wasn't able to fetch the Vault token. Second, when the Vault token can't be retrieved, shouldn't the allocation be marked as failed straight away?

» nomad alloc-status 71faeda1
ID                 = 71faeda1
Eval ID            = 610afb08
Name               = r53-backup.backup-script[0]
Node ID            = 3d4e49a1
Job ID             = r53-backup
Client Status      = complete
Client Description = <none>
Created At         = 11/08/16 15:17:52 NZDT

Task "r53-backup" is "dead"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
100 MHz  128 MiB  0 B   0     

Recent Events:
Time                    Type      Description
11/08/16 15:25:09 NZDT  Killed    Task successfully killed
11/08/16 15:17:56 NZDT  Received  Task received by client

Nomad Server logs

Nov 08 04:20:32 nomad-server1 nomad[4970]:     2016/11/08 04:20:32.291945 [ERR] nomad.node: Vault token creation failed: failed to create token for task "r53-backup": Error making API request.
Nov 08 04:20:32 nomad-server1 nomad[4970]: URL: POST https://vault-experiment.acme.com/v1/auth/token/create/nomad-server
Nov 08 04:20:32 nomad-server1 nomad[4970]: Code: 400. Errors:
Nov 08 04:20:32 nomad-server1 nomad[4970]: * token policies ([aws-${nomad_meta_account}-r53 default]) must be subset of the role's allowed policies ([aws-test-r53 default nomad-server])

Job file

Excerpt:

  task "r53-backup" {
    meta {
      account = "test"
    }
    driver = "docker"
    config {
      image = "r53-backup:latest"

      args = ["/local/backup", "s3://${NOMAD_META_ACCOUNT}-network"]

      volumes = [
         "${NOMAD_TASK_DIR}/secrets/.aws:/root/.aws"
      ]
    }

    vault {
      policies = ["aws-${NOMAD_META_ACCOUNT}-r53"]
      env = false
    }
  }

dadgar commented 7 years ago

Hey, thanks for filing this. Fixed the unrecoverable errors issue! The interpolation issue is much more difficult to tackle, so it will have to wait until after 0.5.0. I am going to update the issue title to reflect that it is now just about interpolation.

camerondavison commented 6 years ago

Any update on this?

nvx commented 2 years ago

Just got bitten by this. HCL2 templating at least works in it, but you still can't rely on any runtime values such as node metadata.

nvx commented 2 years ago

Just looking at how difficult this would be to implement.

Looks like the request to Vault is done by the server here: https://github.com/hashicorp/nomad/blob/8a427a470a5779e756ebb144b6970bd41f252311/nomad/vault.go#L996-L1008 which is called by https://github.com/hashicorp/nomad/blob/8a427a470a5779e756ebb144b6970bd41f252311/nomad/node_endpoint.go#L1594

Notably in the latter we have access to the structs.Node struct which is not passed into CreateToken, but looks like it could easily be added to support this.
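To make the gap concrete, here is a minimal sketch of the step the server would need to perform before asking Vault for a token: expand `${...}` references in the policy names and refuse to continue if any are unresolved. The `interpolatePolicies` helper and its `vars` map are hypothetical and not part of Nomad; it uses the standard library's `os.Expand` rather than Nomad's own interpolation code.

```go
package main

import (
	"fmt"
	"os"
	"sort"
)

// interpolatePolicies expands ${NAME} references in each policy name using
// vars. It returns an error naming any referenced variables missing from
// vars, so a caller could fail fast instead of sending a literal "${...}"
// policy name to Vault. Hypothetical helper, not Nomad's API.
func interpolatePolicies(policies []string, vars map[string]string) ([]string, error) {
	missing := map[string]struct{}{}
	out := make([]string, 0, len(policies))
	for _, p := range policies {
		expanded := os.Expand(p, func(name string) string {
			v, ok := vars[name]
			if !ok {
				missing[name] = struct{}{}
			}
			return v
		})
		out = append(out, expanded)
	}
	if len(missing) > 0 {
		names := make([]string, 0, len(missing))
		for n := range missing {
			names = append(names, n)
		}
		sort.Strings(names)
		return nil, fmt.Errorf("unresolved variables in vault policies: %v", names)
	}
	return out, nil
}

func main() {
	// Values taken from the job file in this issue.
	vars := map[string]string{"NOMAD_META_ACCOUNT": "test"}
	policies, err := interpolatePolicies(
		[]string{"aws-${NOMAD_META_ACCOUNT}-r53", "default"}, vars)
	fmt.Println(policies, err)
}
```

With the values from the job file above, the first policy would expand to `aws-test-r53`, which is in the role's allowed policies shown in the server log. Note that `os.Expand` also expands unbraced `$NAME` references, which real interpolation code may not want.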

The structs.Node struct is used to build the node attributes, currently in two different places, one for the task execution here (running on the node): https://github.com/hashicorp/nomad/blob/8a427a470a5779e756ebb144b6970bd41f252311/client/taskenv/env.go#L798-L816 And again here for scheduling (running on the server): https://github.com/hashicorp/nomad/blob/8a427a470a5779e756ebb144b6970bd41f252311/scheduler/feasible.go#L748-L781

Neither of which seems overly reusable for this as-is (noting that the server is what requests the token and determines the policies to include, not the node).

There's also the case of the environment variables, some of which would pose a chicken-and-egg situation (the VAULT_TOKEN environment variable comes to mind, as would any set via templates). In theory, though, we should be able to support things like NOMAD_*_NAME, NOMAD_*_ID, NOMAD_META_*, etc., as they are known pretty early on. Again, there doesn't seem to be a good place to access these from the server, as they're all tightly integrated in env.go, much like the node attributes.

Would it make sense to refactor the code to pull out the bits that build the node attributes and the easily derived NOMAD_* environment variables into a separate package? It would take in e.g. a structs.Node and return the node attributes map (this could either be a member func of the Node struct or live in a separate package), and do similarly for deriving the NOMAD_* variables given a structs.Allocation and the task name.
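Roughly what I have in mind, as a sketch only: `Node` and `Allocation` below are simplified, hypothetical stand-ins for structs.Node and structs.Allocation (the real types have many more fields and store task meta differently), and the function names `NodeVars` and `TaskVars` are made up for illustration. The variable names themselves (`node.unique.id`, `attr.*`, `meta.*`, `NOMAD_ALLOC_ID`, `NOMAD_META_*` and so on) follow Nomad's documented interpolation names.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, cut-down stand-in for structs.Node.
type Node struct {
	ID         string
	Name       string
	Datacenter string
	NodeClass  string
	Attributes map[string]string
	Meta       map[string]string
}

// Allocation is a hypothetical, cut-down stand-in for structs.Allocation.
// TaskMeta (task name -> meta) is an assumption made for this sketch.
type Allocation struct {
	ID       string
	Name     string
	JobID    string
	TaskMeta map[string]map[string]string
}

// NodeVars returns the node-derived interpolation variables.
func NodeVars(n *Node) map[string]string {
	vars := map[string]string{
		"node.unique.id":   n.ID,
		"node.unique.name": n.Name,
		"node.datacenter":  n.Datacenter,
		"node.class":       n.NodeClass,
	}
	for k, v := range n.Attributes {
		vars["attr."+k] = v
	}
	for k, v := range n.Meta {
		vars["meta."+k] = v
	}
	return vars
}

// TaskVars returns only the NOMAD_* variables that are known before the
// task starts. VAULT_TOKEN and template-derived variables are deliberately
// left out because they depend on the token being requested.
func TaskVars(a *Allocation, task string) map[string]string {
	vars := map[string]string{
		"NOMAD_ALLOC_ID":   a.ID,
		"NOMAD_ALLOC_NAME": a.Name,
		"NOMAD_JOB_ID":     a.JobID,
		"NOMAD_TASK_NAME":  task,
	}
	for k, v := range a.TaskMeta[task] {
		vars["NOMAD_META_"+strings.ToUpper(k)] = v
	}
	return vars
}

func main() {
	// IDs and meta come from this issue; "dc1" and "rack" are illustrative.
	node := &Node{ID: "3d4e49a1", Datacenter: "dc1", Meta: map[string]string{"rack": "r1"}}
	alloc := &Allocation{
		ID:       "71faeda1",
		Name:     "r53-backup.backup-script[0]",
		JobID:    "r53-backup",
		TaskMeta: map[string]map[string]string{"r53-backup": {"account": "test"}},
	}
	fmt.Println(NodeVars(node)["meta.rack"])
	fmt.Println(TaskVars(alloc, "r53-backup")["NOMAD_META_ACCOUNT"])
}
```

Both functions are pure and take only the struct they need, so in principle either the client or the server could call them; whether they belong on the structs themselves or in a separate package is the open question.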

It seems fairly straightforward, but due to the amount of refactoring required I'd rather seek input on where the appropriate place would be to move those funcs to before doing up a PR and potentially having to refactor it again.

tgross commented 1 year ago

I know this is a very old issue but leaving a note to say that https://github.com/hashicorp/nomad/issues/15617 will deprecate (and remove in 1.9) the workflow that sends the Vault request via the server, and that'll potentially allow us to implement changes like this one finally.