hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.92k stars 1.95k forks source link

return validation error (400) and not (500) on invalid UUIDs #18433

Closed EugenKon closed 1 year ago

EugenKon commented 1 year ago

Nomad version

$ nomad version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

$ uname -a
Linux ip-172-31-12-145 6.2.0-1010-aws #10~22.04.1-Ubuntu SMP Wed Aug 16 18:00:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

Error querying allocation: Unexpected response code: 500 (rpc error: alloc lookup failed: index error: Invalid UUID: encoding/hex: invalid byte: U+0073 's')

Reproduction steps

$ nomad job plan backup.hcl
+ Job: "backup"
+ Task Group: "db-backup" (1 create)
  + Task: "postgres-backup" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.
- If submitted now, next periodic launch would be at 2023-09-08T14:02:00-04:00 (36s from now).

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 backup.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
$ nomad job run -check-index 0 backup.hcl
Job registration successful
Approximate next launch time: 2023-09-08T14:02:00-04:00 (30s from now)

Expected Result

No error message

Actual Result

index error: Invalid UUID: encoding/hex: invalid byte: U+0073 's') I can not figure out from where it comes? Please note, I have specified service, but I do not have alloc=xxxx tag

Job file (if appropriate)

job "backup" {
  datacenters = ["dc1"]
  type = "batch"

  periodic {
    enabled   = true
    cron      = "*/1 * * * * *"
    time_zone = "America/Toronto"
    prohibit_overlap = true
  }

  # https://awesomeprogrammer.com/blog/2022/06/05/how-to-backup-postgres-database-with-nomad/
  group "db-backup" {
    task "postgres-backup" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["local/script.sh"]
      }

      template {
        destination = "local/script.sh"
        data        = <<-EOH
          set -e

          nomad alloc exec -task postgres-task $DB_ALLOC_ID bash -c 'is_ready 120; backup'
        EOH
      }

      template {
        env         = true
        destination = "secrets/file.env"
        data = <<-EOH
          # here you also might want to set NOMAD_TOKEN env
          # if you're using ACL capabilities

          # As service 'postgres-node' is registered in Consul, we want to grab its 'alloc' tag
          {{- range $tag, $services := service "postgres-node" | byTag -}}
            {{if $tag | contains "alloc"}}
              {{$allocId := index ($tag | split "=") 1}}
              DB_ALLOC_ID="{{ $allocId }}"
            {{end}}
          {{end}}
        EOH
      }
    }
  }

}

Nomad Server logs (if appropriate)

No logs available via web interface: image

Sep 08 19:02:02 ip-172-31-6-178 nomad[929]: http: request complete: method=GET path=/v1/agent/health?type=server duration=2.866022ms
Sep 08 19:02:11 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:11.588Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:35778
Sep 08 19:02:11 ip-172-31-6-178 nomad[929]: nomad: memberlist: Stream connection from=127.0.0.1:35778
Sep 08 19:02:12 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:12.697Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="419
.556µs"
Sep 08 19:02:12 ip-172-31-6-178 nomad[929]: http: request complete: method=GET path=/v1/agent/health?type=server duration="419.556µs"
Sep 08 19:02:21 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:21.591Z [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:42134
Sep 08 19:02:21 ip-172-31-6-178 nomad[929]: nomad: memberlist: Stream connection from=127.0.0.1:42134
Sep 08 19:02:22 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:22.700Z [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration="393
.391µs"
Sep 08 19:02:22 ip-172-31-6-178 nomad[929]: http: request complete: method=GET path=/v1/agent/health?type=server duration="393.391µs"
Sep 08 19:02:24 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:24.011Z [ERROR] http: request failed: method=GET path=/v1/allocations?prefix=ax error="alloc looku
p failed: index error: Invalid UUID: encoding/hex: invalid byte: U+0078 'x'" code=500
Sep 08 19:02:24 ip-172-31-6-178 nomad[929]: http: request failed: method=GET path=/v1/allocations?prefix=ax error="alloc lookup failed: index error: Invalid UUID:
encoding/hex: invalid byte: U+0078 'x'" code=500
Sep 08 19:02:24 ip-172-31-6-178 nomad[929]:     2023-09-08T19:02:24.012Z [DEBUG] http: request complete: method=GET path=/v1/allocations?prefix=ax duration=1.468604ms
Sep 08 19:02:24 ip-172-31-6-178 nomad[929]: http: request complete: method=GET path=/v1/allocations?prefix=ax duration=1.468604ms

Nomad Client logs (if appropriate)

tgross commented 1 year ago

Hi @EugenKon! Your reproduction doesn't say where you're seeing the error, and the reproduction steps all look successful. Is this in response to a specific Nomad command after you've done that?

EugenKon commented 1 year ago

@tgross I got that error here: image

EugenKon commented 1 year ago

Also it would be nice to see:

processing template destination "secrets/file.env"
Success
processing template destination "local/script.sh"
Success
Running         command = "/bin/bash"   args    = ["local/script.sh"]
...

Or something similar. This will allow easy debugging.

EugenKon commented 1 year ago

I noticed same error when do on my local host:

nomad alloc exec -task postgres-task fake_ID ls -la /backup/
Error querying allocation: Unexpected response code: 500 (alloc lookup failed: index error: Invalid UUID: encoding/hex: invalid byte: U+006B 'k')

So error message belongs to nomad alloc exec -task postgres-task $DB_ALLOC_ID bash -c 'is_ready 120; backup' part.

EugenKon commented 1 year ago

Weird. Error message depends on which ID I provide:

$ nomad alloc exec -task postgres-task qw ls -la /backup/
Error querying allocation: Unexpected response code: 500 (alloc lookup failed: index error: Invalid UUID: encoding/hex: invalid byte: U+0071 'q')
$ nomad alloc exec -task postgres-task ew ls -la /backup/
Error querying allocation: Unexpected response code: 500 (alloc lookup failed: index error: Invalid UUID: encoding/hex: invalid byte: U+0077 'w')

And a couple of meaningful responses

$ nomad alloc exec -task postgres-task ab ls -la /backup/
No allocation(s) with prefix or id "ab" found
$ nomad alloc exec -task postgres-task ac ls -la /backup/
No allocation(s) with prefix or id "ac" found
EugenKon commented 1 year ago

More changes to ID lead me to next different error:

$ nomad alloc exec -task postgres-task ae ls -la /backup/
Could not find task named: postgres-task, found:
postgres-backup

image

EugenKon commented 1 year ago

weird =) I specified the task, but got error. Only when I remove -job flag it starts to work. But this is a different issue I think.

image

EugenKon commented 1 year ago

UPD. I added nomad server logs to the first message. journalctl -xeu nomad

tgross commented 1 year ago

Ok this log line tells us why you're seeing the error in the command line.

http: request failed: method=GET path=/v1/allocations?prefix=ax

The nomad alloc exec command docs show this usage:

nomad alloc exec [options] [...]

When you do a command like: nomad alloc exec -task postgres-task qw ls -la /backup/ you're asking for an allocation ID that starts with qw. Allocation IDs are always UUIDs, so you're giving the command something that will never succeed in looking up correctly.

But your initial report was about the UI, which should of course be sending a real allocation ID here to this API. Unfortunately I can't seem to reproduce that behavior here. Is it possible for you to grab the actual request being made in the browser developer tools?

EugenKon commented 1 year ago

Is it possible for you to grab the actual request being made in the browser developer tools?

How a browser belongs to nomad alloc exec command? I am not sure what do you mean.

tgross commented 1 year ago

Oh I'm sorry, I was definitely misunderstanding what you were seeing in the browser. The browser is showing you the logs from your allocation, which in turn is calling nomad alloc exec.

So the problem is in the query you're making to populate the DB_ALLOC_ID environment variable:

      template {
        env         = true
        destination = "secrets/file.env"
        data = <<-EOH
          # here you also might want to set NOMAD_TOKEN env
          # if you're using ACL capabilities

          # As service 'postgres-node' is registered in Consul, we want to grab its 'alloc' tag
          {{- range $tag, $services := service "postgres-node" | byTag -}}
            {{if $tag | contains "alloc"}}
              {{$allocId := index ($tag | split "=") 1}}
              DB_ALLOC_ID="{{ $allocId }}"
            {{end}}
          {{end}}
        EOH
      }

Are all the tags actually allocation UUIDs?

EugenKon commented 1 year ago

Sorry, probably I made poor explanation. I know, that a problem with alloc I did not set it up by intention. As I have found later, the problem is with nomad exec alloc. I do not expect 'Invalid UUID: encoding/hex: invalid byte:' error here at all.

Investigating into this problem I also found that some service, to which nomad exec alloc does a query failed with '500 error' or with Could not find task named: postgres-task, found: postgres-backup

The only expected message here is: No allocation(s) with prefix or id "ab" found where ab is ID of allocation. For other my queries with broken/not exist fake_ID, ae IDs, the service should not die and must return correct error messages: No allocation(s) with prefix or id "fake_ID" found

So for now the correct issue message should be: Service should not die with '500 internal service error' message when broken search string is passed. And it would be nice if nomad will return more debugging info to simplify debugging when such error messages occur.

tgross commented 1 year ago

So for now the correct issue message should be: Service should not die with '500 internal service error' message when broken search string is passed. And it would be nice if nomad will return more debugging info to simplify debugging when such error messages occur.

It's a validation error, and we should be returning 400 on that.

tgross commented 1 year ago

Dupe of https://github.com/hashicorp/nomad/issues/10476