hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Event stream job payload returns wrong type #11367

Open djenriquez opened 2 years ago

djenriquez commented 2 years ago

Nomad version

$ nomad version
Nomad v1.1.5 (117a23d2cdf26a1837b21c84f30c8c0f3441e927)

Operating system and Environment details

Amazon Linux 2

Issue

Using the Go SDK, when processing a Job topic event, we often run into an issue where the SDK fails to decode the Job event:

failed to create job events: failed to read job: 1 error(s) decoding:

* 'Job.Payload': source data must be an array or slice, got string

Looking at the payload, the event returns the following:

"Payload": "AA==",

The full event otherwise has a good structure, the payload is just off.

Go mod dep set at github.com/hashicorp/nomad/api v0.0.0-20210920221949-6d126ac53cbc

Reproduction steps

Use the SDK to subscribe to the event stream with a Job topic filter and process events returned from the stream.

Expected Result

A proper Payload returned instead of a string. If there is an issue with the event, an error should be returned with the event (api.Event.Err).

Actual Result

A string Payload is returned with no errors reported.
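As a possible stopgap on the consumer side, the job can be decoded by round-tripping the generic payload through JSON, because Go's encoding/json treats a base64 string as the wire form of a []byte field. This is only a sketch: jobView and decodeJob are hypothetical names, not part of the Nomad SDK, and it assumes the event payload arrives as a map[string]interface{} with a "Job" key, as in the error above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobView is a hypothetical, trimmed-down view of a job. It is NOT the
// SDK's job type; it only carries the fields this sketch needs.
type jobView struct {
	ID      string
	Payload []byte
}

// decodeJob (hypothetical helper) re-encodes the generic "Job" value as
// JSON and unmarshals it into jobView. encoding/json base64-decodes a
// JSON string when the destination field is a []byte.
func decodeJob(payload map[string]interface{}) (*jobView, error) {
	raw, ok := payload["Job"]
	if !ok {
		return nil, fmt.Errorf("payload has no Job key")
	}
	buf, err := json.Marshal(raw)
	if err != nil {
		return nil, fmt.Errorf("re-encoding job: %w", err)
	}
	var j jobView
	if err := json.Unmarshal(buf, &j); err != nil {
		return nil, fmt.Errorf("decoding job: %w", err)
	}
	return &j, nil
}

func main() {
	// Shape assumed from the event reported in this issue.
	payload := map[string]interface{}{
		"Job": map[string]interface{}{
			"ID":      "parameterized-job-name/dispatch-1634850017-930d248c",
			"Payload": "AA==",
		},
	}
	j, err := decodeJob(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(j.ID, j.Payload)
}
```

The extra marshal/unmarshal pass costs some CPU per event, so this is a workaround for consumers rather than a fix for the underlying decode path.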

DerekStrickland commented 2 years ago

Hi @djenriquez ,

Thanks for using Nomad. I'm sorry to hear you are having an issue. I'm curious, is it the same job(s) that throw the error each time, or is there no pattern really? If it's the same job(s) could you post the jobspec?

kainoaseto commented 2 years ago

Hey @DerekStrickland, I work on the same team as DJ and we're working through this issue together. It looks like this happens on dispatched parameterized batch jobs specifically. Unfortunately we don't have a job spec to reproduce it exactly, since we generate these, but I can try to strip out the IP from it and get something set up. For context, the event that is returned (with what seem to be utf-16 encoding issues) looks like this:

    "Topic": "Job",
    "Type": "AllocationUpdated",
    "Key": "parameterized-batch-job-name/dispatch-1634850017-930d248c",
    "FilterKeys": null,
    "Index": 12201153,
    "Payload": {
        "Job": {
            "Affinities": null,
            "AllAtOnce": false,
            "Constraints": null,
            "ConsulNamespace": "",
            "ConsulToken": "",
            "CreateIndex": 12200900,
            "Datacenters": ["dc1"],
            "DispatchIdempotencyToken": "",
            "Dispatched": true,
            "ID": "parameterized-job-name/dispatch-1634850017-930d248c",
            "JobModifyIndex": 12200900,
            "Meta": {
                "VAR1": "VAL1",
                "VAR2": "",
                "VAR3": "VAL3"
            },
            "ModifyIndex": 12201153,
            "Multiregion": null,
            "Name": "parameterized-batch-job-name/dispatch-1634850017-930d248c",
            "Namespace": "default",
            "NomadTokenID": "",
            "ParameterizedJob": {
                "MetaOptional": ["opt1", "opt2", "opt3"],
                "MetaRequired": ["req1", "req2", "req3"],
                "Payload": "forbidden"
            },
            "ParentID": "parameterized-batch-job-name",
            "Payload": "AA==",
            "Periodic": null,
            "Priority": 50,
            "Region": "us-east-1",
            "Spreads": null,
            "Stable": false,
            "Status": "dead",
            "StatusDescription": "",
            "Stop": false,
            "SubmitTime": 1634850017353593000,
            "TaskGroups": [{
                "Affinities": null,
                "Constraints": [{
                    "LTarget": "${node.class}",
                    "Operand": "=",
                    "RTarget": "nomad-client-cluster"
                }, {
                    "LTarget": "${attr.vault.version}",
                    "Operand": "semver",
                    "RTarget": "\u003e= 0.6.1"
                }, {

kainoaseto commented 2 years ago

It looks like there are some unicode encoding errors, and I wonder if that's the source of the problem here, where the Payload is null (or the empty set is being encoded oddly).
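One thing worth checking before blaming encoding: the `\u003e` in the event above is what Go's standard encoding/json emits for `>` by default, since it escapes `<`, `>` and `&` for HTML safety, and it decodes back to the same character. The sketch below shows that behaviour in isolation. The constraint type is illustrative only, and whether the event stream is serialized with encoding/json is an assumption, not something confirmed in this thread.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// constraint is an illustrative type holding only the fields visible in
// the event above; it is not the Nomad struct.
type constraint struct {
	LTarget string
	Operand string
	RTarget string
}

// marshalDefault uses json.Marshal, which escapes <, > and & by default.
func marshalDefault(c constraint) string {
	out, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return string(out)
}

// marshalNoEscape uses an Encoder with HTML escaping turned off.
func marshalNoEscape(c constraint) string {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(c); err != nil {
		panic(err)
	}
	// Encode appends a trailing newline; trim it for comparison.
	return strings.TrimSpace(buf.String())
}

// roundTrip decodes a JSON document back into a constraint.
func roundTrip(s string) constraint {
	var c constraint
	if err := json.Unmarshal([]byte(s), &c); err != nil {
		panic(err)
	}
	return c
}

func main() {
	c := constraint{LTarget: "${attr.vault.version}", Operand: "semver", RTarget: ">= 0.6.1"}
	escaped := marshalDefault(c)
	fmt.Println(escaped)                 // RTarget is written as "\u003e= 0.6.1"
	fmt.Println(marshalNoEscape(c))      // RTarget is written as ">= 0.6.1"
	fmt.Println(roundTrip(escaped) == c) // true
}
```

If that assumption holds, the escaped `>` is a harmless difference in representation and is unrelated to the Payload error.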

DerekStrickland commented 2 years ago

That's exactly the path I am heading down. As best as I can tell, there is only one line that formats a message that way, and it has to do with parsing jobspecs as you might imagine. Also, the AA== string really looks like an encoding issue. Glad we're on the same path.

Here is the code in question:

func parseFile(path string) (*hcl.File, hcl.Diagnostics) {
    body, err := ioutil.ReadFile(path)
    if err != nil {
        return nil, hcl.Diagnostics{
            &hcl.Diagnostic{
                Severity: hcl.DiagError,
                Summary:  "Failed to read file",
                Detail:   fmt.Sprintf("failed to read %q: %v", path, err),
            },
        }
    }

    return parseHCLOrJSON(body, path)
}

Interestingly, that error seems to be thrown by ioutil, which does not seem to support utf-16. Assuming I'm right, and this is the only line that formats an error that way, it will take a code change to detect and handle utf-16. Are you confident your file is utf-16 encoded?

kainoaseto commented 2 years ago

Ah ok, that seems very likely to be the issue then. To be honest, this looks like some mismatch internal to Nomad. I don't think it's file related specifically, but I could be overlooking something. This error happens whenever the job already submitted to Nomad is being executed. It appears to be a periodic job, and in the job definition shown in the Nomad UI, the areas above that look like an encoding issue appear properly:

Nomad Job definition in UI:

{
    "LTarget": "${attr.vault.version}",
    "RTarget": ">= 0.6.1",
    "Operand": "semver"
},

Nomad event with job definition from event stream:

{
    "LTarget": "${attr.vault.version}",
    "Operand": "semver",
    "RTarget": "\u003e= 0.6.1"
}, {

The top level Payload key in the job definition is filled out like so:

  "Dispatched": false,
  "DispatchIdempotencyToken": "",
  "Payload": null,
  "Meta": null,
  "ConsulToken": "",
  "ConsulNamespace": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "running",
  "StatusDescription": "",
  "Stable": false,
  "Version": 4,
  "SubmitTime": 1627671812651961900,
  "CreateIndex": 1776621,
  "ModifyIndex": 10040553,
  "JobModifyIndex": 10040553

Here the Payload is null, and I wonder if some odd encoding/decoding issue causes this null to turn into AA==.

kainoaseto commented 2 years ago

I'm not sure whether that line above is called during a new periodic job dispatch, but I wouldn't be totally surprised, since each periodic job dispatch creates a new job (so I could see that being implemented by pulling the job file from memory or from disk, parsing it, and resubmitting it with updated fields).

DerekStrickland commented 2 years ago

Were you able to confirm that your file is in fact utf-16?

danlsgiga commented 2 years ago

Same thing happening for me on Nomad 1.2.6 and SDK on v0.0.0-20220422170747-b1ce39297285.

In my case, whenever a parameterized job is submitted with the following block, or with no payload included, the SDK returns the error mentioned in this issue.

parameterized {
  payload = "forbidden"
}

* 'Job.Payload': source data must be an array or slice, got string

And when looking at the map structure returned, I see Payload:AA== as well. AA== is the base64 encoding of a single zero byte, which I think translates to null (?) in Go, whereas the SDK is expecting to see an empty slice.

lgfa29 commented 2 years ago

Thanks for the extra info @danlsgiga, that's super helpful.