hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.57k stars 1.92k forks source link

Panic on nomad startup after upgrading to 1.8.1 #23385

Closed tomqwpl closed 1 day ago

tomqwpl commented 2 weeks ago

Nomad version

Output from nomad version Nomad v1.8.1 BuildDate 2024-06-19T06:43:57Z Revision 5022543e4b7b8dcec9df123f86630ae3fdcffbe6

Operating system and Environment details

Windows 11

Issue

Nomad panics on startup having upgraded to 1.8.1, where it didn't before. Previously we were using 1.7.5.

Reproduction steps

Right now I haven't reduced this to a minimal configuration. Our configuration is reasonably complicated. I'm hoping that the stack trace might point things in the right direction.

Expected Result

No panic.

Actual Result

Panic:

goroutine 1287 [running]:
github.com/hashicorp/nomad/client/lib/idset.(*Set[...]).Slice(0xc001083d50?)
    github.com/hashicorp/nomad/client/lib/idset/idset.go:126 +0x1a
github.com/hashicorp/nomad/plugins/base.nomadTopologyToProto(0xc000a60120)
    github.com/hashicorp/nomad/plugins/base/base.go:171 +0x45
github.com/hashicorp/nomad/plugins/base.(*AgentConfig).toProto(0xc000e9c2e0)
    github.com/hashicorp/nomad/plugins/base/base.go:96 +0x47
github.com/hashicorp/nomad/plugins/base.(*BasePluginClient).SetConfig(0xc000e7ae60, 0xc000fe8ff0)
    github.com/hashicorp/nomad/plugins/base/client.go:63 +0x28
github.com/hashicorp/nomad/helper/pluginutils/loader.(*PluginLoader).Dispense(0xc0011083f0, {0xc001110258, 0x12}, {0x34cc4ff, 0x6}, 0xc000e9c2e0, {0x3dbc610, 0xc001060210})
    github.com/hashicorp/nomad/helper/pluginutils/loader/loader.go:186 +0x3d5
github.com/hashicorp/nomad/helper/pluginutils/singleton.(*SingletonLoader).dispense(0x0?, 0xc000cb7080, {0xc001110258?, 0x0?}, {0x34cc4ff?, 0x0?}, 0x0?, {0x3dbc610?, 0xc001060210?})
    github.com/hashicorp/nomad/helper/pluginutils/singleton/singleton.go:109 +0x53
created by github.com/hashicorp/nomad/helper/pluginutils/singleton.(*SingletonLoader).getPlugin in goroutine 272
    github.com/hashicorp/nomad/helper/pluginutils/singleton/singleton.go:85 +0x39d

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

pkazmierczak commented 2 weeks ago

Hi @tomqwpl, thanks for reporting this. Could you at least share the plugins config for your agents?

tomqwpl commented 2 weeks ago

Nomad run as server and client combined, so like "dev" mode.

Partial config:

client {
  enabled = true
  template {
    disable_file_sandbox = true
  }
}

server {
  enabled          = true
  bootstrap_expect = 1
  }

plugin "raw_exec" {
  config {
    enabled = true
  }
}

plugin_dir = "our-custom-plugin-dir"

plugin "our-custom-plugin-name" {
    config {
       our-cusotm-plugin-settings
    }
}

Bunch of other stuff that may or may not be relevant but which I don't really want to post here.

tomqwpl commented 2 weeks ago

Minimal config:

data_dir = "somewhere"

client {
  enabled = true
}

plugin_dir = "our-nomad-nomad-dir"

plugin "our-nomad-plugin" {
}

Running nomad with nomad agent -config xxx.hcl

Unclear whether the plugin itself matters, but that's not something I can share.

Anything changed in the requirements of a plugin in 1.8? I didn't see anything in the release notes.

pkazmierczak commented 2 weeks ago

Anything changed in the requirements of a plugin in 1.8?

Can't think of anything but I'll look into it. Do you by any chance use the same config for any OS other than Windows? If yes, do you also have issues there, or is it limited to Windows?

tomqwpl commented 2 weeks ago

Same config used on Linux, but I haven't verified whether the same issue occurs on Linux yet.

tomqwpl commented 2 weeks ago

Doesn't appear to be an issue on 1.8.0. So appears to be new in 1.8.1.

I'm attempting to try out on Linux.

tomqwpl commented 2 weeks ago

Same config appears to work OK on Linux, so this appears to be a Windows only issue

tomqwpl commented 2 weeks ago

I can catch it in the debugger, but I don't know enough about nomad to know what here would be wrong:

func (s *Set[T]) Slice() []T 

s is nil

func (st *Topology) GetNodes() *idset.Set[hw.NodeID] {

returns nil because SupportsNUMA returns false (platform is Windows)

That appears to be the surface level error.

func (s *SingletonLoader) dispense(f *future, name, pluginType string, config *base.AgentConfig, logger log.Logger) {

the plugin name here is my custom plugin.

If I run the code at 1.8.0, then in nomadTopologyToProto:

NodeIds:                helper.ConvertSlice(top.NodeIDs.Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),

here top.NodeIDs isn't nil, so this is fine.

The same line in 1.8.1 is:

    NodeIds:                helper.ConvertSlice(top.GetNodes().Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),

The GetNodes method returns nil because of SupportsNuma being false. So this issue appears to have been caused by commit 2eda8d1889fefa35bbb7e30c7ede6a57268f06b4?

pkazmierczak commented 2 weeks ago

Wow, thanks for the investigation @tomqwpl, I'll get right on it.

pkazmierczak commented 2 weeks ago

hey @tomqwpl, I'm really struggling to reproduce the issue. No matter what I do I can't get my Nomad dev build (or the official Nomad 1.8.1) to panic. I think I found some buggy code path that I fixed here: https://github.com/hashicorp/nomad/pull/23399 but again—without knowing your plugin code it's very hard for me to test. Do you mind trying that branch on your plugin configuration? If you could share whether that affects your panic that'd be a good data point for me. Sorry for getting back to you so late.

tomqwpl commented 2 weeks ago

Hmm. Interesting. I didn't imagine it would be plugin code specific. It'll take me a couple of days to get to that, so you might have to bear with me. I might strip down a plugin and see if it makes a difference what plugin I use.

tomqwpl commented 5 days ago

@pkazmierczak Yes, the lines of code you removed are the cause of the nil. So with that version I no longer get a panic.

pkazmierczak commented 5 days ago

@tomqwpl good stuff! I'll add some tests to cover that code path on windows, and we'll try to release this in 1.8.2