Closed tomqwpl closed 1 day ago
Hi @tomqwpl, thanks for reporting this. Could you at least share the plugins config for your agents?
Nomad run as server and client combined, so like "dev" mode.
Partial config:
client {
enabled = true
template {
disable_file_sandbox = true
}
}
server {
enabled = true
bootstrap_expect = 1
}
plugin "raw_exec" {
config {
enabled = true
}
}
plugin_dir = "our-custom-plugin-dir"
plugin "our-custom-plugin-name" {
config {
our-cusotm-plugin-settings
}
}
Bunch of other stuff that may or may not be relevant but which I don't really want to post here.
Minimal config:
data_dir = "somewhere"
client {
enabled = true
}
plugin_dir = "our-nomad-nomad-dir"
plugin "our-nomad-plugin" {
}
Running nomad with nomad agent -config xxx.hcl
Unclear whether the plugin itself matters, but that's not something I can share.
Anything changed in the requirements of a plugin in 1.8? I didn't see anything in the release notes.
Anything changed in the requirements of a plugin in 1.8?
Can't think of anything but I'll look into it. Do you by any chance use the same config for any OS other than Windows? If yes, do you also have issues there, or is it limited to Windows?
Same config used on Linux, but I haven't verified whether the same issue occurs on Linux yet.
Doesn't appear to be an issue on 1.8.0. So appears to be new in 1.8.1.
I'm attempting to try out on Linux.
Same config appears to work OK on Linux, so this appears to be a Windows only issue
I can catch it in the debugger, but I don't know enough about nomad to know what here would be wrong:
func (s *Set[T]) Slice() []T
s is nil
func (st *Topology) GetNodes() *idset.Set[hw.NodeID] {
returns nil because SupportsNUMA returns false (platform is Windows)
That appears to be the surface level error.
func (s *SingletonLoader) dispense(f *future, name, pluginType string, config *base.AgentConfig, logger log.Logger) {
the plugin name here is my custom plugin.
If I run the code at 1.8.0, then in nomadTopologyToProto:
NodeIds: helper.ConvertSlice(top.NodeIDs.Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),
here top.NodeIDs isn't nil, so this is fine.
The same line in 1.8.1 is:
NodeIds: helper.ConvertSlice(top.GetNodes().Slice(), func(id hw.NodeID) uint32 { return uint32(id) }),
The GetNodes method returns nil because of SupportsNuma being false. So this issue appears to have been caused by commit 2eda8d1889fefa35bbb7e30c7ede6a57268f06b4?
Wow, thanks for the investigation @tomqwpl, I'll get right on it.
hey @tomqwpl, I'm really struggling to reproduce the issue. No matter what I do I can't get my Nomad dev build (or the official Nomad 1.8.1) to panic. I think I found some buggy code path that I fixed here: https://github.com/hashicorp/nomad/pull/23399 but again—without knowing your plugin code it's very hard for me to test. Do you mind trying that branch on your plugin configuration? If you could share whether that affects your panic that'd be a good data point for me. Sorry for getting back to you so late.
Hmm. Interesting. I didn't imagine it would be plugin code specific. It'll take me a couple of days to get to that, so you might have to bear with me. I might strip down a plugin and see if it makes a difference what plugin I use.
@pkazmierczak Yes, the lines of code you removed are the cause of the nil. So with that version I no longer get a panic.
@tomqwpl good stuff! I'll add some tests to cover that code path on windows, and we'll try to release this in 1.8.2
Nomad version
Output from
nomad version
Nomad v1.8.1 BuildDate 2024-06-19T06:43:57Z Revision 5022543e4b7b8dcec9df123f86630ae3fdcffbe6Operating system and Environment details
Windows 11
Issue
Nomad panics on startup having upgraded to 1.8.1, where it didn't before. Previously we were using 1.7.5.
Reproduction steps
Right now I haven't reduced this to a minimal configuration. Our configuration is reasonably complicated. I'm hoping that the stack trace might point things in the right direction.
Expected Result
No panic.
Actual Result
Panic:
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)