jdoss closed this issue 1 year ago.
Hi @jdoss, where did you get the zip archive from? I just checked the one from releases and it looks fine:
➜ sha256sum nomad-driver-podman_0.4.2_linux_amd64.zip
bdf7c9f70c79d3d3055e73fdc6212a9bfc221ed824451be2d07b2c62ce4267c4 nomad-driver-podman_0.4.2_linux_amd64.zip
➜ unzip ./nomad-driver-podman_0.4.2_linux_amd64.zip
Archive: ./nomad-driver-podman_0.4.2_linux_amd64.zip
inflating: nomad-driver-podman
➜ file ./nomad-driver-podman
./nomad-driver-podman: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=tA-nyhXEkFEKjKMCrbWV/9xjUPy06QVQ82b4TatHu/Ra2CBcZm5LKHkBrSQw8R/30CfM25FK3BlHy4UB_3X, with debug_info, not stripped
➜ ./nomad-driver-podman
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically
This is what I have in my automation:
ARG NOMAD_PODMAN_VERSION=0.4.2
# RUN curl -sL https://releases.hashicorp.com/nomad-driver-podman/${NOMAD_PODMAN_VERSION}/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
# -o /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip && unzip /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
# -d /etc/nomad/plugins
$ curl -sL https://releases.hashicorp.com/nomad-driver-podman/${NOMAD_PODMAN_VERSION}/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
> -o /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip && unzip /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip
Archive: /tmp/nomad-driver-podman_0.4.2_linux_amd64.zip
inflating: nomad-driver-podman
$ file nomad-driver-podman
nomad-driver-podman: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=tA-nyhXEkFEKjKMCrbWV/9xjUPy06QVQ82b4TatHu/Ra2CBcZm5LKHkBrSQw8R/30CfM25FK3BlHy4UB_3X, with debug_info, not stripped
$ sha256sum nomad-driver-podman_0.4.2_linux_amd64.zip
bdf7c9f70c79d3d3055e73fdc6212a9bfc221ed824451be2d07b2c62ce4267c4 nomad-driver-podman_0.4.2_linux_amd64.zip
Uggg. I feel like I am taking crazy pills now.
What's even weirder is as far as I can tell from the spec EM_X86_64 is AMD64 in the ELF header. Probably worth verifying the checksum of the downloads though.
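For what it's worth, the EM_X86_64 value in that fingerprint message is just the ELF header's e_machine field, so it can be checked by hand; a minimal sketch (the plugin path is an example):

```shell
# e_machine is a 16-bit little-endian field at byte offset 18 of an ELF file;
# the value 62 (0x3E) is EM_X86_64, i.e. exactly amd64.
elf_machine() {
    od -An -t u2 -j 18 -N 2 "$1" | tr -d ' '
}

if [ -e /etc/nomad/plugins/nomad-driver-podman ]; then
    elf_machine /etc/nomad/plugins/nomad-driver-podman   # prints 62 for an amd64 build
fi
```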
Oh, the arch message is probably a red herring. Maybe the plugin is failing to start up because it doesn't like its configuration for some reason. @jdoss, can you post your client plugin config for the podman driver?
@shoenig It is a pretty standard config.
plugin "nomad-driver-podman" {
  config {
    socket_path = "unix://var/run/podman/podman.sock"
    volumes {
      enabled      = true
      selinuxlabel = "z"
    }
  }
}
I will switch back to the download from releases.hashicorp.com this afternoon and verify the checksum.
Welp, I can't reproduce this anymore when switching back to releases.hashicorp.com. But now https://github.com/hashicorp/nomad-driver-podman/issues/228 is happening.
Hello there, I have two scenarios around my infrastructure: in the former I can reproduce this issue, and in the latter I cannot:
The Nomad log shows an "Unrecognized remote plugin message" error and the same trace that @jdoss reported in the first message.
In this case, the following components are installed on the system:
The Nomad log shows no alerts after stopping Nomad, replacing the 0.4.1 plugin file with the 0.4.2 binary, and starting Nomad again (note: the same procedure was performed for the above scenario).
The installed components in this system are:
I was also able to reproduce this issue in the first scenario (and still unable to do so in the second) even after a complete reboot of the checked systems.
Do you have any suggestions on other components to check, or other tests to perform to nail down the cause?
Thanks in advance for your time.
Following up on my previous message: I have updated the Podman version to 4.4.4 even on the system on which the issue is not reproducible, and it still isn't. So, I started comparing Nomad's configuration between the systems that show the issue and the one that doesn't, and then compared the Podman configurations: in every case, the systems are configured the same regarding Nomad, Podman and CGroups/CNI/overlayfs. The only other notable difference between these systems, which I did not think about yesterday, is that the one on which I cannot reproduce the issue is running CentOS Stream 9, while the others have Fedora IoT installed.
To support the assumption that something specific to Fedora IoT is causing the plugin to fail to load, I set up two x86_64 VMs, one with Fedora Workstation 37 and another with Fedora IoT 37.20230406.0; then I installed the same Consul 1.15.1, Nomad 1.5.1 and Podman 4.4.4 on both, and copied over the same configuration files for Consul, Nomad and Podman.
I can reproduce the same behaviour reported by @jdoss on the Fedora IoT VM when loading the nomad-driver-podman 0.4.2 plugin, but not when loading the 0.4.1 one; on the Fedora Workstation VM, both versions of the plugin load without issues.
Do you have any suggestions on what to look for? In addition, I am using CGroups v2 and crun as the OCI runtime.
Through a binary search and subsequent testing among the commits between tags v0.4.1 and v0.4.2, I can reproduce the issue consistently starting from commit e0cadd19c7c46e2fc9aad88c6832b317214b3d5d. In contrast, from commit 736560e3e459144bcc7370dbcf168ae7f37e01f6 and going backwards, the issue is no longer reproducible under the conditions I described in my messages above.
Hoping not to bother the author of the commit, I kindly ask @lgfa29 for their opinion.
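For reference, the binary search described above maps directly onto `git bisect`; a hypothetical sketch run inside a checkout of the driver repository (the build-and-retest step is an assumption):

```shell
# Only run inside a checkout that actually contains the suspect commit.
if git cat-file -e e0cadd19c7c46e2fc9aad88c6832b317214b3d5d 2>/dev/null; then
    git bisect start
    git bisect bad  e0cadd19c7c46e2fc9aad88c6832b317214b3d5d   # plugin fails to load here
    git bisect good 736560e3e459144bcc7370dbcf168ae7f37e01f6   # plugin still loads here
    # At each midpoint commit: rebuild the plugin, restart Nomad, check whether
    # the driver fingerprints, then mark it `git bisect good` or `git bisect bad`
    # until git names the first bad commit.
    git bisect reset
fi
```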
I think since @Procsiab is able to reproduce this issue, it might be a good idea to reopen it.
Since I also tried updating the Nomad binary in the meantime, I will post a recap of the component versions I am using on the system on which I can reliably reproduce this issue:
OS: Fedora IoT 37.20230412.0
(*) These components are bundled at that specific version with the Fedora IoT deployment version I referred to as OS, and can be obtained by booting that specific deployment.
Regarding configuration files, I am using driver = "overlay" for container storage and unified cgroups for v2 support. Finally, the Nomad clients are run in a "rootless" fashion, as described in the official tutorial.
That commit updates the Nomad dependency + Go to 1.20. Our build process is very similar to Nomad's as well (we've been trying to standardize these things), so I would expect the Nomad binary to fail too 🤔
While we still try to reproduce this error, would you mind testing some iterations? For example, building the plugin with CGO_ENABLED=1 to match Nomad. If this works, then there may be something in the Go toolchain that does not match the dynamically loaded libraries.
I'm not sure if any of these will make a difference, but these are some of the test cases I thought about. Unfortunately we can't test updating Nomad without bumping Go.
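A sketch of that suggested test, assuming a checkout of the plugin repository and a Go toolchain (the plain `go build` invocation is an assumption; the repository's Makefile may add extra flags):

```shell
# Rebuild the driver with cgo enabled so it matches how Nomad itself is built.
if [ -f go.mod ] && command -v go >/dev/null; then
    CGO_ENABLED=1 go build -o nomad-driver-podman .
    # A cgo build links glibc dynamically, which `file` reports as
    # "dynamically linked" rather than "statically linked".
fi
```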
Thank you @lgfa29 for your suggestions: I tried both the scenarios you proposed and got the following results (on the same Fedora IoT 37.20230412.0 system): in both cases, including the build with CGO_ENABLED=1, the Nomad client starts successfully and the Podman plugin is recognized without issues.
So, wrapping up your observations, I would assume that I am not experiencing this issue because of the updated Go dependency, but more likely because of some library mismatch on the Fedora IoT OS.
I can also report that compiling the repository at the tag v0.4.2 with CGO_ENABLED=1 resolves the "Unrecognized remote plugin message" error on plugin loading.
I just tried the nightly out today with Nomad 1.6.0-beta.1 and I am still seeing issues on Fedora CoreOS 38.20230609.3.0. I pulled the latest main and compiled it myself on my Fedora 38 Workstation, which has go version go1.20.5 linux/amd64. This worked fine with Nomad 1.6.0-beta.1.
Fedora 38 moved its GNU toolchain to gcc 13.0, binutils 2.39, and glibc 2.37: https://fedoraproject.org/wiki/Changes/GNUToolchainF38
This may just be a hunch, but perhaps the GitHub Actions runners are using an older glibc, and that is what causes the issues here when CGO_ENABLED=1 is set on the build.
Does CGO_ENABLED=1 actually need to be enabled for this driver at all?
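On the question of whether a given artifact was built with cgo: since Go 1.18 the toolchain embeds build settings in the binary, so `go version -m` can answer this for the shipped plugin without the source tree (the plugin path is an example):

```shell
# Prints the embedded build settings, including a "build CGO_ENABLED=..." line.
if command -v go >/dev/null && [ -e /etc/nomad/plugins/nomad-driver-podman ]; then
    go version -m /etc/nomad/plugins/nomad-driver-podman | grep CGO_ENABLED
fi
```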
Jun 29 04:56:22 systemd[1]: Starting nomad.service - Nomad...
Jun 29 04:56:22 systemd[1]: Started nomad.service - Nomad.
Jun 29 04:56:23 nomad[6609]: ==> Loaded configuration from /etc/nomad/config/nomad.hcl
Jun 29 04:56:23 nomad[6609]: ==> Starting Nomad agent...
Jun 29 04:56:23 nomad[6609]: ==> Error starting agent: failed to create plugin loader: failed to initialize plugin loader: failed to fingerprint plugins: 1 error occurred:
Jun 29 04:56:23 nomad[6609]: * Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]: This usually means
Jun 29 04:56:23 nomad[6609]: the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]: the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]: the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]: the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]: Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]: Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]: Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]: Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]: 2023-06-29T04:56:23.080Z [ERROR] agent.plugin_loader: plugin process exited: plugin_dir=/etc/nomad/plugins path=/etc/nomad/plugins/nomad-driver-podman pid=6633 error="exit status 2"
Jun 29 04:56:23 nomad[6609]: 2023-06-29T04:56:23.081Z [WARN] agent.plugin_loader: plugin failed to exit gracefully: plugin_dir=/etc/nomad/plugins
Jun 29 04:56:23 nomad[6609]: 2023-06-29T04:56:23.081Z [ERROR] agent.plugin_loader: failed to fingerprint plugin: plugin_dir=/etc/nomad/plugins plugin=nomad-driver-podman
Jun 29 04:56:23 nomad[6609]: error=
Jun 29 04:56:23 nomad[6609]: | Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]: | This usually means
Jun 29 04:56:23 nomad[6609]: | the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]: | the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]: | the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]: | the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]: |
Jun 29 04:56:23 nomad[6609]: | Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]: | Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]: | Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]: | Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: | Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: | ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]:
Jun 29 04:56:23 nomad[6609]: 2023-06-29T04:56:23.081Z [ERROR] agent: error starting agent:
Jun 29 04:56:23 nomad[6609]: error=
Jun 29 04:56:23 nomad[6609]: | failed to create plugin loader: failed to initialize plugin loader: failed to fingerprint plugins: 1 error occurred:
Jun 29 04:56:23 nomad[6609]: | \t* Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]: | This usually means
Jun 29 04:56:23 nomad[6609]: | the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]: | the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]: | the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]: | the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]: |
Jun 29 04:56:23 nomad[6609]: | Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]: | Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]: | Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]: | Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: | Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]: | ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]: |
Jun 29 04:56:23 nomad[6609]: |
Jun 29 04:56:23 nomad[6609]:
Jun 29 04:56:23 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 29 04:56:23 systemd[1]: nomad.service: Failed with result 'exit-code'.
I suspect it's due to the outdated Nomad dependency, which still has this hard lookup of the nobody user in an init block:
https://github.com/hashicorp/nomad/blob/v1.5.0-beta.1/helper/users/lookup_unix.go
When compiled without CGO, this breaks systems like Fedora CoreOS, which moved the nobody user behind an NSS lookup instead of keeping it in /etc/passwd, the only place the pure Go users package will look for users.
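That failure mode can be seen from a shell: `getent` resolves users through NSS, while the pure-Go `os/user` fallback effectively only reads `/etc/passwd`, so a user that exists only behind NSS is invisible to a CGO_ENABLED=0 binary. A rough illustration:

```shell
# NSS-backed lookup (what a cgo build of the os/user package can do):
getent passwd nobody || echo "nobody not found via NSS"
# Flat-file lookup (roughly what the pure-Go fallback does); on Fedora
# CoreOS this prints the fallback message because nobody moved behind NSS:
grep '^nobody:' /etc/passwd || echo "nobody not in /etc/passwd"
```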
Let me pull in a fresh Nomad commit and see if it works.
I can report that https://github.com/hashicorp/nomad-driver-podman/pull/266 fixes this issue: I tested the build against an x86_64 VM and an ARM64 Raspberry Pi, both running the same Fedora IoT 38.20230630.0 deployment.
I am finally upgrading to 0.4.2 and it seems the amd64 artifact is EM_X86_64. This impacts the nightly artifact too. Building from source results in the correct ELF architecture.