hashicorp / nomad-driver-podman

A nomad task driver plugin for sandboxing workloads in podman containers
https://developer.hashicorp.com/nomad/plugins/drivers/podman
Mozilla Public License 2.0

nomad-driver-podman_0.4.2_linux_amd64.zip artifact has ELF architecture: EM_X86_64 #227

Closed jdoss closed 1 year ago

jdoss commented 1 year ago

I am finally upgrading to 0.4.2 and it seems the amd64 artifact is EM_X86_64. This impacts the nightly artifact too. Building from source results in the correct ELF architecture.

Apr 03 01:26:36 nomad[15887]:     2023-04-03T01:26:36.639Z [ERROR] agent: error starting agent:
Apr 03 01:26:36 nomad[15887]:   error=
Apr 03 01:26:36 nomad[15887]:   | failed to create plugin loader: failed to initialize plugin loader: failed to fingerprint plugins: 1 error occurred:
Apr 03 01:26:36 nomad[15887]:   | \t* Unrecognized remote plugin message:
Apr 03 01:26:36 nomad[15887]:   | This usually means
Apr 03 01:26:36 nomad[15887]:   |   the plugin was not compiled for this architecture,
Apr 03 01:26:36 nomad[15887]:   |   the plugin is missing dynamic-link libraries necessary to run,
Apr 03 01:26:36 nomad[15887]:   |   the plugin is not executable by this process due to file permissions, or
Apr 03 01:26:36 nomad[15887]:   |   the plugin failed to negotiate the initial go-plugin protocol handshake
Apr 03 01:26:36 nomad[15887]:   |
Apr 03 01:26:36 nomad[15887]:   | Additional notes about plugin:
Apr 03 01:26:36 nomad[15887]:   |   Path: /etc/nomad/plugins/nomad-driver-podman
Apr 03 01:26:36 nomad[15887]:   |   Mode: -rwxr-xr-x
Apr 03 01:26:36 nomad[15887]:   |   Owner: 0 [root] (current: 0 [root])
Apr 03 01:26:36 nomad[15887]:   |   Group: 0 [root] (current: 0 [root])
Apr 03 01:26:36 nomad[15887]:   |   ELF architecture: EM_X86_64 (current architecture: amd64)
Apr 03 01:26:36 nomad[15887]:   |
Apr 03 01:26:36 nomad[15887]:   |

shoenig commented 1 year ago

Hi @jdoss, where did you get the zip archive from? I just checked the one from releases and it looks fine:

➜ sha256sum nomad-driver-podman_0.4.2_linux_amd64.zip
bdf7c9f70c79d3d3055e73fdc6212a9bfc221ed824451be2d07b2c62ce4267c4  nomad-driver-podman_0.4.2_linux_amd64.zip

➜ unzip ./nomad-driver-podman_0.4.2_linux_amd64.zip
Archive:  ./nomad-driver-podman_0.4.2_linux_amd64.zip
  inflating: nomad-driver-podman

➜ file ./nomad-driver-podman
./nomad-driver-podman: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=tA-nyhXEkFEKjKMCrbWV/9xjUPy06QVQ82b4TatHu/Ra2CBcZm5LKHkBrSQw8R/30CfM25FK3BlHy4UB_3X, with debug_info, not stripped

➜ ./nomad-driver-podman
This binary is a plugin. These are not meant to be executed directly.
Please execute the program that consumes these plugins, which will
load any plugins automatically

jdoss commented 1 year ago

This is what I have in my automation:

ARG NOMAD_PODMAN_VERSION=0.4.2
# RUN curl -sL https://releases.hashicorp.com/nomad-driver-podman/${NOMAD_PODMAN_VERSION}/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
#   -o /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip && unzip /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
#   -d /etc/nomad/plugins

jdoss commented 1 year ago

$ curl -sL https://releases.hashicorp.com/nomad-driver-podman/${NOMAD_PODMAN_VERSION}/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip \
>   -o /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip && unzip /tmp/nomad-driver-podman_${NOMAD_PODMAN_VERSION}_linux_amd64.zip
Archive:  /tmp/nomad-driver-podman_0.4.2_linux_amd64.zip
  inflating: nomad-driver-podman     
$ file nomad-driver-podman
nomad-driver-podman: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=tA-nyhXEkFEKjKMCrbWV/9xjUPy06QVQ82b4TatHu/Ra2CBcZm5LKHkBrSQw8R/30CfM25FK3BlHy4UB_3X, with debug_info, not stripped
$ sha256sum nomad-driver-podman_0.4.2_linux_amd64.zip
bdf7c9f70c79d3d3055e73fdc6212a9bfc221ed824451be2d07b2c62ce4267c4  nomad-driver-podman_0.4.2_linux_amd64.zip

Uggg. I feel like I am taking crazy pills now.

tgross commented 1 year ago

What's even weirder is as far as I can tell from the spec EM_X86_64 is AMD64 in the ELF header. Probably worth verifying the checksum of the downloads though.

shoenig commented 1 year ago

Oh, the arch message is probably a red herring. Maybe the plugin is failing to start up because it doesn't like its configuration for some reason. @jdoss can you post your client plugin config for the podman driver?

jdoss commented 1 year ago

@shoenig It is a pretty standard config.

plugin "nomad-driver-podman" {
  config {
    socket_path = "unix://var/run/podman/podman.sock"
    volumes {
      enabled      = true
      selinuxlabel = "z"
    }
  }
}

I will switch back to the download from releases.hashicorp.com this afternoon and verify the checksum.

jdoss commented 1 year ago

Welp, I can't reproduce this anymore when switching back to releases.hashicorp.com. But now https://github.com/hashicorp/nomad-driver-podman/issues/228 is happening.

Procsiab commented 1 year ago

Hello there, I have two scenarios in my infrastructure: in the first I can reproduce this issue, and in the second I cannot.

1. Reproducible

The Nomad log shows an "Unrecognized remote plugin message" error and the same trace that @jdoss reported in the first message.

In this case, the following components are installed on the system:

2. Non-reproducible

The Nomad log shows no alerts after stopping Nomad, replacing the 0.4.1 plugin file with the 0.4.2 binary, and starting Nomad again (note: the same procedure was performed for the scenario above).

The installed components in this system are:


Both results hold even after a complete reboot of the systems: the issue is still reproducible in the first scenario and still not reproducible in the second.

Do you have any suggestions on other components to check or other tests to perform to nail down the cause?

Thanks in advance for your time.

Procsiab commented 1 year ago

Following up on my previous message: I have updated Podman to 4.4.4 on the system where the issue is not reproducible as well, and it still isn't. I then compared the Nomad configurations of the systems that show the issue with the one that doesn't, and did the same for the Podman configurations: in every case, the systems are configured identically with regard to Nomad, Podman and CGroups/CNI/overlayfs. The only other notable difference, which I did not think about yesterday, is that the system on which I cannot reproduce the issue is running CentOS Stream 9, while the others have Fedora IoT installed.

To support the assumption that something specific to Fedora IoT is causing the plugin to fail to load, I set up two x86_64 VMs, one with Fedora Workstation 37 and the other with Fedora IoT 37.20230406.0; I then installed the same Consul 1.15.1, Nomad 1.5.1 and Podman 4.4.4 on both and copied over the same configuration files for Consul, Nomad and Podman.

I can reproduce the same behaviour reported by @jdoss on the Fedora IoT VM when loading the nomad-driver-podman 0.4.2 plugin, but not when loading the 0.4.1 one; on the Fedora Workstation VM, both versions of the plugin load without issues.

Do you have any suggestions on what to look for? In addition, I am using CGroups v2 and crun as the OCI runtime.

Procsiab commented 1 year ago

Through a binary search and subsequent testing among the commits between tags v0.4.1 and v0.4.2, I can reproduce the issue consistently starting from commit e0cadd19c7c46e2fc9aad88c6832b317214b3d5d. In contrast, from commit 736560e3e459144bcc7370dbcf168ae7f37e01f6 and going backwards, the issue is no longer reproducible under the conditions I described in my messages above.

Hoping to not bother the author of the commit, I kindly ask the opinion of @lgfa29

jdoss commented 1 year ago

I think since @Procsiab is able to reproduce this issue, it might be a good idea to reopen it.

Procsiab commented 1 year ago

Since I also tried updating the Nomad binary in the meantime, I will post a recap of the component versions I am using on the system on which I am able to reliably reproduce this issue:

(*) These components are bundled at that specific version with the Fedora IoT deployment version I referred to as OS, and can be obtained by booting that specific deployment

Regarding configuration files, I am using driver = "overlay" for container storage and unified cgroups for v2 support. Finally, the Nomad clients are run in a "rootless" fashion, as described in the official tutorial.

lgfa29 commented 1 year ago

> Through a binary search and subsequent testing among the commits between tags v0.4.1 and v0.4.2, I can reproduce the issue consistently starting from commit e0cadd1 . In contrast, from the commit 736560e and going backwards the issue is not reproducible anymore in the conditions I described in my messages above.
>
> Hoping to not bother the author of the commit, I kindly ask the opinion of @lgfa29

That commit updates the Nomad dependency + Go to 1.20. Our build process is very similar to Nomad as well (we've been trying to standardize these things), so I would expect the Nomad binary to fail too 🤔

While we still try to reproduce this error would you mind testing some iterations?

I'm not sure if any of these will make a difference, but these are some of the test cases I thought about. Unfortunately we can't test updating Nomad without bumping Go.

Procsiab commented 1 year ago

Thank you @lgfa29 for your suggestion: I tried both the scenarios you proposed and got the following results (on the same Fedora IoT 37.20230412.0 system):

Compile 736560e3e459144bcc7370dbcf168ae7f37e01f6 with Go 1.20

The Nomad client starts successfully and the Podman plugin is recognized without issues.

Compile e0cadd19c7c46e2fc9aad88c6832b317214b3d5d using CGO_ENABLED=1

The Nomad client starts successfully and the Podman plugin is recognized without issues.

So, wrapping up your observations, I would assume that I am not experiencing an issue because of the updated Go dependency, but more likely because of some library mismatch on the Fedora IoT OS.

I can also report that compiling the repository at the tag v0.4.2 with CGO_ENABLED=1 resolves the "Unrecognized remote plugin message" error on the plugin loading.

jdoss commented 1 year ago

I just tried the nightly out today with Nomad 1.6.0-beta.1 and I am still seeing issues on Fedora CoreOS 38.20230609.3.0. I pulled the latest main and compiled it myself on my Fedora 38 Workstation, which has go version go1.20.5 linux/amd64. This worked fine with Nomad 1.6.0-beta.1.

Fedora 38 moved its GNU Toolchain to gcc 13.0, binutils 2.39, and glibc 2.37. https://fedoraproject.org/wiki/Changes/GNUToolchainF38

This is just a hunch, but maybe the GitHub Actions builders are using an older glibc, and that is causing the issue here if CGO_ENABLED=1 is set for the build.

Does CGO_ENABLED=1 actually need to be enabled for this driver at all?

Jun 29 04:56:22 systemd[1]: Starting nomad.service - Nomad...
Jun 29 04:56:22 systemd[1]: Started nomad.service - Nomad.
Jun 29 04:56:23 nomad[6609]: ==> Loaded configuration from /etc/nomad/config/nomad.hcl
Jun 29 04:56:23 nomad[6609]: ==> Starting Nomad agent...
Jun 29 04:56:23 nomad[6609]: ==> Error starting agent: failed to create plugin loader: failed to initialize plugin loader: failed to fingerprint plugins: 1 error occurred:
Jun 29 04:56:23 nomad[6609]:         * Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]: This usually means
Jun 29 04:56:23 nomad[6609]:   the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]:   the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]:   the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]:   the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]: Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]:   Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]:   Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]:   Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]:     2023-06-29T04:56:23.080Z [ERROR] agent.plugin_loader: plugin process exited: plugin_dir=/etc/nomad/plugins path=/etc/nomad/plugins/nomad-driver-podman pid=6633 error="exit status 2"
Jun 29 04:56:23 nomad[6609]:     2023-06-29T04:56:23.081Z [WARN]  agent.plugin_loader: plugin failed to exit gracefully: plugin_dir=/etc/nomad/plugins
Jun 29 04:56:23 nomad[6609]:     2023-06-29T04:56:23.081Z [ERROR] agent.plugin_loader: failed to fingerprint plugin: plugin_dir=/etc/nomad/plugins plugin=nomad-driver-podman
Jun 29 04:56:23 nomad[6609]:   error=
Jun 29 04:56:23 nomad[6609]:   | Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]:   | This usually means
Jun 29 04:56:23 nomad[6609]:   |   the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]:   |   the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]:   |   the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]:   |   the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]:   |
Jun 29 04:56:23 nomad[6609]:   | Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]:   |   Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]:   |   Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]:   |   Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   |   Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   |   ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]:   
Jun 29 04:56:23 nomad[6609]:     2023-06-29T04:56:23.081Z [ERROR] agent: error starting agent:
Jun 29 04:56:23 nomad[6609]:   error=
Jun 29 04:56:23 nomad[6609]:   | failed to create plugin loader: failed to initialize plugin loader: failed to fingerprint plugins: 1 error occurred:
Jun 29 04:56:23 nomad[6609]:   | \t* Unrecognized remote plugin message:
Jun 29 04:56:23 nomad[6609]:   | This usually means
Jun 29 04:56:23 nomad[6609]:   |   the plugin was not compiled for this architecture,
Jun 29 04:56:23 nomad[6609]:   |   the plugin is missing dynamic-link libraries necessary to run,
Jun 29 04:56:23 nomad[6609]:   |   the plugin is not executable by this process due to file permissions, or
Jun 29 04:56:23 nomad[6609]:   |   the plugin failed to negotiate the initial go-plugin protocol handshake
Jun 29 04:56:23 nomad[6609]:   |
Jun 29 04:56:23 nomad[6609]:   | Additional notes about plugin:
Jun 29 04:56:23 nomad[6609]:   |   Path: /etc/nomad/plugins/nomad-driver-podman
Jun 29 04:56:23 nomad[6609]:   |   Mode: -rwxr-xr-x
Jun 29 04:56:23 nomad[6609]:   |   Owner: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   |   Group: 0 [root] (current: 0 [root])
Jun 29 04:56:23 nomad[6609]:   |   ELF architecture: EM_X86_64 (current architecture: amd64)
Jun 29 04:56:23 nomad[6609]:   |
Jun 29 04:56:23 nomad[6609]:   |
Jun 29 04:56:23 nomad[6609]:   
Jun 29 04:56:23 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 29 04:56:23 systemd[1]: nomad.service: Failed with result 'exit-code'.

shoenig commented 1 year ago

I suspect it's due to the outdated Nomad dependency where it still has this hard lookup of the nobody user in an init block.

https://github.com/hashicorp/nomad/blob/v1.5.0-beta.1/helper/users/lookup_unix.go

When compiled without CGO, this breaks systems like Fedora CoreOS which moved the nobody user behind an NSS lookup instead of keeping it in /etc/passwd, which is the only place the pure Go users package will look for users.
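
To see this difference on an affected host, here is a small sketch that just calls the standard `os/user` lookup and reports the outcome. The `lookupStatus` helper is hypothetical and this is not the Nomad code linked above. The idea is to build it twice, once with `CGO_ENABLED=0` and once with `CGO_ENABLED=1`, and compare the output on the same machine.

```go
package main

import (
	"errors"
	"fmt"
	"os/user"
)

// lookupStatus resolves a user name with os/user and describes the result.
// Hypothetical helper for illustration.
func lookupStatus(name string) string {
	u, err := user.Lookup(name)
	var unknown user.UnknownUserError
	switch {
	case err == nil:
		return fmt.Sprintf("found %s (uid=%s gid=%s)", u.Username, u.Uid, u.Gid)
	case errors.As(err, &unknown):
		// The resolver ran but has no entry for this name.
		return "not found: " + name
	default:
		return "lookup failed: " + err.Error()
	}
}

func main() {
	fmt.Println(lookupStatus("nobody"))
}
```

If the explanation above is right, the pure Go build would report the user as not found on a system where nobody is not listed in /etc/passwd, while the cgo build would find it through NSS.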

Let me pull in a fresh Nomad commit and see if it works.

Procsiab commented 1 year ago

I can report that https://github.com/hashicorp/nomad-driver-podman/pull/266 fixes this issue: I tested the build against an x86_64 VM and an ARM64 Raspberry Pi, both running the same Fedora IoT 38.20230630.0 deployment.