error during artifact download: landlock failed to lock (NixOS) #18721

Open cottand opened 11 months ago

cottand commented 11 months ago

Nomad version

Output from nomad version


Operating system and Environment details

NixOS 23.05 with additional package nomad 1.6.1

# uname --all
Linux cosmo 6.1.53 #1-NixOS SMP PREEMPT_DYNAMIC Wed Sep 13 07:43:05 UTC 2023 x86_64 GNU/Linux


Downloading artifacts fails the task with a landlock error: "landlock failed to lock: no such file or directory".

A guess: go-getter or Nomad requires some binary dependency that is not in Nomad's PATH (this can happen on NixOS, which installs fewer binaries by default).

Reproduction steps

Add an artifact { ... } block to a task and run the job.

Expected Result

Artifact downloaded as usual

Actual Result

The task fails, with the client logs shown below.

Job file (if appropriate)

job "prometheus" {
  datacenters = ["dc1"]
  type        = "service"

  group "monitoring" {
    count = 1

    network {
      mode = "bridge"
      dns {
        servers = [...]
      }
      port "http" {}
    }

    constraint {
      attribute = "${attr.nomad.bridge.hairpin_mode}"
      value     = true
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size    = 256 # MB
      migrate = true
      sticky  = true
    }

    task "prometheus" {
      template {
        change_mode = "restart"
        destination = "local/prometheus.yml"

        data = <<EOH
global:
  scrape_interval:     30s
  evaluation_interval: 30s

rule_files:
  - /local/mimir_rules.yaml
EOH
      }

      artifact {
        source      = "https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml"
        mode        = "file"
        destination = "local/mimir_rules.yaml"
      }

      driver = "docker"

      config {
        image = "prom/prometheus:latest"

        volumes = [...]

        args = [...]

        ports = ["http"]
      }

      service {
        name     = "prometheus"
        provider = "nomad"
        port     = "http"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.677+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.728+0100 [ERROR] client.artifact: sub-process: OUTPUT="failed to sandbox artifact-isolation process: landlock failed to lock: no such file or directory"
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.729+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus type="Failed Artifact Download" msg="failed to download artifact \"https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml\": getter subprocess failed: exit status 1" failed=false
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.730+0100 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus error="prestart hook \"artifacts\" failed: failed to download artifact \"https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml\": getter subprocess failed: exit status 1"
shoenig commented 11 months ago

Hi @Cottand, you can see the file paths Nomad is trying to restrict with landlock in https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/getter/util_linux.go

Note that we already try to avoid locking on anything that does not exist, so it is surprising to see "landlock failed to lock: no such file or directory".

cottand commented 11 months ago

I am very unfamiliar with landlock, but NixOS makes extensive use of soft and hard links.

None of the paths in the file you linked stand out to me as NixOS-specific, but I am not an expert on the inner workings of the distro either.

root@x /# ls -l /usr/local/bin
ls: cannot access '/usr/local/bin': No such file or directory
root@x /# ls -l /usr/bin
total 4
lrwxrwxrwx 1 root root 65 Oct 11 11:24 env -> /nix/store/j4fwy5gi1rdlrlbk2c0vnbs7fmlm60a7-coreutils-9.1/bin/env
root@x /# ls -l /bin
total 4
lrwxrwxrwx 1 root root 75 Oct 11 11:24 sh -> /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/sh
root@x /# ls -l /

Maybe this will be useful.

acaloiaro commented 11 months ago

This may be related to how Nix "generations" and symlinks work.

When Nix systems switch to new generations, a bunch of symlinks get updated to point to new locations.

NixOS provides an option called environment.pathsToLink, and a common value for it is environment.pathsToLink = [ "/libexec" ];.

This makes /libexec point to the "current generation" of the system at any point in time. If a user runs nixos-rebuild switch and the inode that the symlink previously pointed at is then collected by nix-collect-garbage (or by some other means), anything that cached /libexec's old inode now refers to an inaccessible inode. I also see that landlock appears to be locking /libexec.

It's possible that an inode that landlock is attempting to lock no longer exists.
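
One way to test this hypothesis is to look for symlinks whose targets have disappeared. The sketch below is only a diagnostic aid, not part of Nomad: `classify_path` is a hypothetical helper, and the path list is a sample taken from paths mentioned in this thread, not the authoritative list in util_linux.go. It relies on `[ -e ]` following symlinks while `[ -L ]` inspects the link itself.

```shell
#!/bin/sh
# classify_path is a hypothetical helper: it prints "ok" when the path
# resolves, "dangling-symlink" when the path is a symlink whose target is
# gone, and "missing" when nothing exists at that path at all.
classify_path() {
  if [ -e "$1" ]; then
    # -e follows symlinks, so this also covers links with a live target.
    echo "ok"
  elif [ -L "$1" ]; then
    # The link itself exists, but its target does not.
    echo "dangling-symlink"
  else
    echo "missing"
  fi
}

# Sample paths taken from this thread (an assumption, not Nomad's real list).
for p in /bin /usr/bin /usr/local/bin /libexec /usr/libexec \
         /etc/ssh/ssh_config /etc/ssh/ssh_known_hosts; do
  printf '%s\t%s\n' "$(classify_path "$p")" "$p"
done
```

A "dangling-symlink" result would be a path that passes a naive existence check on the link but cannot actually be opened, which would fit the "no such file or directory" error. A run where every path is "ok" or "missing" would point away from this explanation.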

cottand commented 11 months ago

I do not have that particular option set (and I have neither /libexec nor /usr/libexec), but I do have other landlocked files that are symlinked as part of my config, specifically the ssh config:

# ls -la /etc/ssh

lrwxrwxrwx  1 root root   22 Oct 12 10:52 moduli -> /etc/static/ssh/moduli
lrwxrwxrwx  1 root root   26 Oct 12 10:52 ssh_config -> /etc/static/ssh/ssh_config
lrwxrwxrwx  1 root root   27 Oct 12 10:52 sshd_config -> /etc/static/ssh/sshd_config
lrwxrwxrwx  1 root root   31 Oct 12 10:52 ssh_known_hosts -> /etc/static/ssh/ssh_known_hosts
tgross commented 11 months ago

Nobody's mentioned it yet, but I do want to point out the workaround, which is to disable the Landlock filesystem isolation for artifacts via the agent's client.artifact.disable_filesystem_isolation option. That'll at least get folks unblocked while this discussion goes on.
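
As a rough sketch of that workaround, the fragment below goes in the Nomad client agent configuration, not in the job file. It assumes the option is nested under the client block's artifact sub-block, as described in Nomad's client configuration reference. The `enabled = true` line is only there to make the fragment readable on its own.

```hcl
# Nomad client agent configuration (not the job file).
client {
  enabled = true

  artifact {
    # Turns off the Landlock-based filesystem isolation that wraps the
    # artifact download sub-process. The sandbox is lost, so treat this
    # as a temporary workaround.
    disable_filesystem_isolation = true
  }
}
```

The agent has to be restarted to pick up the change. With isolation disabled, the artifact downloader is no longer restricted to an allow-list of filesystem paths.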

koalalorenzo commented 5 months ago

I am also running on NixOS, and disabling the filesystem isolation mitigates the issue for now :(