hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

error during artifact download: landlock failed to lock (NixOS) #18721

Open cottand opened 11 months ago

cottand commented 11 months ago

Nomad version

Output from nomad version

1.6.1

Operating system and Environment details

NixOS 23.05 with the additional package nomad 1.6.1

# uname --all
Linux cosmo 6.1.53 #1-NixOS SMP PREEMPT_DYNAMIC Wed Sep 13 07:43:05 UTC 2023 x86_64 GNU/Linux

Issue

Downloading artifacts fails the task with a Landlock error: "landlock failed to lock: no such file or directory" (full client logs below).

A guess: go-getter or Nomad requires some binary dependency that is not in Nomad's PATH (this can happen on NixOS, since fewer binaries are installed by default).

Reproduction steps

Add an artifact block to a task and run the job.

Expected Result

Artifact downloaded as usual

Actual Result

The task fails with the errors shown in the client logs below.

Job file (if appropriate)

job "prometheus" {
  datacenters = ["dc1"]
  type        = "service"

  group "monitoring" {
    count = 1

    network {
      mode = "bridge"
      dns {
        servers = [...]
      }
      port "http" {}
    }

    constraint {
      attribute = "${attr.nomad.bridge.hairpin_mode}"
      value     = true
    }

    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      size    = 256 # MB
      migrate = true
      sticky  = true
    }

    task "prometheus" {
      template {
        change_mode = "restart"
        destination = "local/prometheus.yml"

        data = <<EOH
---
global:
  scrape_interval:     30s
  evaluation_interval: 30s

rule_files:
  - /local/mimir_rules.yaml

scrape_configs:
  # omitted for brevity

EOH
      }
      artifact {
        source      = "https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml"
        mode        = "file"
        destination = "local/mimir_rules.yaml"
      }

      driver = "docker"

      config {
        image = "prom/prometheus:latest"

        volumes = [
          "local/prometheus.yml:/etc/prometheus/prometheus.yml",
        ]

        args = [
          "--web.route-prefix=/",
          "--web.external-url=http://prometheus.traefik",
          "--config.file=/etc/prometheus/prometheus.yml",
          "--enable-feature=agent",
          "--web.enable-remote-write-receiver",
          "--enable-feature=exemplar-storage"
        ]

        ports = ["http"]
      }

      service {
        name     = "prometheus"
        provider = "nomad"
        port     = "http"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.677+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus type="Downloading Artifacts" msg="Client is downloading artifacts" failed=false
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.728+0100 [ERROR] client.artifact: sub-process: OUTPUT="failed to sandbox artifact-isolation process: landlock failed to lock: no such file or directory"
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.729+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus type="Failed Artifact Download" msg="failed to download artifact \"https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml\": getter subprocess failed: exit status 1" failed=false
Oct 11 11:47:04 cosmo nomad[4074065]:     2023-10-11T11:47:04.730+0100 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=4bd91cd4-0560-4e4a-a506-af97f985c137 task=prometheus error="prestart hook \"artifacts\" failed: failed to download artifact \"https://github.com/grafana/mimir/raw/main/operations/mimir-mixin-compiled/rules.yaml\": getter subprocess failed: exit status 1"

shoenig commented 11 months ago

Hi @Cottand, you can see the file paths Nomad is trying to restrict with landlock in https://github.com/hashicorp/nomad/blob/main/client/allocrunner/taskrunner/getter/util_linux.go

Note that we already try to avoid locking anything that does not exist, so it is surprising to see "landlock failed to lock: no such file or directory".

cottand commented 11 months ago

I am very unfamiliar with landlock, but NixOS makes extensive use of soft and hard links.

None of the paths in the file you linked stands out to me as NixOS-specific, but I am not an expert on the inner workings of the distro either.

root@x /# ls -l /usr/local/bin
ls: cannot access '/usr/local/bin': No such file or directory
root@x /# ls -l /usr/bin
total 4
lrwxrwxrwx 1 root root 65 Oct 11 11:24 env -> /nix/store/j4fwy5gi1rdlrlbk2c0vnbs7fmlm60a7-coreutils-9.1/bin/env
root@x /# ls -l /bin
total 4
lrwxrwxrwx 1 root root 75 Oct 11 11:24 sh -> /nix/store/kxkdrxvc3da2dpsgikn8s2ml97h88m46-bash-interactive-5.2-p15/bin/sh
root@x /# ls -l /

Maybe this will be useful.

acaloiaro commented 11 months ago

This may be related to how Nix "generations" and symlinks work.

When Nix systems switch to new generations, a bunch of symlinks get updated to point to new locations.

NixOS provides an option called environment.pathsToLink, and a common value for it is environment.pathsToLink = [ "/libexec" ];.

This makes /libexec point to the "current generation" of the system at any point in time. If a user runs nixos-rebuild switch, the inode that the symlink previously pointed at is then collected by nix-collect-garbage (or some other means), and /libexec's inode was cached anywhere, the cached reference will now point to an inaccessible inode. I also see that Landlock appears to be locking /libexec.

It's possible that an inode that landlock is attempting to lock no longer exists.

cottand commented 11 months ago

I do not have that particular option set (and I have neither /libexec nor /usr/libexec), but other landlocked files that are part of my config are symlinks, specifically the ssh config:

# ls -la /etc/ssh

...
lrwxrwxrwx  1 root root   22 Oct 12 10:52 moduli -> /etc/static/ssh/moduli
lrwxrwxrwx  1 root root   26 Oct 12 10:52 ssh_config -> /etc/static/ssh/ssh_config
lrwxrwxrwx  1 root root   27 Oct 12 10:52 sshd_config -> /etc/static/ssh/sshd_config
lrwxrwxrwx  1 root root   31 Oct 12 10:52 ssh_known_hosts -> /etc/static/ssh/ssh_known_hosts

tgross commented 11 months ago

Nobody's mentioned it yet, but I do want to point out the workaround, which is to disable the Landlock filesystem isolation for artifacts via the agent's client.artifact.disable_filesystem_isolation option. That'll at least get folks unblocked while this discussion goes on.
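
For anyone looking for the concrete setting, here is a minimal sketch of the agent configuration. It assumes the option sits in the artifact block nested inside client, as laid out in the Nomad client configuration reference. Merge it into your existing client block rather than replacing it, and check the reference for your Nomad version before relying on it.

```hcl
# Nomad agent configuration: client section only.
# Assumption: disable_filesystem_isolation is set inside the client's
# artifact block, per the Nomad client configuration reference.
client {
  artifact {
    # Turns off the Landlock sandbox around the artifact download
    # sub-process. This trades isolation for compatibility, so it is
    # best treated as a temporary workaround.
    disable_filesystem_isolation = true
  }
}
```

The agent presumably needs a restart to pick up the change. With isolation disabled, the artifact downloader is no longer restricted to the allowlisted paths, which is why this is a stopgap and not a fix.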

koalalorenzo commented 5 months ago

I am also running on NixOS, and disabling the filesystem isolation mitigates the issue for now :(