pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

gpu passthrough to plex-workers in docker swarm not working #314

Closed alex-w-k closed 5 days ago

alex-w-k commented 3 weeks ago

Describe the bug
After playing with a few things, it appears that I can get the containers to run with the following config, but I am not seeing any activity on my GPUs.

`Additional property devices is not allowed`

To Reproduce
Steps to reproduce the behavior: deploy the current docker compose stack:

version: '3.8'

services:
  plex:
    image: ghcr.io/linuxserver/plex:latest
    deploy:
      mode: replicated
      replicas: 1
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_dockermod:latest"
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: ${TZ}
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      PMS_SERVICE: plex     # This service. If you disable Local Relay then you must use PMS_IP instead
      PMS_PORT: "32400"
      TRANSCODE_OPERATING_MODE: both #(local|remote|both)
      TRANSCODER_VERBOSE: "1"   # 1=verbose, 0=silent
      LOCAL_RELAY_ENABLED: "1"
      LOCAL_RELAY_PORT: "32499"
    healthcheck:
      test: curl -fsS http://localhost:32400/identity > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 30s
    volumes:
      - /ceph/docker-data/plex/config:/config
      - /mnt:/mnt
      - /ceph/docker-data/plex/transcode:/transcode
    ports:
      - 32499:32499     # LOCAL_RELAY_PORT
      - 32400:32400
      - 3005:3005
      - 8324:8324
      - 1900:1900/udp
      - 32410:32410/udp
      - 32412:32412/udp
      - 32413:32413/udp
      - 32414:32414/udp

  plex-orchestrator:
    image: ghcr.io/pabloromeo/clusterplex_orchestrator:latest
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        order: start-first
    healthcheck:
      test: curl -fsS http://localhost:3500/health > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 30s
    environment:
      TZ: ${TZ}
      LISTENING_PORT: 3500
      WORKER_SELECTION_STRATEGY: "LOAD_RANK" # RR | LOAD_CPU | LOAD_TASKS | LOAD_RANK (default)
    volumes:
      - /etc/localtime:/etc/localtime:ro
    ports:
      - 3500:3500

  plex-worker:
    image: ghcr.io/linuxserver/plex:latest
    hostname: "plex-worker-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 2
      placement:
        constraints:
          - node.labels.gpu==true
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest"
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: ${TZ}
      LISTENING_PORT: 3501      # used by the healthcheck
      STAT_CPU_INTERVAL: 2000   # interval for reporting worker load metrics
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      EAE_SUPPORT: "1"
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      FFMPEG_HWACCEL: "nvdec"
    healthcheck:
      test: curl -fsS http://localhost:3501/health > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 240s
    volumes:
      - /mnt:/mnt
      - /ceph/docker-data/plex/transcode:/transcode

Expected behavior
I expect GPU passthrough to work.


albertsj1 commented 1 week ago

I'm not very experienced with this, but I'm pretty sure you need to add /dev/dri (or whatever works for your particular graphics card) to your list of volumes for the worker and main plex nodes.

Here's my config

pabloromeo commented 1 week ago

I believe you may be missing the part that specifies the runtime as "nvidia". Also, take a look at the instructions for the official linuxserver image, as they apply to the clusterplex workers too. There's a section on Nvidia hardware transcoding: https://hub.docker.com/r/linuxserver/plex
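
For reference, in plain Docker Compose (outside swarm mode) the runtime is usually selected with the service-level `runtime` key. A minimal sketch, assuming the NVIDIA Container Toolkit is installed on the host and registered with Docker under the name `nvidia`; the service name and image are simply reused from the stack above:

```yaml
services:
  plex-worker:
    image: ghcr.io/linuxserver/plex:latest
    runtime: nvidia   # assumption: the host has a runtime registered as "nvidia"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
```

This key comes from the non-swarm Compose workflow, so it may not carry over to a stack deployed with `docker stack deploy`.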

alex-w-k commented 1 week ago

> I'm not very experienced with this, but I'm pretty sure you need to add /dev/dri (or whatever works for your particular graphics card) to your list of volumes for the worker and main plex nodes.
>
> Here's my config

Cool, thank you! I will try this and see if it helps.

alex-w-k commented 1 week ago

> I believe you may be missing the part that specifies the runtime as "nvidia". Also, take a look at the instructions on the official linuxserver image, as it applies to the clusterplex workers. There's a section for Nvidia hardware transcoding: https://hub.docker.com/r/linuxserver/plex

This is not the case for swarm mode. As I understand it, you cannot specify a runtime per service in Docker Swarm, although you can set a default runtime on each Docker host in /etc/docker/daemon.json, which I have done on both nodes with GPUs.

pabloromeo commented 6 days ago

I believe the user in this comment got it working with their Nvidia GPU while running in Docker Swarm.

https://github.com/pabloromeo/clusterplex/pull/81#issuecomment-1868399737

Maybe adding the generic resource request similar to how they did it is what's missing.
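
A sketch of what such a generic resource request could look like in a stack file. The resource kind `NVIDIA-GPU` is an assumption here: it has to match whatever name each GPU node advertises through `node-generic-resources` in its /etc/docker/daemon.json, and the linked comment may use a different name.

```yaml
services:
  plex-worker:
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "NVIDIA-GPU"   # assumption: must match the name advertised by the node
                value: 1             # number of GPUs reserved per worker task
```

With a reservation like this, swarm should only schedule a worker task on a node that still has an unreserved GPU of that kind.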

alex-w-k commented 5 days ago

> I believe the user in this comment got it working with their Nvidia GPU while running in Docker Swarm.
>
> #81 (comment)
>
> Maybe adding the generic resource request similar to how they did it is what's missing.

> I'm not very experienced with this, but I'm pretty sure you need to add /dev/dri (or whatever works for your particular graphics card) to your list of volumes for the worker and main plex nodes.
>
> Here's my config

Between these two suggestions I was able to narrow down the issue. The final fix was adding the GPU as a device rather than just a volume:

devices:
  - /dev/dri:/dev/dri

Now my full worker config looks like this:

  plex-worker:
    image: ghcr.io/pabloromeo/clusterplex_worker:latest
    hostname: "plex-worker-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 2
      placement:
        constraints:
          - node.labels.gpu==true
    environment:
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: ${TZ}
      LISTENING_PORT: 3501      # used by the healthcheck
      STAT_CPU_INTERVAL: 2000   # interval for reporting worker load metrics
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      EAE_SUPPORT: "1"
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      FFMPEG_HWACCEL: cuda
    healthcheck:
      test: curl -fsS http://localhost:3501/health > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 240s
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /ceph/docker-data/plex/codecs:/codecs
      - /mnt:/mnt
      - /ceph/docker-data/plex/transcode:/transcode

pabloromeo commented 5 days ago

Awesome! Thank you so much for sharing the working config. I'm sure it'll be useful for others.