pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

Verbose logging #248

Closed christianmerges closed 9 months ago

christianmerges commented 9 months ago

Describe the bug: When I deploy in Docker Swarm with Docker mods, the movie doesn't start, but in the dashboard section of Plex the playback time of the movie is already advancing. How can I debug this?

yaml

version: '3.8'
services:
  plex:
    image: ghcr.io/linuxserver/plex
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_dockermod:latest"
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: Europe/Berlin
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      PMS_SERVICE: plex     # This service. If you disable Local Relay then you must use PMS_IP instead
      PMS_PORT: "32400"
      TRANSCODE_OPERATING_MODE: remote #(local|remote|both)
      TRANSCODER_VERBOSE: "1"   # 1=verbose, 0=silent
      LOCAL_RELAY_ENABLED: "1"
      LOCAL_RELAY_PORT: "32499"
    volumes:
      - /mnt/docker/plex/config:/config
      - /mnt/transcode:/transcode  #glusterfs volume on both docker hosts
      - /mnt/merge/:/cloud     #media folder
      - /etc/localtime:/etc/localtime:ro
      - /mnt/docker/plex/tmp:/tmp   #ramdisk too small when generating thumbnails etc. for new media
    networks:
      - pirate
    deploy:
      placement:
        constraints: [node.labels.type == cpu]
      mode: replicated
      replicas: 1

  plex-orchestrator:
    image: ghcr.io/pabloromeo/clusterplex_orchestrator:latest
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        order: start-first
      placement:
        constraints: [node.labels.type == cpu]
    healthcheck:
      test: curl -fsS http://localhost:3500/health > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 30s
    environment:
      TZ: Europe/Berlin
      LISTENING_PORT: 3500
      WORKER_SELECTION_STRATEGY: "LOAD_RANK" # RR | LOAD_CPU | LOAD_TASKS | LOAD_RANK (default)
    volumes:
      - /etc/localtime:/etc/localtime:ro

  plex-worker:
    image: ghcr.io/linuxserver/plex:latest
    hostname: "plex-worker-{{.Node.Hostname}}"
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.labels.type == gpu]
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest"
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: Europe/Berlin
      LISTENING_PORT: 3501      # used by the healthcheck
      STAT_CPU_INTERVAL: 2000   # interval for reporting worker load metrics
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      EAE_SUPPORT: "1"
      FFMPEG_HWACCEL: "true"
    healthcheck:
      test: curl -fsS http://localhost:3501/health > /dev/null || exit 1
      interval: 15s
      timeout: 15s
      retries: 5
      start_period: 240s
    volumes:
      - /mnt/transcode:/transcode        #glusterfs volume
      - /mnt/merge:/cloud                   #media folder

networks:
  pirate:
    driver: overlay
    driver_opts:
      com.docker.network.driver.mtu: 1200

Expected behavior: The worker should transcode.

Server (please complete the following information):

Additional context: I can see that a sub-folder is created in the /transcode directory, but there are no files inside.

pabloromeo commented 9 months ago

I believe the value for FFMPEG_HWACCEL is incorrect. It should be set to vaapi, or cuda, or whichever driver you are looking to use.
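As a minimal sketch of that change, the worker's environment from the stack above could look like the fragment below. Only the FFMPEG_HWACCEL line differs from the posted config. Choosing cuda here is an assumption based on the Nvidia variables in the stack; vaapi or another driver may be the right value depending on the hardware and drivers on the worker node.

```yaml
  plex-worker:
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest"
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      EAE_SUPPORT: "1"
      # Was "true" in the posted stack; use the driver name instead.
      # "cuda" is assumed for an Nvidia GPU; "vaapi" is the other value mentioned above.
      FFMPEG_HWACCEL: "cuda"
```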

Regarding debugging, the first thing we should probably look at is the logs of the worker. It will probably let us know what error is being thrown.

Also, for using Nvidia GPUs I believe swarm won't work for the worker, since it requires you to specify the devices and runtime, which last time I checked swarm compose didn't support. So you might need to run that as a separate compose stack. See https://github.com/pabloromeo/clusterplex/pull/81 as an example.
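A rough sketch of what such a separate worker stack might look like is below. It is not taken from the linked PR. The file name, the fixed hostname, the `runtime: nvidia` line, and the external network name are all assumptions for illustration: it presumes the NVIDIA container runtime is installed on the GPU host and that the swarm overlay network was created as attachable so a non-swarm container can join it.

```yaml
# docker-compose.worker.yml (hypothetical file name), started directly on the
# GPU host with "docker compose up -d" instead of being deployed to the swarm.
services:
  plex-worker:
    image: ghcr.io/linuxserver/plex:latest
    hostname: plex-worker-gpu1      # fixed name chosen for this sketch
    runtime: nvidia                 # assumes the NVIDIA container runtime is installed
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest"
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: all
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: Europe/Berlin
      LISTENING_PORT: 3501
      STAT_CPU_INTERVAL: 2000
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      EAE_SUPPORT: "1"
      FFMPEG_HWACCEL: "cuda"        # assumed value, see the comment above
    volumes:
      - /mnt/transcode:/transcode
      - /mnt/merge:/cloud
    networks:
      - pirate

networks:
  pirate:
    # Assumes the overlay network already exists and was created as attachable.
    # When deployed through a stack, the real network name is usually prefixed
    # with the stack name, so this name may need adjusting.
    external: true
```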

christianmerges commented 9 months ago

I think adding the card is working, because I see the following in the container log:

adding /dev/dri/card0 to video group root with id 0
permissions for /dev/dri/renderD128 are good

It looks like forwarding devices is now handled by the Nvidia Docker plugin.

I changed the value to "cuda". Still no stream.

Strangely, I can see the video playing in the small thumbnail in the Plex dashboard.

Last lines of the worker log file since container start:

Codec libzmbv_decoder.so already exists. Skipping
EAE_SUPPORT => 1
EAE_EXECUTABLE => /codecs/8217c1c-4578-linux-x86_64-standard/EasyAudioEncoder/EasyAudioEncoder/EasyAudioEncoder
FFMPEG_HWACCEL => cuda
ON_DEATH: debug mode enabled for pid [709]
Computed CPU ops => 1019419
Initializing Worker 4a462f46-0372-4b36-b47d-226c3010e098|plex-worker-gpu1
Worker listening on port 3501
Worker connected on socket enzbw1ROOaz7vZj2AAAD

I removed the FFMPEG_HWACCEL variable completely for testing, and it is still not transcoding. It looks like the job never arrives at the worker. When transcoding is working, should there be something in the container log on the worker?

christianmerges commented 9 months ago

The issue was related to the Docker network.

I forgot to add networks:

to the plex containers. I figured it out by running "wget plex:32400" on the worker, which returned a name resolution error. Using the IP address, however, I was able to connect.
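For anyone hitting the same symptom, here is a minimal sketch of the fix, assuming the stack posted at the top of this issue. In that stack only the plex service lists the pirate network, while plex-orchestrator and plex-worker have no networks key, so the fragment below simply attaches all three services to the same overlay network. Exactly which services were changed in the end is not stated above, so treat this as an illustration.

```yaml
services:
  plex:
    networks:
      - pirate
  plex-orchestrator:
    networks:
      - pirate          # missing in the posted stack
  plex-worker:
    networks:
      - pirate          # missing in the posted stack

networks:
  pirate:
    driver: overlay
```

With all services on one network, service names such as plex and plex-orchestrator should resolve from inside the worker, which is what the wget check above was testing.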