pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

[k8s/Helm] Workers break after pms image tag 1.40.2 #324

Closed: craigcabrey closed this issue 3 months ago

craigcabrey commented 3 months ago

Describe the bug

I'm trying to use the Helm chart to start up the cluster. Everything comes up except the worker pods. The worker pods use the PMS docker image with an init container. However, I don't see anything else that makes them distinct from a standard Plex docker container. Is this intended? It seems like Plex is trying to start but can't, so the pod health checks never become ready.

For what it's worth, this is an IPv6-first dual-stack cluster, so I usually have problems with software that hardcodes 0.0.0.0. The Plex container itself didn't require any modifications, though.

I cross-referenced against a known working config, but no dice: https://github.com/pabloromeo/clusterplex/issues/305

See my values.yaml below.

To Reproduce

Steps to reproduce the behavior:

  1. Install the Helm chart using the given values.
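
Roughly, something like this (just a sketch; the chart repo/name and the media-automation namespace are assumptions based on the rest of this thread):

# add the ClusterPlex Helm repo and install with the values.yaml shown under "Additional context"
helm repo add clusterplex https://pabloromeo.github.io/clusterplex/
helm repo update
helm install clusterplex clusterplex/clusterplex \
  --namespace media-automation \
  --values values.yaml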

Expected behavior

Worker pods come up.


Desktop:

k8s + rook ceph

Additional context

values.yaml:

global:
  # -- Configure the plex image that will be used for the PMS and Worker components
  # @default -- See below
  plexImage:
    # -- The image that will be used
    repository: linuxserver/plex

    # -- The image tag to use
    tag: latest

    # -- Defines when the image should be pulled. Options are Always (default), IfNotPresent, and Never
    imagePullPolicy: Always

  # -- The timezone configured for each pod
  timezone: Etc/UTC

  # -- The process group ID that the LinuxServer Plex container will run Plex/Worker as.
  PGID: 1000

  # -- The process user ID that the LinuxServer Plex container will run Plex/Worker as.
  PUID: 1000

  sharedStorage:
    # -- Configure the volume that will be mounted to the PMS and worker pods for a shared location for transcoding files.
    # @default -- See below
    transcode:
      # -- Enable or disable the transcode PVC. This should only be disabled if you are not using the workers.
      enabled: true

      # -- If you want to reuse an existing claim, the name of the existing PVC can be passed here.
      existingClaim: clusterplex-transcode

    # -- Configure the media volume that will contain all of your media. If you need more volumes you need to add them under
    # the pms and worker sections manually. Those volumes must already be present in the cluster.
    # @default -- See below
    media:
      # -- Enables or disables the volume
      enabled: true

      # -- If you want to reuse an existing claim, the name of the existing PVC can be passed here.
      existingClaim: media

# -- Configure the Plex Media Server component
# @default -- See below
pms:
  # -- Enable or disable the Plex Media Server component
  enabled: true

  env:
    FFMPEG_HWACCEL: vaapi

  # -- Supply the configuration items used to configure the PMS component
  # @default -- See below
  config:
    # -- Set this to 1 if you want only info logging from the transcoder or 0 if you want debugging logs
    transcoderVerbose: 1

    # -- Set the transcode operating mode. Valid options are local (No workers), remote (only remote workers), both (default, remote first then local if remote fails).
    # If you disable the worker then this will be set to local automatically, as that is the only valid option for that configuration.
    transcodeOperatingMode: both

    # -- The port that the relay service will listen on
    relayPort: 32499

  # -- Configure the ingress for plex here.
  # @default -- See below
  ingressConfig:
    # -- Enables or disables the ingress
    enabled: true

    # -- Provide additional annotations which may be required.
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod

    # -- Set the ingressClass that is used for this ingress.
    ingressClassName: internal

    ## Configure the hosts for the ingress
    hosts:
      - # -- Host address. Helm template can be passed.
        host: [snip]
        ## Configure the paths for the host
        paths:
          - # -- Path.  Helm template can be passed.
            path: /
            pathType: Prefix
            service:
              # -- Overrides the service name reference for this path
              name:
              # -- Overrides the service port reference for this path
              port:

    tls:
      - secretName: [snip]
        hosts:
          - [snip]

  # -- Configure the volume that stores all the Plex configuration and metadata
  # @default -- See below
  configVolume:
    # -- Enables or disables the volume
    enabled: true

    # -- If you want to reuse an existing claim, the name of the existing PVC can be passed here.
    existingClaim: clusterplex-config

  # -- Enable or disable the various health check probes for this component
  # @default -- See below
  healthProbes:
    # -- Enable or disable the startup probe
    startup: true

    # -- Enable or disable the readiness probe
    readiness: true

    # -- Enable or disable the liveness probe
    liveness: true

  # -- Configure the resource requests and limits for the PMS component
  # @default -- See below
  resources:
    requests:
      gpu.intel.com/i915: 1
      memory: 2Gi

    limits:
      gpu.intel.com/i915: 1
      memory: 4Gi

# -- Configure the orchestrator component
# @default -- See below
orchestrator:
  # -- Enable or disable the Orchestrator component
  enabled: true

  image:
    # -- image repository
    repository: ghcr.io/pabloromeo/clusterplex_orchestrator

    # -- image pull policy
    pullPolicy: IfNotPresent

  # -- Supply the configuration items used to configure the Orchestrator component
  # @default -- See below
  config:
    # -- The port that the Orchestrator will listen on
    port: 3500

    # -- Configures how the worker is chosen when a transcoding job is initiated.
    # Options are LOAD_CPU, LOAD_TASKS, RR, and LOAD_RANK (default).
    # [[ref]](https://github.com/pabloromeo/clusterplex/tree/master/docs#orchestrator)
    workerSelectionStrategy: LOAD_RANK

  # -- Configure the Kubernetes service associated with the Orchestrator component
  # @default -- See below
  serviceConfig:
    # -- Configure the type of service
    type: ClusterIP

    # -- Specify the externalTrafficPolicy for the service. Options: Cluster, Local
    # [[ref](https://kubernetes.io/docs/tutorials/services/source-ip/)]
    externalTrafficPolicy:

    # -- Provide additional annotations which may be required.
    annotations: {}

    # -- Provide additional labels which may be required.
    labels: {}

  # -- Configure a ServiceMonitor for use with Prometheus monitoring
  # @default -- See below
  prometheusServiceMonitor:
    # -- Enable the ServiceMonitor creation
    enabled: false

    # -- Provide additional annotations which may be required.
    annotations: {}

    # -- Provide additional labels which may be required.
    labels: {}

    # -- Provide a custom selector if desired. Note that this will take precedence over the default
    # method of using the orchestrator's namespace. This usually should not be required.
    customSelector: {}

    # -- Configure how often Prometheus should scrape this metrics endpoint in seconds
    scrapeInterval: 30s

    # -- Configure how long Prometheus should wait for the endpoint to reply before
    # considering the request to have timed out.
    scrapeTimeout: 10s

  # -- Configures whether the Grafana dashboard for the orchestrator component is deployed to the cluster.
  # If enabled, this creates a ConfigMap containing the dashboard JSON so that your Grafana instance can detect it.
  # This requires your Grafana instance to have grafana.sidecar.dashboards.enabled set to true and the searchNamespace
  # set to ALL, otherwise this will not be discovered.
  enableGrafanaDashboard: false

  # -- Enable or disable the various health check probes for this component
  # @default -- See below
  healthProbes:
    # -- Enable or disable the startup probe
    startup: true

    # -- Enable or disable the readiness probe
    readiness: true

    # -- Enable or disable the liveness probe
    liveness: true

  # -- Configure the resource requests and limits for the orchestrator component
  # @default -- See below
  resources:
    requests:
      # -- CPU Request amount
      cpu: 200m

      # -- Memory Request amount
      memory: 64Mi

    limits:
      # -- CPU Limit amount
      cpu: 500m

      # -- Memory Limit amount
      memory: 128Mi

# -- Configure the worker component
# @default -- See below
worker:
  # -- Enable or disable the Worker component
  enabled: true

  env:
    FFMPEG_HWACCEL: vaapi

  # -- Supply the configuration items used to configure the worker component
  # @default -- See below
  config:
    # -- The number of instances of the worker to run
    replicas: 2

    # -- The port the worker will expose its metrics on for the orchestrator to find
    port: 3501

    # -- The frequency at which workers send stats to the orchestrator in ms
    cpuStatInterval: 10000

    # -- Controls usage of the EasyAudioDecoder. 1 = ON (default) and 0 = OFF
    eaeSupport: 1

  # -- Enable or disable the per-pod volumes that cache the codecs. This saves a great deal of time when starting the workers.
  # @default -- See below
  codecVolumes:
    # -- Enable or disable the creation of the codec volumes
    enabled: true

    storageClass: local-path

  resources:
    requests:
      gpu.intel.com/i915: 1
      memory: 2Gi

    limits:
      gpu.intel.com/i915: 1
      memory: 4Gi

Worker logs:

[craigcabrey@tealboi clusterplex]$ k logs -n media-automation clusterplex-worker-0
Defaulted container "clusterplex-worker" out of: clusterplex-worker, set-codec-permissions (init), set-transcode-permissions (init)
[mod-init] Running Docker Modification Logic
[mod-init] Adding pabloromeo/clusterplex_worker_dockermod:1.4.12 to container
[mod-init] Downloading pabloromeo/clusterplex_worker_dockermod:1.4.12 from ghcr.io
[mod-init] Installing pabloromeo/clusterplex_worker_dockermod:1.4.12
[mod-init] pabloromeo/clusterplex_worker_dockermod:1.4.12 applied to container
[migrations] started
[migrations] no migrations found
───────────────────────────────────────

      ██╗     ███████╗██╗ ██████╗
      ██║     ██╔════╝██║██╔═══██╗
      ██║     ███████╗██║██║   ██║
      ██║     ╚════██║██║██║   ██║
      ███████╗███████║██║╚██████╔╝
      ╚══════╝╚══════╝╚═╝ ╚═════╝

   Brought to you by linuxserver.io
───────────────────────────────────────

To support LSIO projects visit:
https://www.linuxserver.io/donate/

───────────────────────────────────────
GID/UID
───────────────────────────────────────

User UID:    1000
User GID:    1000
───────────────────────────────────────

**** Server is unclaimed, but no claim token has been set ****
**** adding /dev/dri/card1 to video group irc with id 39 ****
**** adding /dev/dri/renderD128 to video group render with id 105 ****
Docker is used for versioning skip update check
[custom-init] No custom files found, skipping...
Starting Plex Media Server. . . (you can ignore the libusb_init error)
[ls.io-init] done.
Error in command line:the argument for option '--serverUuid' should follow immediately after the equal sign
Crash Uploader options:

Minidump Upload options:
  --directory arg        Directory to scan for crash reports
  --serverUuid arg       UUID of the server that crashed
  --platform arg         Platform string
  --platformVersion arg  Platform version string
  --vendor arg           Vendor string
  --device arg           Device string
  --model arg            Device model string
  --allowRetries arg     Whether we will allow retries

Session Health options:
  --sessionStatus arg    Seassion health status (exited, crashed, or abnormal)
  --sessionStart arg     Session start timestamp in UTC or epoch time
  --sessionDuration arg  Session duration in seconds

Common options:
  --userId arg           User that owns this product
  --version arg          Version of the product
  --sentryUrl arg        Sentry URL to upload to
  --sentryKey arg        Sentry Key for the project
Critical: libusb_init failed

Add any other context about the problem here.

pabloromeo commented 3 months ago

From the looks of it, the dockermod on the worker never actually began initializing. The logs should look quite different (it should install dependencies and also download all the codecs, for example), plus the actual plex binary should never start, since the dockermod replaces it with itself.

craigcabrey commented 3 months ago

So, should the health checks be disabled for the worker containers? That stood out to me as odd.

pabloromeo commented 3 months ago

Regarding healthchecks yes, healthchecks should be enabled.

But your pod is failing because the dockermod is not running during startup, so Plex is starting unchanged, which is not correct. It's odd because in the first few lines linuxserver says it downloaded and installed the dockermod, but nothing ran after that. Can you try uninstalling the helm chart and trying again? Maybe linuxserver's dockermod download was corrupted or something like that.
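
Something along these lines (just a sketch; the release name, chart name, and media-automation namespace are assumptions based on this thread):

# remove the release and reinstall with the same values to force a fresh dockermod download
helm uninstall clusterplex --namespace media-automation
helm install clusterplex clusterplex/clusterplex --namespace media-automation --values values.yaml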

Also (unrelated to this current bug), for hardware transcoding to work on workers you're also going to need additional network shares for "cache" and "drivers". The working values you linked show how to map them.

craigcabrey commented 3 months ago

plus the actual plex binary should never start

Regarding healthchecks yes, healthchecks should be enabled.

This doesn't add up for me. Does the dockermod run on the same ports as Plex and respond to the same health checks?

llajas commented 3 months ago

I believe something on the Plex side has changed considerably. @craigcabrey, try pinning your version to 1.40.2. I was able to get my workers to run at that version, but it seems things break afterwards.
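
In the chart's values.yaml that would look roughly like this (a sketch; the tag below is only a placeholder, substitute the full 1.40.2.x tag string published for linuxserver/plex):

global:
  plexImage:
    repository: linuxserver/plex
    # pin to the last known-good 1.40.2.x release instead of latest
    tag: "<full 1.40.2.x tag from linuxserver/plex>"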

craigcabrey commented 3 months ago

Yea, that seemed to have moved the needle, thanks! So, definitely broken after that version.

craigcabrey commented 3 months ago

Workers become healthy on their own without any modifications using that tag. @pabloromeo I'll leave it up to you if you want to keep this open to track post-1.40.2 issues.

pabloromeo commented 3 months ago

Was finally able to take a look at this. Indeed something had changed, but this time it wasn't Plex's fault, but rather the LinuxServer image. They have removed support for s6-overlay v2 and only allow v3 now. I'm rewriting the setup logic now, and hopefully later today or tomorrow I should release a version that works with the latest images from linuxserver.

imogen-ooo commented 3 months ago

Was finally able to take a look at this. Indeed something had changed, but this time it wasn't Plex's fault, but rather the LinuxServer image. They have removed support for s6-overlay v2 and only allow v3 now. I'm rewriting the setup logic now, and hopefully later today or tomorrow I should release a version that works with the latest images from linuxserver.

Would this affect both the worker and the PMS dockermods? I am seeing breaks in the main server as well; it boots up just like a standard PMS instance.

pabloromeo commented 3 months ago

Yes, the initialization code breaks for both workers and PMS. I'll get a new release out ASAP; it should cover both install options too: dockermod or custom docker images.

pabloromeo commented 3 months ago

I just released v1.4.13 of ClusterPlex, and the new Helm chart version 1.1.8 is also available here: https://pabloromeo.github.io/clusterplex/. It should fix this s6-overlay issue, and both PMS and Workers should start up as expected.
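
Upgrading an existing install should be enough to pick it up, roughly (a sketch; release name and namespace assumed from earlier in this thread):

helm repo update
helm upgrade clusterplex clusterplex/clusterplex --version 1.1.8 --namespace media-automation --values values.yaml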

craigcabrey commented 3 months ago

Can confirm the latest chart + image comes up healthy!