pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

No HW transcoding on K8s (talos) #303

Closed by evanrich 6 months ago

evanrich commented 6 months ago

Describe the bug
Doesn't appear to find GPUs

To Reproduce

1. Install the i915 GPU operator
2. Install this via Helm
3. Set FFMPEG_HWACCEL: vaapi

use the following helm values:

global:
  # -- Configure the plex image that will be used for the PMS and Worker components
  # @default -- See below
  plexImage:
    # -- The image tag to use
    tag: 1.40.2
    # -- Defines when the image should be pulled. Options are Always (default), IfNotPresent, and Never
    imagePullPolicy: IfNotPresent

  # -- The timezone configured for each pod
  timezone: America/Los_Angeles
  sharedStorage:
    # -- Configure the volume that will be mounted to the PMS and worker pods for a shared location for transcoding files.
    # @default -- See below
    transcode:
      # -- Enable or disable the transcode PVC. This should only be disabled if you are not using the workers.
      enabled: true
      storageClass: "-"
      existingClaim: plextranscode-pvc
      subPath: "clusterplex"
    media:
      # -- Enables or disables the volume
      enabled: true

      # -- Storage Class for the config volume.
      # If set to `-`, dynamic provisioning is disabled.
      # If set to something else, the given storageClass is used.
      # If undefined (the default) or set to null, no storageClassName spec is set, choosing the default provisioner.
      # NOTE: This class must support ReadWriteMany otherwise you will encounter errors.
      storageClass: "-"

      # -- If you want to reuse an existing claim, the name of the existing PVC can be passed here.
      existingClaim: plexmedia-pvc
# -- Configure the Plex Media Server component
# @default -- See below
pms:
  # -- Enable or disable the Plex Media Server component
  enabled: true

  # -- Additional environment variables. Template enabled.
  # Syntax options:
  # A) TZ: UTC
  # B) PASSWD: '{{ .Release.Name }}'
  # C) PASSWD:
  #      configMapKeyRef:
  #        name: config-map-name
  #        key: key-name
  # D) PASSWD:
  #      valueFrom:
  #        secretKeyRef:
  #          name: secret-name
  #          key: key-name
  #      ...
  # E) - name: TZ
  #      value: UTC
  # F) - name: TZ
  #      value: '{{ .Release.Name }}'
  env:

    PASSWD:
      valueFrom:
        secretKeyRef:
          name: plex-claim-token
          key: claim_token

  # -- Supply the configuration items used to configure the PMS component
  # @default -- See below
  config:
    # -- Set this to 1 if you want only info logging from the transcoder or 0 if you want debugging logs
    transcoderVerbose: 1

    # -- Set the transcode operating mode. Valid options are local (No workers), remote (only remote workers), both (default, remote first then local if remote fails).
    # If you disable the worker then this will be set to local automatically, as that is the only valid option for that configuration.
    transcodeOperatingMode: both

    # -- Set the Plex claim token obtained from https://plex.tv/claim
    plexClaimToken: 

    # -- Set the version of Plex to use. Valid options are docker, latest, public, or a specific version.
    # [[ref](https://github.com/linuxserver/docker-plex#application-setup)]
    version: docker

    # -- The port that Plex will listen on
    port: 32400

    # -- Enable or disable the local relay function. In most cases this should be left to the default (true).
    # If you disable this, you must add the pod IP address of each worker or the pod network CIDR to Plex under the
    # `List of IP addresses and networks that are allowed without auth` option in Plex's network configuration.
    localRelayEnabled: true

    # -- The port that the relay service will listen on
    relayPort: 32499

    # -- The IP address that plex is using. This is only utilized if you disable the localRelayEnabled option above.
    pmsIP: ""

  # -- Configure the Kubernetes service associated with the PMS component
  # @default -- See below
  serviceConfig:
    # Configure the type of service
    type: ClusterIP

    # -- Specify the externalTrafficPolicy for the service. Options: Cluster, Local
    # [[ref](https://kubernetes.io/docs/tutorials/services/source-ip/)]
    externalTrafficPolicy:

    # -- Provide additional annotations which may be required.
    annotations: {}

    # -- Provide additional labels which may be required.
    labels: {}

  # -- Configure the ingress for plex here.
  # @default -- See below
  ingressConfig:
    # -- Enables or disables the ingress
    enabled: false

    # -- Provide additional annotations which may be required.
    annotations:
      {}
      # kubernetes.io/ingress.class: nginx
      # kubernetes.io/tls-acme: "true"

    # -- Provide additional labels which may be required.
    labels: {}

    # -- Set the ingressClass that is used for this ingress.
    ingressClassName: # "nginx"

    ## Configure the hosts for the ingress
    hosts:
      - # -- Host address. Helm template can be passed.
        host: chart-example.local
        ## Configure the paths for the host
        paths:
          - # -- Path.  Helm template can be passed.
            path: /
            pathType: Prefix
            service:
              # -- Overrides the service name reference for this path
              name:
              # -- Overrides the service port reference for this path
              port:

    # -- Configure TLS for the ingress. Both secretName and hosts can process a Helm template.
    tls: []
    #  - secretName: chart-example-tls
    #    hosts:
    #      - chart-example.local

  # -- Configure the volume that stores all the Plex configuration and metadata
  # @default -- See below
  configVolume:
    # -- Enables or disables the volume
    enabled: true

    # -- Storage Class for the config volume.
    # If set to `-`, dynamic provisioning is disabled.
    # If set to something else, the given storageClass is used.
    # If undefined (the default) or set to null, no storageClassName spec is set, choosing the default provisioner.
    storageClass: ceph-block

    # -- If you want to reuse an existing claim, the name of the existing PVC can be passed here.
    existingClaim: # your-claim

    # -- Used in conjunction with `existingClaim`. Specifies a sub-path inside the referenced volume instead of its root
    subPath: # some-subpath

    # -- AccessMode for the persistent volume.
    # Make sure to select an access mode that is supported by your storage provider!
    # [[ref]](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
    accessMode: ReadWriteOnce

    # -- The amount of storage that is requested for the persistent volume.
    size: 250Gi

    # -- Set to true to retain the PVC upon `helm uninstall`
    retain: true

  # -- Enable or disable the various health check probes for this component
  # @default -- See below
  healthProbes:
    # -- Enable or disable the startup probe
    startup: true

    # -- Enable or disable the readiness probe
    readiness: true

    # -- Enable or disable the liveness probe
    liveness: true

  # -- Configure the resource requests and limits for the PMS component
  # @default -- See below
  resources:
    requests:
      # -- CPU Request amount
      cpu: 2000m

      # Memory Request Amount
      memory: 2Gi

    limits:
      # -- CPU Limit amount
      cpu: 4000m

      # -- Memory Limit amount
      memory: 4Gi

worker:
  env:
    FFMPEG_HWACCEL: vaapi
  securityContext:
    privileged: true
  resources:
    requests:
      cpu: 2
      memory: 3Gi
      gpu.intel.com/i915: "1"
    limits:
      cpu: 4
      memory: 4Gi
      gpu.intel.com/i915: "1" 

Expected behavior
It works

Screenshots

[AVHWDeviceContext @ 0x7f21341a4980] Trying to use DRM render node for device 0.
[AVHWDeviceContext @ 0x7f21341a4980] libva: VA-API version 1.18.0
[AVHWDeviceContext @ 0x7f21341a4980] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so
[AVHWDeviceContext @ 0x7f21341a4980] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7f21341a4980] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/i965_drv_video.so
[AVHWDeviceContext @ 0x7f21341a4980] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7f21341a4980] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
[hevc @ 0x7f2131e05b00] No device available for decoder: device type vaapi needed for codec hevc.


Additional context

kubectl get nodes -o=jsonpath="{range .items[*]}{.metadata.name}{'\n'}{' i915: '}{.status.allocatable.gpu\.intel\.com/i915}{'\n'}"
k8s-worker-1
 i915: 1
node1
 i915: 1
node2
 i915: 1
node3
 i915: 1

That proves the GPUs exist. Someone in the issues section mentioned getting i915 working on k8s and said he finally got it to work, but I'm not sure how; even adding securityContext.privileged = true didn't help.
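One way to sanity-check whether the device plugin actually exposed the GPU to the worker container is to look for the DRM render node inside it. This is only a sketch; the pod name below is a placeholder, not the real name from this deployment:

```shell
# Placeholder pod name -- substitute the actual worker pod from `kubectl get pods -n clusterplex`.
kubectl exec -n clusterplex clusterplex-worker-0 -- ls -l /dev/dri
# If the i915 device plugin allocated a GPU, entries like card0 and renderD128 should appear.
# FFMPEG_HWACCEL=vaapi cannot work if no render node is mounted in the container.
```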

evanrich commented 6 months ago

I made some progress, but only locally.

I added the following to the "pms" values:

  securityContext:
    privileged: true
  env:
    FFMPEG_HWACCEL: vaapi

I can now see it working in Plex. However, remote workers still show the following:

[AVHWDeviceContext @ 0x7fedcbd14a80] libva: VA-API version 1.18.0
[AVHWDeviceContext @ 0x7fedcbd14a80] libva: User requested driver 'iHD'
[AVHWDeviceContext @ 0x7fedcbd14a80] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so
[AVHWDeviceContext @ 0x7fedcbd14a80] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7fedcbd14a80] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
Failed to set value 'vaapi=vaapi:/dev/dri/renderD128,driver=iHD' for option 'init_hw_device': I/O error
Error parsing global options: I/O error
Transcoder exit: child process exited with code 1

and the main plex container shows the following:

Killing child transcoder
Distributed transcoder failed, calling local

evanrich commented 6 months ago

More info: looking at this line:

libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so

i tried to see if that existed:

/config/Library/Application Support# ls -la
total 0
drwxr-xr-x 2 abc abc  6 Apr 27 02:49 .
drwxr-xr-x 3 abc abc 33 Apr 27 02:49 ..

Nothing there, so it seems like the workers aren't downloading the drivers?
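A quick way to compare what PMS downloaded against what a worker has is to list both pods' driver directories side by side. Pod names here are placeholders:

```shell
# Placeholder pod names -- substitute your actual PMS and worker pods.
APPDIR='/config/Library/Application Support/Plex Media Server'
kubectl exec -n clusterplex clusterplex-pms-0 -- ls "$APPDIR/Drivers" "$APPDIR/Cache"
kubectl exec -n clusterplex clusterplex-worker-0 -- ls "$APPDIR/Drivers" "$APPDIR/Cache"
# If the worker's directories are empty while the PMS pod's are populated,
# the VA-API driver was never fetched on the worker, matching the
# "va_openDriver() returns -1" errors above.
```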

evanrich commented 6 months ago

I think I got it working:

[AVHWDeviceContext @ 0x7f3ebed8fa80] libva: VA-API version 1.18.0
[AVHWDeviceContext @ 0x7f3ebed8fa80] libva: User requested driver 'iHD'
[AVHWDeviceContext @ 0x7f3ebed8fa80] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so
[AVHWDeviceContext @ 0x7f3ebed8fa80] libva: Found init function __vaDriverInit_1_18
[AVHWDeviceContext @ 0x7f3ebed8fa80] libva: va_openDriver() returns 0
[AVHWDeviceContext @ 0x7f3ebed8fa80] Initialised VAAPI connection: version 1.18
[AVHWDeviceContext @ 0x7f3ebed8fa80] VAAPI driver: Intel iHD driver for Intel(R) Gen Graphics - 23.1.6 (418a0ffb).
[AVHWDeviceContext @ 0x7f3ebed8fa80] Driver not found in known nonstandard list, using standard behaviour.

To fix this, I had to download the driver file from the PMS pod using

kubectl cp -n clusterplex clusterplex-pms-7c4954ddb5-lb2d9:/config/Library/Application\ Support/Plex\ Media\ Server/Drivers/imd-115-linux-x86_64/dri/iHD_drv_video.so ./iHD_drv_video.so

and then upload this into the worker using

kubectl cp -n clusterplex ./iHD_drv_video.so clusterplex-worker-1:/config/Library/Application\ Support/Plex\ Media\ Server/Drivers/imd-115-linux-x86_64/dri/iHD_drv_video.so

I then had to symlink this inside the container using

ln -s '/config/Library/Application Support/Plex Media Server/Drivers/imd-115-linux-x86_64/dri/iHD_drv_video.so' iHD_drv_video.so

within the /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64 folder
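The three manual steps above (copy out of PMS, copy into the worker, symlink into the Cache directory) can be scripted for all workers at once. This is only a sketch of that workaround, not a claim that it resolves the issue: the worker label selector is an assumption, and the driver directory name is taken from the paths above, so verify both on your own deployment:

```shell
#!/bin/sh
set -e
NS=clusterplex
APPDIR='/config/Library/Application Support/Plex Media Server'
DRVDIR='imd-115-linux-x86_64'               # taken from the paths above; verify on your PMS pod
PMS_POD=clusterplex-pms-7c4954ddb5-lb2d9    # substitute your PMS pod name

# 1) Pull the driver out of the PMS pod.
kubectl cp -n "$NS" "$PMS_POD:$APPDIR/Drivers/$DRVDIR/dri/iHD_drv_video.so" ./iHD_drv_video.so

# 2) Push it into every worker and symlink it where libva looks for it.
#    The label selector is an assumption; check with `kubectl get pods --show-labels`.
for pod in $(kubectl get pods -n "$NS" -l app.kubernetes.io/name=clusterplex-worker -o name); do
  pod=${pod#pod/}
  kubectl exec -n "$NS" "$pod" -- mkdir -p \
    "$APPDIR/Drivers/$DRVDIR/dri" "$APPDIR/Cache/va-dri-linux-x86_64"
  kubectl cp -n "$NS" ./iHD_drv_video.so "$pod:$APPDIR/Drivers/$DRVDIR/dri/iHD_drv_video.so"
  kubectl exec -n "$NS" "$pod" -- ln -sf \
    "$APPDIR/Drivers/$DRVDIR/dri/iHD_drv_video.so" \
    "$APPDIR/Cache/va-dri-linux-x86_64/iHD_drv_video.so"
done
```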

The question now is, why is the worker not getting this driver/cache folder created?

Edit: Nope, still says "Distributed transcoder failed, calling local"


[AVIOContext @ 0x7fcb2906bd00] Statistics: 0 bytes read, 0 seeks
[tcp @ 0x7fcb2b5e6f00] Starting connection attempt to 10.104.182.163 port 32499
[tcp @ 0x7fcb2b5e6f00] Successfully connected to 10.104.182.163 port 32499
[eac3_eae @ 0x7fcb275a7280] EAE watchfolder is not writable: /tmp/pms-75dc717a-73a3-4046-8fe2-f8d477997038/EasyAudioEncoder/Convert to WAV (to 8ch or less)/4t0y3a8tcfr4n8qhftifpklq_20206-0-test.tmp
Stream mapping:
  Stream #0:1 (eac3_eae) -> aresample:default
  Stream #0:0 -> #0:0 (copy)
  aresample:default -> Stream #0:1 (aac)
  Stream #0:2 -> #1:0 (subrip (native) -> ass (native))
Error while opening decoder for input stream #0:1 : Generic error in an external library
[AVIOContext @ 0x7fcb2a2dc380] Statistics: 550768 bytes read, 5 seeks
Transcoder exit: child process exited with code 1
Completed transcode
Removing process from taskMap
Transcoder close: child process exited with code 1

evanrich commented 6 months ago

The solution that audiophonicz posted in https://github.com/pabloromeo/clusterplex/issues/223 works for me.