ressu / kube-plex

Scalable Plex Media Server on Kubernetes -- dispatch transcode jobs as pods on your cluster!
Apache License 2.0

Plex Transcoder fails with Protected process returned an error #43

Closed rkbennett closed 1 year ago

rkbennett commented 1 year ago

Kubernetes: v1.27.2
Plex version: plexinc/pms-docker:1.32.3.7192-7aa441827

When attempting to stream a movie, plex fails with "Protected process returned an error: exit status 1"

This is the only log I can see that is related to kube-plex. Even manually executing the binary from inside the container results in this error with no other output. I've tried both the ghcr.io/ressu/kube-plex:latest image and a build of this repo's main branch, and both result in the same error.

I am using SMB-based PVCs for config, data, and transcode, though I'm not sure that's really relevant since it's not even getting that far.

I forgot to add that no jobs or transcode pods are ever generated.

ressu commented 1 year ago

Could you include your configuration here? That'll make it easier to understand what could be happening. Feel free to remove any passwords and change addresses if there are any in your configuration.

rkbennett commented 1 year ago

You're talking about the Kubernetes manifests?

ressu commented 1 year ago

Depending on how you configured kube-plex, either the manifests or the Helm configuration. Either one works.

rkbennett commented 1 year ago
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: plex
  namespace: plex
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: plex-smb
provisioner: smb.csi.k8s.io
parameters:
  source: "//nas/plex-content"
  csi.storage.k8s.io/provisioner-secret-name: "smbcreds"
  csi.storage.k8s.io/provisioner-secret-namespace: "kube-system"
  csi.storage.k8s.io/node-stage-secret-name: "smbcreds"
  csi.storage.k8s.io/node-stage-secret-namespace: "kube-system"
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - dir_mode=0555
  - file_mode=0444
  - uid=0
  - gid=0
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-smb
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: plex-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nas/plex-content"
    volumeAttributes:
      source: "//nas/plex-content"
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-smb-config
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: common-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nams/k8s/plex"
    volumeAttributes:
      source: "//nas/k8s"
      subDir: plex
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-transcode
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: common-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nas/k8s/transcode"
    volumeAttributes:
      source: "//nas/k8s"
      subDir: plex
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
# Source: kube-plex/templates/volumes.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-smb
  namespace: plex
  labels:
    app: kube-plex
spec:
  accessModes:
  - "ReadOnlyMany"
  resources:
    requests:
      storage: "20Gi"
  volumeName: plex-smb
  storageClassName: plex-smb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-smb-config
  namespace: plex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: plex-smb-config
  storageClassName: common-smb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-transcode
  namespace: plex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: plex-transcode
  storageClassName: common-smb
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - pods/attach
  - pods/exec
  - pods/portforward
  - pods/proxy
  verbs:
  - create
  - delete
  - deletecollection
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - create
  - delete
  - deletecollection
  - get
  - list
  - patch
  - update
  - watch
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
    chart: kube-plex-0.2.7
    release: plex
    heritage: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: plex
subjects:
- kind: ServiceAccount
  name: plex
  namespace: "plex"
---
# Source: kube-plex/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
spec:
  type: ClusterIP
  ports:
    - name: pms
      port: 80
      protocol: TCP
      targetPort: 32400
    - name: pms-internal
      port: 32400
      protocol: TCP
      targetPort: 32400
    - name: https
      port: 443
      targetPort: 32443
  selector:
    app: kube-plex
---
# Source: kube-plex/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: plex-ingress
  namespace: plex
  labels:
    app: kube-plex
  annotations:
    cert-manager.io/issuer: common-issuer
    cert-manager.io/issuer-kind: ClusterIssuer
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
    nginx.ingress.kubernetes.io/proxy-read-timeout: '36000'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy_http_version: '1.1'
    nginx.ingress.kubernetes.io/proxy_set_header: Upgrade
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - host: plex.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: plex
                port:
                  number: 32400
  tls:
  - hosts:
    - plex.local
    secretName: tls-plex-ingress
---
# Source: kube-plex/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: kube-plex
  template:
    metadata:
      labels:
        app: kube-plex
      annotations:
        kube-plex/container-name: "kube-plex-init"
        kube-plex/pms-addr: plex:32400
        kube-plex/pms-container-name: "plex"
    spec:
      serviceAccountName: plex
      hostname: "plex"
      hostNetwork: true
      initContainers:
      - name: kube-plex-install
        image: "ghcr.io/ressu/kube-plex:latest"
        imagePullPolicy: Always
        command:
        - cp
        - /kube-plex
        - /shared/kube-plex
        volumeMounts:
        - name: shared
          mountPath: /shared
      containers:
      - name: plex
        image: "plexinc/pms-docker:1.32.3.7192-7aa441827"
        imagePullPolicy: IfNotPresent
        # We replace the PMS binary with a postStart hook to save having to
        # modify the default image entrypoint.
        lifecycle:
          postStart:
            exec:
              command:
              - bash
              - -c
              - |
                #!/bin/bash
                set -e
                mv '/usr/lib/plexmediaserver/Plex Transcoder' '/usr/lib/plexmediaserver/Plex Transcoder.orig'
                cp /shared/kube-plex '/usr/lib/plexmediaserver/Plex Transcoder'
        livenessProbe:
          tcpSocket:
            port: 32400
          initialDelaySeconds: 0
          failureThreshold: 3
          timeoutSeconds: 1
          periodSeconds: 10
        readinessProbe:
          tcpSocket:
            port: 32400
          initialDelaySeconds: 0
          failureThreshold: 3
          timeoutSeconds: 1
          periodSeconds: 10
        startupProbe:
          tcpSocket:
            port: 32400
          initialDelaySeconds: 0
          failureThreshold: 30
          timeoutSeconds: 1
          periodSeconds: 5
        env:
        - name: TZ
          value: "Etc/UTC"
        # TODO: move this to a secret?
        - name: PLEX_CLAIM
          valueFrom:
            secretKeyRef:
              name: plex-claim
              key: claim
        # kube-plex env vars
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ADVERTISE_IP
          value: |
            https://plex.local:443
            ,https://plex.local
        volumeMounts:
        - name: data
          mountPath: /data
        - name: config
          mountPath: /config
        - name: transcode
          mountPath: /transcode
        - name: shared
          mountPath: /shared
        resources:
          {}
      nodeSelector:
        kubernetes.io/arch: amd64
        compatibility: plex
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: "plex-smb"
      - name: config
        persistentVolumeClaim:
          claimName: "plex-smb-config"
      - name: transcode
        persistentVolumeClaim:
          claimName: "plex-transcode"
      - name: shared
        emptyDir: {}
---
###Add Secret!!!
###
###kubectl create secret generic plex-claim --from-literal=claim="<Insert Secret Here>" -n plex
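
For reference, a sketch of the manifest equivalent of that kubectl command (the claim value stays a placeholder, as above); the Secret name and key match what the deployment's PLEX_CLAIM env var expects:

apiVersion: v1
kind: Secret
metadata:
  name: plex-claim
  namespace: plex
type: Opaque
stringData:
  claim: "<Insert Secret Here>"
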
rkbennett commented 1 year ago

Sorry, it's a bit extra to read, but I included all my external PV/PVC and some StorageClass definitions in it for easier management.

ressu commented 1 year ago

That's actually perfect. It allows me to read through everything I need.

On first look everything looks exactly as it's supposed to. If the jobs don't get created, the only thing that comes to mind is that the transcode process is having issues connecting to the Kubernetes API. Which Kubernetes version are you using?

rkbennett commented 1 year ago

1.27.2

rkbennett commented 1 year ago

Is there a way to see the error message from the child process on the recursive call of kube-plex?

ressu commented 1 year ago

kube-plex tries to write to the Plex log (which can be seen from the Plex admin console), but if there is nothing in that log then the process dies before anything gets logged. The process should start logging early, so there should be something in the log regardless of the reason why kube-plex fails.

I also checked that there aren't any immediately obvious API changes in Kubernetes either (I'm still running 1.26.x), so that shouldn't be the case either.

rkbennett commented 1 year ago

Okay, I'll scale it down, nuke the probes, and try again; I may have just lost it in the sea of logs. Do you know if it would show up as a warning or an error in the console, or just as debug?

ressu commented 1 year ago

You should be able to find it by searching for Kube or KubePlex. The level depends on the configured log level, so I would expect any errors to show up in the error log.

rkbennett commented 1 year ago

Okay, I figured out the initial issue: I had the init container name typoed in the annotation in my live config. Once I fixed that, it started kicking off jobs and pods, but the streams still won't start. Now I see this error in the console: [Req#745b] [KubePlex] "level"=0 "msg"="Context terminated with error:%!(EXTRA *errors.errorString=context canceled)\n"
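
For reference, the pairing that has to line up is the kube-plex/container-name annotation and the init container's name (in the manifest above they read "kube-plex-init" and kube-plex-install respectively). A minimal sketch, keeping the init container name from that manifest; either side can be renamed as long as they agree:

  template:
    metadata:
      annotations:
        kube-plex/container-name: "kube-plex-install"   # must match the init container name below
        kube-plex/pms-addr: plex:32400
        kube-plex/pms-container-name: "plex"
    spec:
      initContainers:
      - name: kube-plex-install                         # the container that copies the kube-plex binary
        image: "ghcr.io/ressu/kube-plex:latest"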

rkbennett commented 1 year ago

Interestingly, once this error happens, it no longer spawns jobs/pods when I attempt subsequent streams.

rkbennett commented 1 year ago

Actually, disregard that last one; it just took a long time to re-attempt. It still always exits with that context error before streaming.

rkbennett commented 1 year ago

So after some more testing I have a few more behaviors to add; maybe they'll help decode the issue.
- I'm using hostNetwork: true on the deployment; curious if that has been tested successfully.
- When I turn off hostNetwork, it hangs for a while when attempting to launch a stream and then fails with a network error in the UI (same context error in the console).
- Once the context error happens, I have to refresh the page to get kube-plex to spawn a new job/pod when a stream is attempted; simply retrying to play the stream never spawns an elastic transcode job.

ressu commented 1 year ago

I haven't tested kube-plex with hostNetwork: true. An immediate issue I can think of is that while the main deployment would respect hostNetwork, the transcoding jobs ignore the setting completely. Depending on how the cluster has been set up, that might also mean that the plex Service doesn't have connectivity to the Plex instance.

I'm guessing hostNetwork is on for local network discovery?

That all being said, if the networking works through the service resource I don't see any other reasons why this might be an issue.

Tracing through the code, the only way the error Context terminated with error:%!(EXTRA *errors.errorString=context canceled) can be reached is when Plex kills the transcode process. The wrapper is waiting for the transcode job to do its thing. I was going to propose checking for logs from this code, but it seems I never wrote logging for successful requests, in an attempt to minimise noise:

https://github.com/ressu/kube-plex/blob/8ae4e8a8c58b3ce242a990fdd6053524b6dff6ef/cmd/kube-plex/codecserve.go#L19

But I think we might be narrowing down on the issue.

rkbennett commented 1 year ago

Yeah, I use hostNetwork for a couple of reasons: one is for DLNA, the other is that if you don't, all traffic is presumed to be remote due to the ingress adding in an "unrecognized domain". I tried to fix this with the Custom access URL setting in the admin menu, but it kept doing it. For some reason hostNetwork was the only thing that would fix it. I was trying to understand exactly how the traffic flows in this example, but I assume that your browser/app still talks to the Plex pod and the only traffic to/from the transcoder is between the Plex pod and the elastic pods.

On a side note, do you think the codecPackage is causing the context exits?

ressu commented 1 year ago

Yeah, I use hostNetwork for a couple of reasons: one is for DLNA, the other is that if you don't, all traffic is presumed to be remote due to the ingress adding in an "unrecognized domain".

Makes sense, this is due to the network detection. You should be able to fix this by adding the ingress pod IP range to your trusted networks in Plex settings.

For some reason hostNetwork was the only thing that would fix it. I was trying to understand exactly how the traffic flows in this example, but I assume that your browser/app still talks to the plex pod

When using Ingress, your browser talks to the ingress pod, which then opens a new connection to your Plex pod. This means that the caller IP address is going to be the IP of your ingress pod. Plex relies on the caller IP address to decide whether the connection is local or not. So when using Ingress, as long as the Ingress is considered local, the incoming connection is considered local.

and the only traffic to/from the transcoder is between the plex pod and the elastic pods.

It's actually reversed: the transcoder pods talk back to the Plex pod. Plex assumes that the transcoder is running locally and reporting back through its HTTP interface. kube-plex tricks the transcoder into talking to a supplicant, which then forwards the traffic back to Plex via the service definition.
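
To make that path concrete against the manifest above: the kube-plex/pms-addr annotation names the plex Service and port that the transcoder pods report back through, so (assuming cluster DNS) they reach Plex at plex:32400. A commented excerpt:

  # Pod template annotation: where the transcoders report back to
  kube-plex/pms-addr: plex:32400

  # Service "plex": forwards that port on to the Plex container
  - name: pms-internal
    port: 32400
    targetPort: 32400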

On a side note, do you think the codecPackage is causing the context exits?

It's not likely that codecPackage is the origin of the issue. There is enough logging there that any error cases should be logged properly. I wanted to use the codec download to see if the transcoder pod is able to connect to the Plex pod at all. The kube-plex wrapper seems to be running as intended, but the transcoder pod is having issues either starting or talking back to the Plex pod.

rkbennett commented 1 year ago

Okay, I found a new error in the logs: [Req#1147] [KubePlex] "msg"="failed to create tar package: writing to tar failed: write tcp 192.168.2.212:33985->10.244.3.25:41218: write: broken pipe\n" "error"=null. Does that one mean anything?

ressu commented 1 year ago

That would mean that the download is trying to start but cuts off mid-download. Which is a bit odd, as the tar gets loaded into memory and we're talking about a few megabytes of data, which shouldn't be an issue. Can you see if you can find any of the log lines from here in the logs?

https://github.com/ressu/kube-plex/blob/48f44b0e73c0316b0c2ca95d5e07d880f4913f4e/cmd/transcode-launcher/main.go#L37-L38

Those should be output right before the codec download begins.

rkbennett commented 1 year ago

Yessir, it makes it down to "Transcode begins..." before exiting on the context error.

rkbennett commented 1 year ago

That error I previously posted only happens when the Plex pod is on hostNetwork.

ressu commented 1 year ago

Hmm, that would mean that the issue is in the Plex native transcoder then. That's where kube-plex hands control back to the Plex native processes. Let's try to crank the log level to max: try adding the kube-plex/loglevel: debug annotation to your deployment. That should pass the value through to the transcoder process and log a lot more. If I remember correctly, the logs are listed as Transcoder or something similar. Maybe those logs give us the last hint on what is going wrong.
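
A sketch of where that would go, assuming it is read from the same place as the other kube-plex/* annotations, i.e. the pod template metadata of the deployment above:

  template:
    metadata:
      annotations:
        kube-plex/loglevel: "debug"                     # more verbose logging from the kube-plex wrapper and launcher
        kube-plex/container-name: "kube-plex-install"
        kube-plex/pms-addr: plex:32400
        kube-plex/pms-container-name: "plex"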

rkbennett commented 1 year ago

Okay, I'll kick off another one and filter for transcoder.

rkbennett commented 1 year ago

So with that, this is the one that sticks out to me (keep in mind I disabled hostNetwork for these tests): Completed: [10.244.2.0:43628] 404 GET /video/:/transcode/universal/session/oxar7lggbpmbfkt8o7zul4hu/1/header (19 live) #7d1 GZIP 142570ms 379 bytes (pipelined: 1)

ressu commented 1 year ago

I wonder if that is your browser making that call. It looks pretty similar to what the transcoder would post to the Plex process, but the fact that this is a GET request makes me think of a browser rather than the transcoder.

Let's take a closer look at the transcoder command. There should be a log line along the lines of Transcode requested with command (feel free to sanitize it before pasting).

rkbennett commented 1 year ago

[Req#6f5] [KubePlexProxy] "level"=0 "msg"="Transcode requested with command /usr/lib/plexmediaserver/Plex Transcoder, args = [-loglevel debug -loglevel_plex debug -codec:0 h264 -codec:1 dca -analyzeduration 20000000 -probesize 20000000 -i /data/movies/Push.mkv -filter_complex [0:0]scale=w=720:h=404:force_divisible_by=4[0];[0]format=pix_fmts=yuv420p|nv12[1] -map [1] -metadata:s:0 language=eng -codec:0 libx264 -crf:0 21 -maxrate:0 1724k -bufsize:0 3448k -r:0 23.975999999999999 -preset:0 veryfast -x264opts:0 subme=2:me_range=4:rc_lookahead=20:me=hex -force_key_frames:0 expr:gte(t,n_forced*3) -filter_complex [0:1] aresample=async=1:ochl='stereo':rematrix_maxval=0.000000dB:osr=48000[2] -map [2] -metadata:s:1 language=eng -codec:1 aac -b:1 162k -f dash -seg_duration 3 -dash_segment_type mp4 -init_seg_name init-stream$RepresentationID$.m4s -media_seg_name chunk-stream$RepresentationID$-$Number%05d$.m4s -window_size 5 -delete_removed false -skip_to_segment 1 -time_delta 0.0625 -manifest_name http://127.0.0.1:32400/video/:/transcode/session/oxar7lggbpmbfkt8o7zul4hu/c2752f3e-e00e-4153-9c01-7ed58cabc2af/manifest?X-Plex-Http-Pipeline=infinite -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 dash -start_at_zero -copyts -vsync cfr -y -nostats -loglevel quiet -loglevel_plex error -progressurl http://127.0.0.1:32400/video/:/transcode/session/oxar7lggbpmbfkt8o7zul4hu/c2752f3e-e00e-4153-9c01-7ed58cabc2af/progress]\n"

ressu commented 1 year ago

Ok, I'm stumped. That looks exactly as it is supposed to. So we know that the communication between the main pod and the transcoder works (we can see it in the logs). We know that the Kubernetes scheduling is working as intended, since the launcher starts. So I don't see any reason why this wouldn't work.

Does the transcode pod create anything in the transcode share? Usually it creates some temporary files there very early in the transcode process.

rkbennett commented 1 year ago

I see it create things up to the session directory, but never anything inside it. For example, this folder gets created: /transcode/Transcode/Sessions/plex-transcode-y1fdnrjbltsky5swks0lts32-99c520c8-4cf7-4209-966c-ec77432a1876/ but it remains empty.

rkbennett commented 1 year ago

I can see a "WebSocket: client initiated close"; not sure if that's related.

rkbennett commented 1 year ago

I was finally able to get a couple of things to transcode, but the delay is so long that it only works in the web browser. How long is the normal delay before a transcode begins?

ressu commented 1 year ago

The common transcode delay for me is 1-2 seconds, which I've usually chalked up to a slow system. And I haven't had issues with TV and other clients.

Do you have any idea what it was that finally got things moving? Maybe there is something we can improve in either the documentation or the code.

rkbennett commented 1 year ago

Yeah, so I definitely couldn't use hostNetwork, because then Plex couldn't communicate with the elastic transcoders. I think once I cleaned that up and put a port forward in my firewall it started working, since UPnP won't auto-configure (it's a broadcast thing, so at least it won't propagate outside the network namespace Plex is running in). As for the delay, I'm getting 45+ seconds, and it only appears to work on TV shows; when I try movies it just hangs too long. Not sure where the hang-up is. I'm definitely willing to research it more, but I'm having trouble finding any good documentation on how the interaction between the primary Plex process and the transcode processes normally works.

ressu commented 1 year ago

45 seconds is definitely long. I've been trying to think of ways to speed up transcoder startup. One option I can think of is pre-creating the transcoders and using an RPC call to trigger the transcode on the other end. That way there is no delay from creating the job, and it would also allow the pods to stay around longer in case we need to troubleshoot their contents for any reason.

rkbennett commented 1 year ago

Yeah, I had a similar thought about pre-staging. I want to do some more research on the Plex container first, though; it kinda sucks that they don't have any documentation on their transcode binary, just whatever is in the help output.

rkbennett commented 1 year ago

I think I may have found another problem I was having: SQLite over SMB is allegedly super slow/unstable. I haven't had a chance to test a block-level StorageClass yet. What type of storage class do you use for the /config dir?

ressu commented 1 year ago

I'm using iSCSI storage from my NAS as the storage for config. I know that there are some users using SMB for their storage without major issues. NFS has serious issues though, so I wouldn't recommend using NFS as storage.
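
For comparison, a minimal sketch of a block-backed claim for /config; the storage class name nas-iscsi is hypothetical and stands in for whatever iSCSI/block CSI class the NAS exposes:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-config-block
  namespace: plex
spec:
  accessModes:
    - ReadWriteOnce              # block-level volumes are typically single-node
  storageClassName: nas-iscsi    # hypothetical iSCSI-backed class
  resources:
    requests:
      storage: 10Gi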

rkbennett commented 1 year ago

After switching to a block-storage-based PVC, Plex seems dramatically faster. I'll retest the elastic transcoders tomorrow, but I assume that will have solved the core issue I was having.

rkbennett commented 1 year ago

So I retested with the new PVC and it still takes a little more than a minute between the initial play request and when the stream starts. Since this is getting pretty far out of scope of my initial issue, I'm going to close this as solved, with the takeaway that you get the initial error if kube-plex can't find a matching init container name. I'll open a new ticket for the latency issues.