Could you include your configuration here, that'll make it easier to understand what could be happening. Feel free to remove any passwords and change addresses if there are any in your configuration.
You're talking about the kubernetes manifests?
Depending on how you configured your kube-plex, either the manifests or the Helm configuration. Either one works.
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: plex
  namespace: plex
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: plex-smb
provisioner: smb.csi.k8s.io
parameters:
  source: "//nas/plex-content"
  csi.storage.k8s.io/provisioner-secret-name: "smbcreds"
  csi.storage.k8s.io/provisioner-secret-namespace: "kube-system"
  csi.storage.k8s.io/node-stage-secret-name: "smbcreds"
  csi.storage.k8s.io/node-stage-secret-namespace: "kube-system"
reclaimPolicy: Retain
volumeBindingMode: Immediate
mountOptions:
  - dir_mode=0555
  - file_mode=0444
  - uid=0
  - gid=0
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-smb
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: plex-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nas/plex-content"
    volumeAttributes:
      source: "//nas/plex-content"
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-smb-config
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: common-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nams/k8s/plex"
    volumeAttributes:
      source: "//nas/k8s"
      subDir: plex
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-transcode
  annotations:
    pv.kubernetes.io/provisioned-by: smb.csi.k8s.io
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: common-smb
  mountOptions:
    - dir_mode=0777
    - file_mode=0777
    - uid=1000
    - gid=1000
  csi:
    driver: smb.csi.k8s.io
    readOnly: false
    volumeHandle: "nas/k8s/transcode"
    volumeAttributes:
      source: "//nas/k8s"
      subDir: plex
    nodeStageSecretRef:
      name: "smbcreds"
      namespace: kube-system
---
# Source: kube-plex/templates/volumes.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-smb
  namespace: plex
  labels:
    app: kube-plex
spec:
  accessModes:
    - "ReadOnlyMany"
  resources:
    requests:
      storage: "20Gi"
  volumeName: plex-smb
  storageClassName: plex-smb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-smb-config
  namespace: plex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: plex-smb-config
  storageClassName: common-smb
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-transcode
  namespace: plex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeName: plex-transcode
  storageClassName: common-smb
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - pods/attach
      - pods/exec
      - pods/portforward
      - pods/proxy
    verbs:
      - create
      - delete
      - deletecollection
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - batch
    resources:
      - jobs
    verbs:
      - create
      - delete
      - deletecollection
      - get
      - list
      - patch
      - update
      - watch
---
# Source: kube-plex/templates/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
    chart: kube-plex-0.2.7
    release: plex
    heritage: Helm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: plex
subjects:
  - kind: ServiceAccount
    name: plex
    namespace: "plex"
---
# Source: kube-plex/templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
spec:
  type: ClusterIP
  ports:
    - name: pms
      port: 80
      protocol: TCP
      targetPort: 32400
    - name: pms-internal
      port: 32400
      protocol: TCP
      targetPort: 32400
    - name: https
      port: 443
      targetPort: 32443
  selector:
    app: kube-plex
---
# Source: kube-plex/templates/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: plex-ingress
  namespace: plex
  labels:
    app: kube-plex
  annotations:
    cert-manager.io/issuer: common-issuer
    cert-manager.io/issuer-kind: ClusterIssuer
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
    nginx.ingress.kubernetes.io/proxy-read-timeout: '36000'
    nginx.ingress.kubernetes.io/proxy-send-timeout: '1800'
    nginx.ingress.kubernetes.io/proxy_http_version: '1.1'
    nginx.ingress.kubernetes.io/proxy_set_header: Upgrade
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - host: plex.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: plex
                port:
                  number: 32400
  tls:
    - hosts:
        - plex.local
      secretName: tls-plex-ingress
---
# Source: kube-plex/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: plex
  namespace: plex
  labels:
    app: kube-plex
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: kube-plex
  template:
    metadata:
      labels:
        app: kube-plex
      annotations:
        kube-plex/container-name: "kube-plex-init"
        kube-plex/pms-addr: plex:32400
        kube-plex/pms-container-name: "plex"
    spec:
      serviceAccountName: plex
      hostname: "plex"
      hostNetwork: true
      initContainers:
        - name: kube-plex-install
          image: "ghcr.io/ressu/kube-plex:latest"
          imagePullPolicy: Always
          command:
            - cp
            - /kube-plex
            - /shared/kube-plex
          volumeMounts:
            - name: shared
              mountPath: /shared
      containers:
        - name: plex
          image: "plexinc/pms-docker:1.32.3.7192-7aa441827"
          imagePullPolicy: IfNotPresent
          # We replace the PMS binary with a postStart hook to save having to
          # modify the default image entrypoint.
          lifecycle:
            postStart:
              exec:
                command:
                  - bash
                  - -c
                  - |
                    #!/bin/bash
                    set -e
                    mv '/usr/lib/plexmediaserver/Plex Transcoder' '/usr/lib/plexmediaserver/Plex Transcoder.orig'
                    cp /shared/kube-plex '/usr/lib/plexmediaserver/Plex Transcoder'
          livenessProbe:
            tcpSocket:
              port: 32400
            initialDelaySeconds: 0
            failureThreshold: 3
            timeoutSeconds: 1
            periodSeconds: 10
          readinessProbe:
            tcpSocket:
              port: 32400
            initialDelaySeconds: 0
            failureThreshold: 3
            timeoutSeconds: 1
            periodSeconds: 10
          startupProbe:
            tcpSocket:
              port: 32400
            initialDelaySeconds: 0
            failureThreshold: 30
            timeoutSeconds: 1
            periodSeconds: 5
          env:
            - name: TZ
              value: "Etc/UTC"
            # TODO: move this to a secret?
            - name: PLEX_CLAIM
              valueFrom:
                secretKeyRef:
                  name: plex-claim
                  key: claim
            # kube-plex env vars
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ADVERTISE_IP
              value: |
                https://plex.local:443
                ,https://plex.local
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /config
            - name: transcode
              mountPath: /transcode
            - name: shared
              mountPath: /shared
          resources:
            {}
      nodeSelector:
        kubernetes.io/arch: amd64
        compatibility: plex
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: "plex-smb"
        - name: config
          persistentVolumeClaim:
            claimName: "plex-smb-config"
        - name: transcode
          persistentVolumeClaim:
            claimName: "plex-transcode"
        - name: shared
          emptyDir: {}
---
###Add Secret!!!
###
###kubectl create secret generic plex-claim --from-literal=claim="<Insert Secret Here>" -n plex
Sorry, it's a bit much to read, but I included all my external PV/PVC and some StorageClass definitions in it for easier management.
That's actually perfect. It allows me to read through everything I need.
On first look everything looks exactly as it's supposed to. If the jobs don't get created, the only thing that comes to mind is that the transcode process is having issues connecting to the Kubernetes API. Which Kubernetes version are you using?
1.27.2
Is there a way to see the error message from the child process on the recursive call of kube-plex?
kube-plex tries to write to the Plex log (which can be seen from the Plex admin console), but if there is nothing in that log then the process dies before anything gets logged. The process should start logging early, so there should be something in the log regardless of the reason why kube-plex fails.
I also checked that there aren't any immediately obvious API changes in Kubernetes either (I'm still running 1.26.x), so that shouldn't be the case either.
Okay, I'll scale it down and nuke the probes and try again, may have just lost it in the sea of logs. Do you know if it would be a warning or error log in the console? Or would it just be a debug?
You should be able to find it by searching for Kube or KubePlex. The error level depends on the log level, so I would expect any errors to show up in the error log.
Okay, I figured out that initial issue: I had the init container name typoed in the annotation in my live config. Once I fixed that it started kicking off jobs and pods, but the streams still won't start; now I see this error in the console: [Req#745b] [KubePlex] "level"=0 "msg"="Context terminated with error:%!(EXTRA *errors.errorString=context canceled)\n"
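(For reference, and assuming from the exchange above that the kube-plex/container-name annotation has to match the init container's actual name, the corrected pod template would look roughly like this sketch; the names are taken from the manifest earlier in the thread.)

template:
  metadata:
    annotations:
      # must match the init container name below; the pasted manifest had "kube-plex-init"
      kube-plex/container-name: "kube-plex-install"
  spec:
    initContainers:
      - name: kube-plex-install
        image: "ghcr.io/ressu/kube-plex:latest"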
Interestingly, once this error happens, it will no longer spawn jobs/pods when I attempt to do subsequent streams
Actually disregard that last one, it just took a long time for it to re-attempt it. Still always exits with that context error before streaming
So after some more testing I have a few more behaviors to add; maybe they'll help decode the issue.
I'm using hostNetwork: true on the deployment; curious if that has been tested successfully.
When I turn off hostNetwork it hangs for a while when attempting to launch a stream and then fails with a network error in the UI (same context error in the console).
Once the context error happens I have to refresh the page to get kube-plex to spawn a new job/pod when a stream is attempted; simply retrying to play the stream never spawns an elastic transcode job.
I haven't tested kube-plex with hostNetwork: true. An immediate issue I can think of is that while the main deployment would respect hostNetwork, the transcoding jobs should ignore the setting completely. Depending on how the cluster has been set up, that might also mean that the plex service doesn't have connectivity to the plex instance.
I'm guessing hostNetwork is on for local network discovery?
That all being said, if the networking works through the service resource, I don't see any other reason why this might be an issue.
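(One quick way to check the connectivity concern above is to hit the plex Service from a throwaway pod. The service name, namespace and port come from the manifests earlier in the thread; the busybox image and the /identity endpoint are just one reasonable probe, not something kube-plex requires.)

kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -n plex -- \
  wget -qO- --timeout=5 http://plex.plex.svc:32400/identity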
Tracing through the code, the only way the error Context terminated with error:%!(EXTRA *errors.errorString=context canceled) could be reached is when Plex kills the transcode process. The wrapper is waiting for the transcode job to do its thing. I was going to propose checking whether there are logs from this code, but it seems I never wrote logging for successful requests, in an attempt to minimise noise.
But I think we might be narrowing down on the issue.
Yeah, I use hostNetwork for a couple of reasons: one is for the DLNA, the other is because if you don't, all traffic is presumed to be remote due to the ingress adding in an "unrecognized domain". I tried to fix this with that Custom access URL setting in the admin menu, but it kept doing it. For some reason hostNetwork was the only thing that would fix it. I was trying to understand exactly how the traffic flows in this example, but I assume that your browser/app still talks to the plex pod and the only traffic to/from the transcoder is between the plex pod and the elastic pods.
On a side note, you think the codecPackage is causing the context exits?
Yeah, I use hostNetwork for a couple of reasons: one is for the DLNA, the other is because if you don't, all traffic is presumed to be remote due to the ingress adding in an "unrecognized domain".
Makes sense, this is due to the network detection. You should be able to fix this by adding the ingress pod IP range to your trusted networks in Plex settings.
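(As an illustration, assuming the pod network is the 10.244.0.0/16 range that shows up in the addresses quoted later in this thread, the Plex network settings would get something along these lines; the exact setting names can vary by Plex version.)

LAN Networks: 10.244.0.0/16
List of IP addresses and networks that are allowed without auth: 10.244.0.0/16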
For some reason hostNetwork was the only thing that would fix it. I was trying to understand exactly how the traffic flows in this example, but I assume that your browser/app still talks to the plex pod
When using Ingress, your browser talks to the ingress pod, which then opens a new connection to your Plex pod. This means that the caller IP address is going to be the IP of your ingress pod. Plex relies on the caller IP address to make decisions whether the connection is local or not. So when using Ingress, as long as the Ingress is considered local the incoming connection is considered local.
and the only traffic to/from the transcoder is between the plex pod and the elastic pods.
It's actually reversed. The transcoder pods talk back to the Plex pod. Plex assumes that the transcoder is running locally and is reporting back through its HTTP interface. Kube-plex tricks the transcoder into talking to a supplicant, which then forwards the traffic back to Plex via the service definition.
On a side note, you think the codecPackage is causing the context exits?
It's not likely that codecPackage is the origin of the issue. There is enough logging there that any error cases should be logged properly. I wanted to use the codec downloading to see if the transcoder pod is able to connect to the Plex pod in any way. The kube-plex wrapper seems to be running as intended, but the transcoder pod is having some issues either starting or talking back to the Plex pod.
Okay, I found a new error now in the logs [Req#1147] [KubePlex] "msg"="failed to create tar package: writing to tar failed: write tcp 192.168.2.212:33985->10.244.3.25:41218: write: broken pipe\n" "error"=null
Does that one mean anything?
That would mean that the download is trying to start, but cuts off mid download. Which is a bit odd as the tar gets loaded into memory and we're talking about a few megabytes of data, which shouldn't be an issue. Can you see if you can find any of the log lines from here in the logs?
Those should be output right before the codec download begins
Yessir, it makes it down to "Transcode begins..." before exiting on context
That error that I previously posted only happens when the plex pod is on hostNetwork
Hmm, that would mean that the issue is in the Plex native transcoder then. That's when kube-plex hands control back to the Plex native processes. Let's try and crank the log level to max. Try adding a kube-plex/loglevel: debug annotation to your deployment. That should pass the value through to the transcoder process and log a lot more. If I remember correctly, the logs are listed as Transcoder or something similar. Maybe those logs give us the last hint on what is going wrong.
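(A minimal sketch of where that annotation would sit, next to the existing kube-plex annotations from the deployment above; placing it on the pod template is an assumption based on where the other kube-plex annotations live.)

template:
  metadata:
    annotations:
      kube-plex/container-name: "kube-plex-install"
      kube-plex/pms-addr: plex:32400
      kube-plex/pms-container-name: "plex"
      kube-plex/loglevel: debug   # passed through to the transcoder wrapper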
Okay, I'll kick off another one and filter for transcoder
So with that, this one is the one that sticks out to me. Keep in mind I disabled hostNetwork for these tests: Completed: [10.244.2.0:43628] 404 GET /video/:/transcode/universal/session/oxar7lggbpmbfkt8o7zul4hu/1/header (19 live) #7d1 GZIP 142570ms 379 bytes (pipelined: 1)
I wonder if that is your browser making that call. It looks pretty similar to what the transcoder will post to the Plex process, but the fact that this is a GET request makes me think of a browser rather than the transcoder.
Let's try and take a closer look at the transcoder command. There should be a log line along the lines of Transcode requested with command (feel free to sanitize it before pasting).
[Req#6f5] [KubePlexProxy] "level"=0 "msg"="Transcode requested with command /usr/lib/plexmediaserver/Plex Transcoder, args = [-loglevel debug -loglevel_plex debug -codec:0 h264 -codec:1 dca -analyzeduration 20000000 -probesize 20000000 -i /data/movies/Push.mkv -filter_complex [0:0]scale=w=720:h=404:force_divisible_by=4[0];[0]format=pix_fmts=yuv420p|nv12[1] -map [1] -metadata:s:0 language=eng -codec:0 libx264 -crf:0 21 -maxrate:0 1724k -bufsize:0 3448k -r:0 23.975999999999999 -preset:0 veryfast -x264opts:0 subme=2:me_range=4:rc_lookahead=20:me=hex -force_key_frames:0 expr:gte(t,n_forced*3) -filter_complex [0:1] aresample=async=1:ochl='stereo':rematrix_maxval=0.000000dB:osr=48000[2] -map [2] -metadata:s:1 language=eng -codec:1 aac -b:1 162k -f dash -seg_duration 3 -dash_segment_type mp4 -init_seg_name init-stream$RepresentationID$.m4s -media_seg_name chunk-stream$RepresentationID$-$Number%05d$.m4s -window_size 5 -delete_removed false -skip_to_segment 1 -time_delta 0.0625 -manifest_name http://127.0.0.1:32400/video/:/transcode/session/oxar7lggbpmbfkt8o7zul4hu/c2752f3e-e00e-4153-9c01-7ed58cabc2af/manifest?X-Plex-Http-Pipeline=infinite -avoid_negative_ts disabled -map_metadata -1 -map_chapters -1 dash -start_at_zero -copyts -vsync cfr -y -nostats -loglevel quiet -loglevel_plex error -progressurl http://127.0.0.1:32400/video/:/transcode/session/oxar7lggbpmbfkt8o7zul4hu/c2752f3e-e00e-4153-9c01-7ed58cabc2af/progress]\n"
Ok, I'm stumped. That looks exactly as it is supposed to. So we know that the communication between the main pod and the transcoder works (we can see it from the logs). We know that the Kubernetes scheduling is working as intended since the launcher starts. So I don't see any reason why this wouldn't work.
Does the transcode pod create anything in the transcode share? Usually it creates some temporary files there very early in the transcode process.
I see it create up to the session directory, but never anything inside of it. Like, this folder gets created: /transcode/Transcode/Sessions/plex-transcode-y1fdnrjbltsky5swks0lts32-99c520c8-4cf7-4209-966c-ec77432a1876/ but it remains empty.
I can see a "WebSocket: client initiated close" not sure if that's related
I was finally able to get a couple things to transcode, but the delay is so long that it only works in the web browser. How long is the normal amount of delay for a transcode to begin?
The common transcode delay for me is 1-2 seconds, which I've usually chalked up to a slow system. And I haven't had issues with TV and other clients.
Do you have any idea what could have been the thing that got things to finally move forward? Maybe there is something we can improve in either documentation or in the code.
Yeah, so I definitely couldn't use hostNetwork, because then plex couldn't communicate with the elastic transcoders. I think once I cleaned that up and put a port forward in my firewall it started working, since UPnP won't auto-configure because that's a broadcast thing; at least it won't propagate outside the network namespace plex is running in. As for the delay, I'm getting 45+ seconds of it, and it also only appears to work on TV shows; when I try to do movies it just hangs too long. Not sure where the hang-up is. I'm definitely willing to research it more, but I'm having trouble finding any good documentation on how the interactions between the primary Plex process and the transcode processes normally work.
45 seconds is definitely long. I've been trying to think of ways to speed up the transcoder startup. One option I can think of is pre-creating the transcoders and using an RPC call to trigger the transcode on the other end. That way there is no delay from creating the job, and it would also allow the pods to stay around for longer in case we need to troubleshoot the contents for any reason.
Yeah, I had a similar thought about prestaging. Though I want to do some more research on the Plex container; it kinda sucks they don't have any documentation on their transcode binary, just whatever is in the help.
Think I may have found another problem I was having: SQLite over SMB is allegedly super slow/unstable. Haven't had a chance to test a block-level StorageClass yet. What type of StorageClass do you use for the /config dir?
I'm using iSCSI storage from my NAS as the storage for config. I know that there are some users using SMB for their storage without major issues. NFS has serious issues though, so I wouldn't recommend using NFS as storage.
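(For anyone wanting to try the same, a rough sketch of a statically provisioned, block-backed PV/PVC for /config using the in-tree iSCSI volume type; the portal, IQN and size are placeholders, not values from this thread.)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: plex-config-iscsi
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce            # block volumes are typically single-node
  persistentVolumeReclaimPolicy: Retain
  iscsi:
    targetPortal: 192.168.2.10:3260              # placeholder NAS portal
    iqn: iqn.2000-01.com.example:plex-config     # placeholder target IQN
    lun: 0
    fsType: ext4
    readOnly: false
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-config-iscsi
  namespace: plex
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeName: plex-config-iscsi
  storageClassName: ""          # bind to the static PV above, no dynamic provisioning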
After switching to a block-storage-based PVC, Plex seems dramatically faster. I'll retest the elastic transcoders tomorrow, but I assume that will have solved the core issue I was having.
So I retested with the new PVC and it still takes a little more than a minute between the initial play request and when the stream starts. Since this is getting pretty far out of scope of my initial issue, I'm going to close this as solved, with the takeaway that you get the initial error if it can't find a matching init container name. I'll open a new ticket for the latency issues.
Kubernetes: v1.27.2
Plex version: plexinc/pms-docker:1.32.3.7192-7aa441827
When attempting to stream a movie, plex fails with "Protected process returned an error: exit status 1"
This is the only log generated that I can see related to kube-plex. Even attempting to manually execute the binary from inside the container results in this error with no other output. I've tried both ghcr.io/ressu/kube-plex:latest and building this repo's main branch, and both result in the same error.
I am using SMB-based PVCs for config, data, and transcode. Though I'm not sure that's really relevant since it's not even getting that far.
I forgot to add that no jobs or transcode pods are ever generated.