munnerz / kube-plex

Scalable Plex Media Server on Kubernetes -- dispatch transcode jobs as pods on your cluster!
Apache License 2.0

Reworking of kube-plex #113

Closed · ressu closed this 3 years ago

ressu commented 3 years ago

I'll admit, this is a "scratch my own itch" type of solution, but when trying to make use of kube-plex I ran into a few issues, so I reworked quite a bit of it.

I'm dropping this PR here as a heads-up that this work has been done; I'm more than happy to try to break it down into smaller chunks for merging. Some changes included (in no specific order):

I'm also dropping the vendor directory from the repository. Personally I prefer to have it, but it tends to mess up pull requests. I've also constructed all the Dockerfiles etc. in a way where it's completely fine either to run everything as is or to pre-cache the modules with go mod vendor.
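
For context, a rough sketch of the two build flows this is meant to support; the cmd/kube-plex package path is taken from later in this thread, and the exact commands are illustrative rather than the repository's documented build steps:

# Module-aware build: dependencies are fetched at build time.
go build ./cmd/kube-plex

# Pre-cached build: populate vendor/ first; recent Go versions pick it
# up automatically, or it can be forced explicitly.
go mod vendor
go build -mod=vendor ./cmd/kube-plex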

Bakies commented 3 years ago

How's this working? I'd love to try this out, but might just end up trying unicorn instead.

ressu commented 3 years ago

I'm using this as my daily driver, so I'm not aware of any issues.

brandon099 commented 3 years ago

@ressu I'm interested in testing this out too. Other than installing the Helm chart from your fork, are there different Docker images needed to fully utilize your changes? I've tried just installing the Helm chart, but it's failing to start transcode pods, and I read that your changes include switching to transcode jobs instead, so I must be missing something. Thanks for this work!

ressu commented 3 years ago

You can use the image I built here: https://github.com/ressu/kube-plex/pkgs/container/kube-plex

So the relevant parts of my helm values are:

image:
  tag: latest
kubePlex:
  enabled: true
  image:
    repository: ghcr.io/ressu/kube-plex
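
For reference, applying those values might look roughly like this; the release name, namespace and chart path are assumptions based on the fork's layout, so adjust them to your setup:

# Hypothetical install command; release name "plex" and namespace are assumptions.
helm upgrade --install plex ./charts/kube-plex \
  --namespace plex \
  -f values.yaml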

brandon099 commented 3 years ago

After using the provided image, I get the following error when the /shared/kube-plex binary is invoked to spin up transcode jobs (which is progress from where I was previously 😄): Protected process returned an error: exit status 1

I haven't been able to determine where that message is coming from yet. Any thoughts? I'm running K3s 1.21.1, so I'm not sure if something newer in Kubernetes might be causing this.

Here are the Plex pod logs, and you can see my two attempts to transcode something.

$ kubectl logs plex-kube-plex-5585bc9f59-m49vw 
[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
[s6-init] ensuring user provided files have correct perms...exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 40-plex-first-run: executing... 
Plex Media Server first run setup complete
[cont-init.d] 40-plex-first-run: exited 0.
[cont-init.d] 45-plex-hw-transcode-and-connected-tuner: executing... 
[cont-init.d] 45-plex-hw-transcode-and-connected-tuner: exited 0.
[cont-init.d] 50-plex-update: executing... 
[cont-init.d] 50-plex-update: exited 0.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
Starting Plex Media Server.
Critical: libusb_init failed
Protected process returned an error: exit status 1
Protected process returned an error: exit status 1

ressu commented 3 years ago

Hmm, I'm not fully certain that's relevant to the issue. I have to admit that I've never checked the Plex console logs since I got everything working. What I would suggest is enabling verbose logging in kubeplex by adding loglevel: verbose to the values (similar to this):

image:
  tag: latest
kubePlex:
  enabled: true
  image:
    repository: ghcr.io/ressu/kube-plex
  loglevel: verbose

That will make kubeplex log more information. You can see the logs in the Plex web UI by going to Settings -> Manage -> Console and filtering for "transcode" (if I remember correctly). This will show the logs from kubeplex directly and should give you an idea of what is going wrong.

brandon099 commented 3 years ago

So I made it a little farther with the help of the verbose logging (thank you!). It showed insufficient permissions on the role kube-plex uses, so I had to update the role with the batch API group for the jobs resource.

diff --git a/charts/kube-plex/templates/rbac.yaml b/charts/kube-plex/templates/rbac.yaml
index a327770..e59ae87 100644
--- a/charts/kube-plex/templates/rbac.yaml
+++ b/charts/kube-plex/templates/rbac.yaml
@@ -27,6 +27,19 @@ rules:
   - patch
   - update
   - watch
+- apiGroups:
+  - batch
+  resources:
+  - jobs
+  verbs:
+  - create
+  - delete
+  - deletecollection
+  - get
+  - list
+  - patch
+  - update
+  - watch
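
A quick way to confirm the patched Role actually grants Job access is kubectl auth can-i; the namespace and service account name below are assumptions, so substitute the ones from your release:

# Hypothetical check; replace the namespace and service account with your own.
kubectl auth can-i create jobs.batch \
  --as=system:serviceaccount:plex:plex-kube-plex \
  -n plex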

Now I'm getting errors when the transcoder launches, and it looks like it's maybe related to an ffmpeg flag?

Jun 21, 2021 10:59:44.793 [0x7f5f47277b38] Debug — [Transcoder] [AVHWDeviceContext @ 0x7ff9e5845700] Cannot open a VA display from DRM device (null).
Jun 21, 2021 10:59:44.793 [0x7f5f47325b38] Error — [Transcoder] Device creation failed: -542398533.
Jun 21, 2021 10:59:44.793 [0x7f5f47277b38] Error — [Transcoder] Failed to set value 'vaapi=vaapi:' for option 'init_hw_device': Generic error in an external library
Jun 21, 2021 10:59:44.794 [0x7f5f47325b38] Error — [Transcoder] Error parsing global options: Generic error in an external library
Jun 21, 2021 10:59:44.795 [0x7f5f47277b38] Error — [KubePlexProxy] transcode failed [error:exit status 1]
Jun 21, 2021 10:59:45.545 [0x7f5f47325b38] Info — [KubePlex] Error waiting for pod to complete: job "pms-elastic-transcoder-2kvxm" failed

I did test the ability for a pod to schedule GPU (using an Ubuntu OpenCL test pod) and it was able to schedule the pod and utilize the GPU successfully.
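
For anyone wanting to reproduce that check, a minimal sketch of such a test pod might look like the following; the image and command are illustrative, not the exact pod used here:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    # Illustrative image/command; any image that can inspect /dev/dri works.
    image: ubuntu:20.04
    command: ["ls", "-l", "/dev/dri"]
    resources:
      limits:
        gpu.intel.com/i915: 1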

ressu commented 3 years ago

Oh, that makes sense. I don't think my cluster has RBAC enabled (I should turn it on, though), which explains the missing permissions. I'll add those to my branch a bit later in the week.

I haven't tried my codebase with GPU support. Could you send me an example of a GPU config so I know what to look for and can pick it up in the transcoder?

brandon099 commented 3 years ago

I'm running the Intel GPU device plugin DaemonSet, configured like the following:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: intel-gpu-plugin
  name: intel-gpu-plugin
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: intel-gpu-plugin
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: intel-gpu-plugin
    spec:
      containers:
      - env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: intel/intel-gpu-plugin:0.21.0
        imagePullPolicy: IfNotPresent
        name: intel-gpu-plugin
        resources: {}
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /dev/dri
          name: devfs
          readOnly: true
        - mountPath: /sys/class/drm
          name: sysfs
          readOnly: true
        - mountPath: /var/lib/kubelet/device-plugins
          name: kubeletsockets
      dnsPolicy: ClusterFirst
      nodeSelector:
        feature.node.kubernetes.io/pci-0300_8086.present: "true"
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /dev/dri
          type: ""
        name: devfs
      - hostPath:
          path: /sys/class/drm
          type: ""
        name: sysfs
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: kubeletsockets
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

Then the pod resources for requests and limits need to request the GPU resource like so:

    resources:
      requests:
        gpu.intel.com/i915: 1
        cpu: 1000m
        memory: 1500Mi
      limits:
        gpu.intel.com/i915: 1

As I was writing this about the resource requests, I started wondering whether the GPU resource request needs to be added to the job spec that creates the pms-elastic-transcoder-* job pods (in cmd/kube-plex/kubernetes.go) so the GPU is available to the elastic transcoder pods.
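
As a purely hypothetical sketch (not the actual kube-plex code in cmd/kube-plex/kubernetes.go), attaching extra resource requests/limits to the transcode Job's pod template with client-go types could look something like this:

// Hypothetical sketch only; an illustration of passing extra resource
// requests/limits through to the transcode Job's pod template.
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// transcodeJob builds a Job whose single container carries the supplied
// resource requirements (e.g. gpu.intel.com/i915) alongside its args.
func transcodeJob(image string, args []string, res corev1.ResourceRequirements) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "pms-elastic-transcoder-"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:      "transcode",
						Image:     image, // illustrative image, not necessarily what kube-plex uses
						Args:      args,
						Resources: res,
					}},
				},
			},
		},
	}
}

func main() {
	// Request one Intel GPU for the transcode pod.
	gpu := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{"gpu.intel.com/i915": resource.MustParse("1")},
		Limits:   corev1.ResourceList{"gpu.intel.com/i915": resource.MustParse("1")},
	}
	_ = transcodeJob("ghcr.io/ressu/kube-plex:latest", nil, gpu)
}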

ressu commented 3 years ago

Yeah, the requests need to be updated. I'm assuming this is an exclusive lock, as in there is just a single GPU instance available, which means the easy solution is to add a mechanism to define additional resource requests/limits on the transcoder job. That should be easy enough to fix.
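
The chart-side knob for that could be as simple as an extra resources block under the kube-plex values; this is purely a hypothetical shape, and the key names below don't exist in the chart yet:

kubePlex:
  # Hypothetical future option: extra resources applied to transcoder jobs.
  resources:
    requests:
      gpu.intel.com/i915: 1
    limits:
      gpu.intel.com/i915: 1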

I'll add that later this week along with the RBAC

ressu commented 3 years ago

Alright, I'm working on the GPU support in this PR https://github.com/ressu/kube-plex/pull/1

The change itself turned out to be a bit more involved than I expected, but should work. I'll do some more testing and rebuild images on my side.

ressu commented 3 years ago

Sigh, renaming the master branch to main closed this PR. I'll open a new one shortly.