pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

Can't Transcode to workers. #332

Closed Nullvoid3771 closed 1 week ago

Nullvoid3771 commented 2 weeks ago

Not sure how to resolve this error in the Plex logs.

A little insight into my setup: I'm using Proxmox to host VMs, 2 VMs total on two nodes on different systems, running in Docker Swarm.

I've attempted to disable firewalls and allow packet forwarding on all system hosts/VMs. No luck. Any ideas?

NetworkServiceBrowser: Error sending out discover packet from 10.0.0.39 to 10.0.0.255: Operation not permitted

Nullvoid3771 commented 2 weeks ago

Also getting the following, so it's probably related.

JobPoster connected, announcing

Orchestrator requesting pending work

Sending request to orchestrator on: http://plex-orchestrator:3500

Distributed transcoder failed, calling local

Nullvoid3771 commented 2 weeks ago

Received task request
EAE_ROOT => "/tmp/pms-2bad85a5-e295-42ce-b5f6-3e07de05f105/EasyAudioEncoder"
EAE Support - EAE already running
CWD path doesn't seem to exist. Plex should have created this path before-hand, so you may have an issue with your shares => "/config/Library/Application Support/Plex Media Server/Cache/Transcode/Sessions/plex-transcode-26k61s2ohb6ux3zspu2hbe3s-37bd8433-38a8-47f5-8e1f-e0c23e838168"
Transcoding failed: Error: spawn /usr/lib/plexmediaserver/Plex Transcoder ENOENT
    at ChildProcess._handle.onexit (node:internal/child_process:286:19)
    at onErrorNT (node:internal/child_process:484:16)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -2,
  code: 'ENOENT',
  syscall: 'spawn /usr/lib/plexmediaserver/Plex Transcoder',
  path: '/usr/lib/plexmediaserver/Plex Transcoder',
  spawnargs: [
    '-codec:0', 'hevc', '-codec:1', 'eac3_eae', '-eaeprefix:1', '26k61s2ohb6ux3zspu2hbe3s',
    '-ss', '938', '-analyzeduration', '20000000', '-probesize', '20000000',
    '-i', '/data/movies/Movies/[Redacted]',
    '-filter_complex', '[0:0]scale=w=720:h=360:force_divisible_by=4[0];[0]format=pix_fmts=yuv420p|nv12[1]',
    '-map', '[1]', '-codec:0', 'libx264', '-crf:0', '20', '-maxrate:0', '1712k', '-bufsize:0', '3424k',
    '-r:0', '30', '-preset:0', 'veryfast', '-x264opts:0', 'subme=3:me_range=4:rc_lookahead=10:me=hex',
    '-filter_complex', "[0:1] aresample=async=1:ochl='stereo':rematrix_maxval=0.000000dB:osr=48000[2]",
    '-map', '[2]', '-metadata:s:1', 'language=eng', '-codec:1', 'libopus', '-b:1', '178k',
    '-f', 'segment', '-segment_format', 'matroska', '-segment_format_options', 'live=1',
    '-segment_time', '1', '-segment_header_filename', 'header', '-segment_start_number', '0',
    '-segment_list', 'http://plex:32499/video/:/transcode/session/26k61s2ohb6ux3zspu2hbe3s/37bd8433-38a8-47f5-8e1f-e0c23e838168/manifest?X-Plex-Http-Pipeline=infinite',
    '-segment_list_type', 'csv', '-segment_list_unfinished', '1', '-segment_list_size', '5',
    '-segment_list_separate_stream_times', '1', '-avoid_negative_ts', 'disabled',
    '-map_metadata:g', '-1', '-map_metadata:c', '-1', '-map_chapters', '-1', 'chunk-%05d',
    '-start_at_zero', '-copyts', '-y', '-nostats', '-loglevel', 'verbose', '-loglevel_plex', 'verbose',
    '-progressurl', 'http://plex:32499/video/:/transcode/session/26k61s2ohb6ux3zspu2hbe3s/37bd8433-38a8-47f5-8e1f-e0c23e838168/progress'
  ]
}
Orchestrator notified
Removing process from taskMap
Transcoder close: child process exited with code -2

Nullvoid3771 commented 2 weeks ago

Hm, maybe a permission issue.

Edit: yes, the issue was me not specifying a /transcode folder, so the workers couldn't read it and it defaulted to the local PMS.

Specifying a /transcode folder seems to be an issue with SMB; I'll look at other options such as Ceph for this.
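For reference, this is roughly what the shared scratch mount might look like in the stack file (a sketch only; /mnt/plex-transcode is a placeholder for wherever the network share is mounted on each node):

    services:
      plex:
        volumes:
          # PMS and every worker must see the same network-backed path at /transcode
          - /mnt/plex-transcode:/transcode
      plex-worker:
        volumes:
          - /mnt/plex-transcode:/transcode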

pabloromeo commented 2 weeks ago

Ah, yes, network shares are mandatory. For hardware transcoding support there are a few others, such as Drivers and Cache. There are other closed issues here describing those extra network shares.
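A sketch of what those extra worker mounts might look like (host paths are placeholders, and the /drivers and /cache container paths just follow the share names mentioned above, so double-check them against the closed issues and the docs):

    plex-worker:
      volumes:
        - /mnt/plex-transcode:/transcode   # mandatory shared scratch space
        - /mnt/plex-drivers:/drivers       # hypothetical host paths for the
        - /mnt/plex-cache:/cache           # hardware-transcoding shares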

pabloromeo commented 2 weeks ago

The "error sending out discovery packet" I believe is a non-issue. Can't remember exactly what plex functionality that is (maybe discovering of client players through udp broadcasts or something like that, which wouldn't work on docker bridge networks), but it doesn't affect the application.

Nullvoid3771 commented 2 weeks ago

@pabloromeo

Ah, yes, network shares are mandatory. For hardware transcoding support there are a few others, such as Drivers and Cache. There are other closed issues here describing those extra network shares.

Can you confirm this setup for Docker Swarm is correct? I've added comments below to explain in more detail. I'm unable to get the workers to transcode.

Networks: plexnet, plexnet2, secure

version: '3.8'

services:
  plex:
    image: ghcr.io/linuxserver/plex:latest
    deploy:
      mode: replicated
      replicas: 1
    environment:
      DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_dockermod:latest"
      VERSION: docker
      PUID: 1000
      PGID: 1000
      TZ: America/Toronto
      ORCHESTRATOR_URL: http://plex-orchestrator:3500
      PMS_SERVICE: plex # This service. If you disable Local Relay then you must use PMS_IP instead
      PMS_PORT: "32400"
      TRANSCODE_OPERATING_MODE: local # (local|remote|both) Works only in local PMS mode.
      TRANSCODER_VERBOSE: "1" # 1=verbose, 0=silent
      LOCAL_RELAY_ENABLED: "1"
      LOCAL_RELAY_PORT: "32499"
      FORCE_HTTPS: "0"
    # devices: not needed because only NVIDIA GPUs work in Docker Swarm apparently
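For comparison, a minimal worker service sketch in the same style (the image, dockermod, and variable names follow my reading of the repo's with-dockermods.yaml; treat the port and interval values as assumptions):

    plex-worker:
      image: ghcr.io/linuxserver/plex:latest
      deploy:
        mode: replicated
        replicas: 2
      environment:
        DOCKER_MODS: "ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest"
        VERSION: docker
        PUID: 1000
        PGID: 1000
        TZ: America/Toronto
        ORCHESTRATOR_URL: http://plex-orchestrator:3500  # same service name as above
        LISTENING_PORT: 3501       # assumed default worker port
        STAT_CPU_INTERVAL: 2000    # assumed load-reporting interval (ms)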

==============================================================

Samba config.

[global]
   writeable = yes
   passwd program = /usr/bin/passwd %u
   read raw = no
   max log size = 1000
   default = global
   load printers = no
   logging = file
   inherit acls = yes
   panic action = /usr/share/samba/panic-action %d
   delete readonly = yes
   pam password change = no
   server string = %h server (Samba, Ubuntu)
   aio write size = 1
   min protocol = smb2
   getwd cache = yes
   large readwrite = yes
   netbios name = RackServer2
   write raw = no
   bind interfaces only = yes
   aio read size = 1
   log level = 1
   enhanced browsing = Yes
   server role = standalone server
   locking = no
   log file = /var/log/samba/log.%m
   workgroup = WORKGROUP
   map to guest = bad user
   kernel oplocks = no
   interfaces = "192.168.1.x/255.255.255.0;speed=2000000,capability=RSS" "169.254.x.x/255.255.0.0;speed=5000000,capability=RSS" "127.0.0.1/255.0.0.0;speed=5000000,capability=RSS"
   local master = yes
   revalidate = yes
   encrypt passwords = true
   server signing = mandatory
   use sendfile = true
   os level = 20
   server multi channel support = yes
   inherit owner = yes
   inherit permissions = yes
   unix password sync = yes
   smb encrypt = mandatory
   obey pam restrictions = yes
   usershare allow guests = yes
   socket options = SO_BROADCAST TCP_NODELAY IPTOS_LOWDELAY
   oplocks = no
   client signing = mandatory

[PlexConfig]
   user = [Redacted],nobody,nogroup,[Redacted]
   force create mode = 0777
   valid users = [Redacted],[Redacted],[Redacted],nobody
   write list = nobody,[Redacted],[Redacted],nogroup
   path = /media/MoviesOther/[Redacted]-NAS/Plex
   create mode = 0777
   directory mode = 0777
   force directory mode = 0777
   vfs objects = io_uring

Media folders work as intended, so I didn't include them.
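If it helps, instead of mounting the SMB share on every host, Docker's local volume driver can mount CIFS directly in the stack file (share path and credentials below are placeholders):

    volumes:
      transcode:
        driver: local
        driver_opts:
          type: cifs
          device: "//192.168.1.x/PlexConfig/transcode"   # placeholder share path
          o: "username=REDACTED,password=REDACTED,uid=1000,gid=1000,file_mode=0777,dir_mode=0777"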

Nullvoid3771 commented 2 weeks ago

Are there any additional ports that things listen on which may be getting blocked by the firewall?

Nullvoid3771 commented 2 weeks ago

[screenshot: plexerror]

I've tried adding a /drivers folder and even a /config folder for the workers; no go. With or without these I get the error above.

Nullvoid3771 commented 2 weeks ago

@pabloromeo Maybe this is an issue with Intel Quick Sync. Because devices: /dev/dri:/dev/dri is not supported in Docker Swarm, it cannot find the video drivers for the GPU/CPU? So maybe add a caveat stating that this will not work with AMD/Intel Quick Sync or any GPU except NVIDIA, unless you have more info on how to fix this; perhaps it's an issue with Proxmox virtualisation.

Edit: I've also added FFMPEG_HWACCEL: VAAPI to the worker and the main PMS. I think the libcuda.so.1 error is maybe a non-issue, as far as I can tell after further reading; without hardware acceleration it should default to software encoding/decoding. But attempting to transcode even with software just fails, with really no explanation in the logs, whenever it's set to remote for workers. So I'm at a loss as to the issue here.

I've also attempted to bypass the local relay with PMS_IP, so I don't think it's a communication issue. I've even attempted to disable firewalls.

Annoyingly, though, I don't get anything like 'Received task request' from the workers.

Nullvoid3771 commented 2 weeks ago

This is apparently a workaround for the issue of Docker Swarm not working with Intel Quick Sync; it probably needs to be modified. Edit: didn't work for me.

https://www.linkedin.com/pulse/docker-swarm-reducing-plex-cpu-utilisation-60-reis-holmes

https://pastebin.com/XY7GP18T
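As far as I can tell, the workaround boils down to bind-mounting the device nodes as plain volumes, since Swarm ignores the devices: key. Roughly (a sketch only, and again, this didn't work for me):

    plex-worker:
      environment:
        FFMPEG_HWACCEL: vaapi    # as tried above
      volumes:
        - /dev/dri:/dev/dri      # bind-mount the render nodes instead of `devices:`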

Nullvoid3771 commented 2 weeks ago

Orchestrator receives the task, but the workers do not. No idea why.

Initializing orchestrator

Using Worker Selection Strategy: LOAD_RANK

Stream-Splitting: DISABLED

Setting up websockets

Ready

Server listening on port 3500

Client connected: w6IQ1IDAne2vCNB-AAAB

Registering worker 1e73f8be-901b-42c7-866b-3ac77db2d12d|[Redacted-pcname]

Registered new worker: 1e73f8be-901b-42c7-866b-3ac77db2d12d|[Redacted-pcname]

Client connected: vkKB0peRsDr67e8FAAAD

Registering worker c5a98a53-3d57-4016-aafa-e51b736e1ec6|[Redacted-pcname]

Registered new worker: c5a98a53-3d57-4016-aafa-e51b736e1ec6|[Redacted-pcname]

Client connected: 6ED7DeHQdp4vH_PMAAAF

Registered new job poster: 3e3c8a60-a4f6-4a53-b55d-d674e512dfec|12cb24533837

Creating single task for the job

Queueing job d9cd4b5e-b6bc-4a8b-81e7-648d9bd58c33

Queueing task 6b440e6b-8587-4e19-8d12-87d0006675e2

Running task 6b440e6b-8587-4e19-8d12-87d0006675e2

Forwarding work request to 1e73f8be-901b-42c7-866b-3ac77db2d12d|[redacted]

Received update for task 6b440e6b-8587-4e19-8d12-87d0006675e2, status: received

Client disconnected: w6IQ1IDAne2vCNB-AAAB

Unregistering worker at socket w6IQ1IDAne2vCNB-AAAB

Killing pending tasks for worker: 1e73f8be-901b-42c7-866b-3ac77db2d12d

Received update for task 6b440e6b-8587-4e19-8d12-87d0006675e2, status: done

Task 6b440e6b-8587-4e19-8d12-87d0006675e2 complete, result: false

Task 6b440e6b-8587-4e19-8d12-87d0006675e2 complete

Job d9cd4b5e-b6bc-4a8b-81e7-648d9bd58c33 complete, tasks: 1, result: false

JobPoster notified

Removing job d9cd4b5e-b6bc-4a8b-81e7-648d9bd58c33

Job d9cd4b5e-b6bc-4a8b-81e7-648d9bd58c33 complete

Unregistering worker 1e73f8be-901b-42c7-866b-3ac77db2d12d|plex-worker-[redacted-pcname]

Client disconnected: 6ED7DeHQdp4vH_PMAAAF

Removing job-poster 3e3c8a60-a4f6-4a53-b55d-d674e512dfec|12cb24533837 from pool

Nullvoid3771 commented 2 weeks ago

Interestingly, what's above isn't even my workers' names.

clusteredplex_plex-worker.1.dvctrexq0pdhxuoaoohjnwo8e

clusteredplex_plex-worker.2.z50z3t6z2mknkajnncrhj7o79

Edit: they connect on the socket, so the socket names are correct.

Nullvoid3771 commented 2 weeks ago

@pabloromeo

JobPoster connected, announcing

Orchestrator requesting pending work

Sending request to orchestrator on: http://plex-orchestrator:3500

Distributed transcoder failed, calling local

Error transcoding and local transcode is disabled: TRANSCODE_OPERATING_MODE=remote


Setting /transcode for the PMS and the worker on the same node, and setting the worker to 1 replica, means everything is local to the master node; this was done as a test to rule out mounting issues. It ultimately shows the worker IS the issue, because it is NOT getting the request. There is now no firewall active, so everything should be working and communicating. Plex is set in settings to the ingress IP for the PMS. The issue might be the /transcode folder; I'm not sure, but to my understanding it is set up correctly: 0755 with user:user (redacted username).
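For what it's worth, the operating-mode values from my compose file above control that fallback; a sketch of the relevant setting (same variable as earlier in this thread):

    environment:
      # remote = workers only (no local fallback, hence the error above)
      # local  = local PMS only; both = try workers, fall back to local
      TRANSCODE_OPERATING_MODE: "both"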

[screenshots: ingress, volume, ingress2, ingress 3]

I can access PMS from the master node via the swarm's IP, 192.168.1.x:32400. I assume this is the correct way, since PMS lives on this node.

volumes:
       - type: bind               
         source: $HOST/var/lib/docker/volumes/plexvolume/_data
         target: /config

       - type: bind                        
         source: $HOST/var/lib/docker/volumes/plexvolume/_data/transcode
         target: /transcode
Nullvoid3771 commented 2 weeks ago

Could the issue be that the Portainer stack creates a different subnet from ingress?

The stack created from https://github.com/pabloromeo/clusterplex/blob/master/docs/docker-swarm/with-dockermods.yaml

I called the stack clusteredplex.

It creates a new network called clusteredplex_default on 10.0.2.0/24 (its preferred network setting), while ingress is on 10.0.0.0/24, and 172.18.0.0/16 is docker_gwbridge (not sure what that is; I seem to have a ton of bridges?).
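One way to rule that out might be to attach every service in the stack to a pre-created attachable overlay network instead of the Portainer-generated default (network name is an example):

    networks:
      plexnet:
        external: true  # created beforehand with:
                        # docker network create --driver overlay --attachable plexnet

    services:
      plex:
        networks: [plexnet]
      plex-worker:
        networks: [plexnet]
      plex-orchestrator:
        networks: [plexnet]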

[screenshots: IMG_0214, IMG_0213]

Nullvoid3771 commented 2 weeks ago

Creating this ClusterPlex stack is probably meant to be done outside of Portainer, but I used Portainer to make my life easier. Maybe that's the wrong way to go about this. A tutorial would be appreciated and really helpful, since I can only interpret your instructions as meant to be built in a Portainer swarm.

Nullvoid3771 commented 1 week ago

Fixed: I was forced to use Ceph; SMB had too many permission issues. See: https://github.com/pabloromeo/clusterplex/issues/335