pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License
468 stars 36 forks source link

ENOENT on Worker #41

Closed bfvogel closed 3 years ago

bfvogel commented 3 years ago

OUTPUT from Worker logs:

################################################## ON_DEATH: debug mode enabled for pid [8361] Computed CPU ops => 763017 Initializing Worker 7a3d7913-ba42-4318-bbef-dd70ef23fafa|US4-Plex2 Worker listening on port 3501 Worker connected on socket GF3epqq76pvLIxZEAABx Received task request Transcoding failed: { Error: spawn /usr/lib/plexmediaserver/Plex Transcoder ENOENT at Process.ChildProcess._handle.onexit (internal/child_process.js:240:19) at onErrorNT (internal/child_process.js:415:16) at process._tickCallback (internal/process/next_tick.js:63:19) errno: 'ENOENT', code: 'ENOENT', syscall: 'spawn /usr/lib/plexmediaserver/Plex Transcoder', path: '/usr/lib/plexmediaserver/Plex Transcoder', spawnargs: [ '-codec:0', 'h264', '-codec:1', 'ac3', '-analyzeduration', '20000000', '-probesize', '20000000', '-i', '/storage/public/Movies/Movies/20,000 Leagues Under the Sea (1954)/20,000 Leagues Under the Sea (1954).mkv', '-filter_complex', '[0:0]scale=w=720:h=282[0];[0]format=pix_fmts=yuv420p|nv12[1]', '-map', '[1]', '-codec:0', 'libx264', '-crf:0', '18', '-maxrate:0', '1686k', '-bufsize:0', '3372k', '-r:0', '23.975999999999999', '-preset:0', 'veryfast', '-x264opts:0', 'subme=2:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none', '-force_key_frames:0', 'expr:gte(t,n_forced3)', '-filter_complex', '[0:1] aresample=async=1:ocl=\'stereo\':rematrix_maxval=0.000000dB:osr=48000[2]', '-map', '[2]', '-metadata:s:1', 'language=eng', '-codec:1', 'aac', '-b:1', '217k', '-f', 'dash', '-seg_duration', '3', '-init_seg_name', 'init-stream$RepresentationID$.m4s', '-media_seg_name', 'chunk-stream$RepresentationID$-$Number%05d$.m4s', '-window_size', '5', '-delete_removed', 'false', '-skip_to_segment', '1', '-time_delta', '0.0625', '-manifest_name', 'http://192.168.116.147:32400/video/:/transcode/session/bos1g6uj5nmgmr0x2x4xtan2/ee6e5090-687b-4e0f-8a9b-e2ddc3342240/manifest?X-Plex-Http-Pipeline=infinite', '-avoid_negative_ts', 'disabled', '-map_metadata', '-1', '-map_chapters', '-1', 'dash', '-start_at_zero', '-copyts', '-vsync', 'cfr', '-y', '-nostats', '-loglevel', 'verbose', '-loglevel_plex', 'verbose', '-progressurl', 'http://192.168.116.147:32400/video/:/transcode/session/bos1g6uj5nmgmr0x2x4xtan2/ee6e5090-687b-4e0f-8a9b-e2ddc3342240/progress' ] } Orchestrator notified Removing process from taskMap Received task request Transcoding failed: { Error: spawn /usr/lib/plexmediaserver/Plex Transcoder ENOENT at Process.ChildProcess._handle.onexit (internal/child_process.js:240:19) at onErrorNT (internal/child_process.js:415:16) at process._tickCallback (internal/process/next_tick.js:63:19) errno: 'ENOENT', code: 'ENOENT', syscall: 'spawn /usr/lib/plexmediaserver/Plex Transcoder', path: '/usr/lib/plexmediaserver/Plex Transcoder', spawnargs: [ '-codec:0', 'h264', '-codec:1', 'ac3', '-ss', '120', '-analyzeduration', '20000000', '-probesize', '20000000', '-i', '/storage/public/Movies/Movies/20,000 Leagues Under the Sea (1954)/20,000 Leagues Under the Sea (1954).mkv', '-filter_complex', '[0:0]scale=w=720:h=282[0];[0]format=pix_fmts=yuv420p|nv12[1]', '-map', '[1]', '-codec:0', 'libx264', '-crf:0', '18', '-maxrate:0', '1686k', '-bufsize:0', '3372k', '-r:0', '23.975999999999999', '-preset:0', 'veryfast', '-x264opts:0', 'subme=2:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none', '-force_key_frames:0', 'expr:gte(t,n_forced3)', '-filter_complex', '[0:1] aresample=async=1:ocl=\'stereo\':rematrix_maxval=0.000000dB:osr=48000[2]', '-map', '[2]', '-metadata:s:1', 'language=eng', '-codec:1', 'aac', '-b:1', '217k', '-f', 'dash', '-seg_duration', '3', '-init_seg_name', 'init-stream$RepresentationID$.m4s', '-media_seg_name', 'chunk-stream$RepresentationID$-$Number%05d$.m4s', '-window_size', '5', '-delete_removed', 'false', '-skip_to_segment', '41', '-time_delta', '0.0625', '-manifest_name', 'http://192.168.116.147:32400/video/:/transcode/session/bos1g6uj5nmgmr0x2x4xtan2/c12e8fcb-e6c3-4f24-9822-600d969b8b25/manifest?X-Plex-Http-Pipeline=infinite', '-avoid_negative_ts', 'disabled', '-map_metadata', '-1', '-map_chapters', '-1', 'dash', '-start_at_zero', '-copyts', '-y', '-nostats', '-loglevel', 'verbose', '-loglevel_plex', 'verbose', '-progressurl', 'http://192.168.116.147:32400/video/:/transcode/session/bos1g6uj5nmgmr0x2x4xtan2/c12e8fcb-e6c3-4f24-9822-600d969b8b25/progress' ] } Orchestrator notified Removing process from taskMap ################################################

I have tried the following: Verified permission on the executable : /usr/lib/plexmediaserver/Plex Transcoder. Verified that plex is being installed to the correct path IE the /usr/lib/plexmediaserver/Plex Transcoder exists. Simplified path to transcoder to /app/transcoder AND updated config variables with exact same results. Initiated remote transcode using 'testpayload' argument which is successful. Initiated remote transcode with minimal arguments IE "-i /movies/movie.mp4": The transcoder is executing but fails due to missing arguments.

I think this is related to how node is parsing one of the current Plex arguments being passed. as simplifying the argument seems to get the transcoder to execute on the worker but if an entire normal argument set from Plex is used the worker is unable to execute the transcoder correctly and thinks the path to the executable does not exist.

bfvogel commented 3 years ago

[root@US4-Plex2 ~]# cat /etc/redhat-release Red Hat Enterprise Linux release 8.3 (Ootpa) [root@US4-Plex2 ~]# node -v v10.24.0

bfvogel commented 3 years ago

[root@US4-Plex2 ~]# rpm -qa | grep plex plexmediaserver-1.22.2.4282-a97b03fad.x86_64

bfvogel commented 3 years ago

Looks like Unicorn Transcoder might have had a similar bug: https://github.com/UnicornTranscoder/UnicornTranscoder/issues/31

pabloromeo commented 3 years ago

Are you using any of the Docker images built from this repo? Is it something that was working and started to fail with Plex's latest update (v1.22)? Are your Plex server and the workers using the same architecture? Are you sure your worker has access to the path where the movies are stored? IE /storage/public/Movies/Movies/

I just launched a transcode request on the latest version running on ARM and it worked as expected and was able to call the trancoder on v1.22.2.4282.

bfvogel commented 3 years ago

I am not using the docker images so I know that complicates things a bit.

I rebuilt from scratch today and came to the same results but continued with the deployment and figured out what but not why it was failing. The codecs path that the start.sh script on the worker pulls from does not read the strings for ${PLEX_ARCH} from the strings of the current version of "/usr/lib/plexmediaserver/Plex Media Server" correctly. I don't know if Plex changed the string but it was not matching the case parameters.

Modifications:

32 *) 33 CLUSTERPLEX_PLEX_CODEC_ARCH="linux-x86_64-standard" 34

So I am guessing that the ENOENT was from the FFMPEG_EXTERNAL_LIBS directive as things now launch as expected after fixing start.sh discovery.

pabloromeo commented 3 years ago

Ah, I see. Do you happen to have the console output log for the start.sh script? It outputs the values it has extracted from the Plex executable and the paths it configures.

Ultimately what that script does is attempt to identity the current architecture and plex version of the worker, and fetch the codecs on startup for it. You could even have a shared location for the codecs if you have multiple workers, and they'd only get downloaded once per architecture and version (since it takes a few seconds/minutes to download all of them from scratch).

That's because the Plex Media Server and the Worker may be different architectures, and hence different codec binaries. For example, I've run PMS on 32bit ARM (a rpi 4), but a Worker on an Intel 64bit machine and gotten it to transcode for that PMS correctly.

Regarding the PLEX_ARCH env variable, if I'm not mistaken (I wrote it quite some time ago, so I could be remembering wrong), actually is set by the docker base images I'm using, which are the linuxserver ones. Since you are running outside of docker maybe you are missing a few ENV variables that this script expects.

shokohsc commented 3 years ago

Hello @pabloromeo, great project ! I'm having the same issue with all your images at the latest tag from about a week ago. Using this project with k8s-at-home plex chart, overriding values and adding deployments & services for the orchestrator & worker parts on a bare metal kubernetes cluster.

Curious that I'm having this ENOENT error because binary is there at the right path on each worker Would that comes from the fact that the directory is owned by root ? even though it is executable by 'others' ?

root@worker-6559f6b44f-t8vlm:/app# ls -lah /usr/lib/plexmediaserver/ total 44M drwxr-xr-x 2 root root 4.0K Apr 5 21:19 "'." drwxr-xr-x 5 root root 4.0K Apr 7 01:07 . drwxr-xr-x 1 root root 4.0K Apr 11 17:44 .. -rwxr-xr-x 1 root root 1.6M Apr 5 21:17 CrashUploader drwxr-xr-x 3 root root 4.0K Apr 7 01:07 lib -rwxr-xr-x 1 root root 399K Apr 5 20:54 'Plex Commercial Skipper' -rwxr-xr-x 1 root root 4.3M Apr 5 21:13 'Plex DLNA Server' -rwxr-xr-x 1 root root 135K Apr 5 20:54 'Plex Media Fingerprinter' -rwxr-xr-x 1 root root 7.4M Apr 5 21:17 'Plex Media Scanner' -rwxr-xr-x 1 root root 25M Apr 5 21:19 'Plex Media Server' -rwxr-xr-x 1 root root 730K Apr 5 20:54 'Plex Relay' -rwxr-xr-x 1 root root 13K Apr 5 20:54 'Plex Script Host' -rwxr-xr-x 1 root root 266K Apr 5 20:54 'Plex Transcoder' -rwxr-xr-x 1 root root 4.5M Apr 5 20:54 'Plex Tuner Service' drwxr-xr-x 10 root root 4.0K Apr 7 01:07 Resources

I also have a question, why on the worker would I need the /config path (Plex Media Server data) too ?

pabloromeo commented 3 years ago

Hi! In the case of @bfvogel i believe there were environment variables missing, since it wasn't being run using these docker images.

In your case @shokohsc which images (and architecture) are you running?

shokohsc commented 3 years ago

I am running, on intel-nuc7-8 (64bits arch)

pabloromeo commented 3 years ago

And regarding your last question, you are correct, workers don't need access to /config nor use it. I'll double-check the readme documentation and remove that.

Do you have the logs of the Worker while it's starting up? It should output information regarding the identified architecture, the codecs it's downloading from Plex and things like that. It could give us a hint of what's happening.

shokohsc commented 3 years ago

Thanks for the answer on the plex media server data folder. Here are redacted for brevity logs on the worker targeted for a transcode request.

➜ charts git:(main) kube logs -n plex worker-6559f6b44f-t8vlm CLUSTERPLEX_PLEX_VERSION => '1.22.2.4282-a97b03fad' CLUSTERPLEX_PLEX_CODECS_VERSION => '73e06c8-3759' CLUSTERPLEX_PLEX_EAE_VERSION => 'eae-69c1de6-42' PLEX_ARCH => 'amd64' CLUSTERPLEX_PLEX_CODEC_ARCH => linux-x86_64-standard Codec location => /codecs/73e06c8-3759-linux-x86_64-standard Processing codec libaac_decoder... Codec libaac_decoder.so already exists. Skipping Processing codec libaac_encoder... Codec libaac_encoder.so already exists. Skipping Processing codec libaac_latm_decoder... ... Codec libzlib_decoder.so already exists. Skipping Processing codec libzmbv_decoder... Codec libzmbv_decoder.so already exists. Skipping ON_DEATH: debug mode enabled for pid [23] Computed CPU ops => 935907 Initializing Worker 1da5e43a-7a48-42f3-8aa6-6bcf695b23ab|worker-6559f6b44f-t8vlm Worker listening on port 3501 Worker connected on socket eQuPec2G4mR5KN_lAAAO Received task request Transcoding failed: { Error: spawn /usr/lib/plexmediaserver/Plex Transcoder ENOENT at Process.ChildProcess._handle.onexit (internal/child_process.js:240:19) at onErrorNT (internal/child_process.js:415:16) at process._tickCallback (internal/process/next_tick.js:63:19) errno: 'ENOENT', code: 'ENOENT', ...

pabloromeo commented 3 years ago

Mmmm I'm not seeing anything suspicious pop up there. Any chance you could enable DEBUG logging in the PMS General settings and attempt a transcode while monitoring the logs in the UI? That might provide additional information on what could be happening.

I take it you did have a working Worker service successfully transcoding before and something recently changed, or is this the first attempt at running it in k8s?

shokohsc commented 3 years ago

This is a first attempt at it on a kubernetes cluster, I discovered your project a few days ago. Sorry I must be that tired but I can't understand which logs you want me to forward you ? Ones I can find in plex media server UI (Console) or the ones the pms logs to stdout ?

pabloromeo commented 3 years ago

The one in the PMS UI, Console. I'm assuming there may be additional warnings or error messages to try to determine why it's failing.

I run the ARM version in a swarm cluster and it's transcoding remotely without issues.

Could it be a permissions issue related to the PUID and PGID?

shokohsc commented 3 years ago

Hello, I have removed PUIG & PGID variables without success, here are pms UI logs while still having the same ENOENT error on a worker transcoder: https://pastebin.com/FKeeBVuB, I also tried the -dev tag worker images yesterday without success.

pabloromeo commented 3 years ago

One thing I saw in the logs was:

Apr 26, 2021 10:44:51.504 [0x7f5ba9ffb700] Erreur — [Transcoder] [segment @ 0x1b6d840] Failed to open segment 'chunk-00027'
Apr 26, 2021 10:44:51.504 [0x7f5ba8ff9700] Erreur — [Transcoder] av_interleaved_write_frame(): No such file or directory
Apr 26, 2021 10:44:51.510 [0x7f5ba8ff9700] Erreur — [Transcoder] Error writing trailer of chunk-%05d: Invalid argument

It looks like it can't write the transcoded output parts. It could be a configuration problem.

Review your volumes on Plex and the Workers. They must share the transcode location and have write access. In my case it's shared through GlusterFS but it would also work through NFS. The transcoder needs to be able to create directories and files that will later be picked up by PMS for streaming. - /path/to/transcodedata:/transcode

And the path on the docker container used for this transcoding must match your config in PMS: image

You configure this per your needs, but in my case the volume I mount is just /tmp and specify a "transcode" directory inside it. But both PMS and workers need to share that and represent the same local path within each of the containers for paths to match with PMS.

You also probably don't want to remove the PUID and PGID settings, but rather specify a proper user that will have access to these. You can find more info regarding this in the README and in the README for https://hub.docker.com/r/linuxserver/plex which is the base image for ClusterPex.

shokohsc commented 3 years ago

That was it ! I shared /transcode to all workers & main pms, and the error was gone.

Now onto 2021-04-26 21:15:08 | [eac3_eae @ 0x27efd40] EAE timeout! EAE not running, or wrong folder? Could not read '/tmp/pms-feb82f22-8ce8-485d-a60d-85505c6f8d33/EasyAudioEncoder/Convert to WAV (to 8ch or less)/dz6obh5xc0ec2f2dgiaowaog_106-0-9.wav'

even after sharing to all via nfs and chown 911:abc -R /tmp && chmod 777 -R /tmp, cannot start transcoding.

pabloromeo commented 3 years ago

So you are also mounting /tmp as an NFS volume now across both PMS and the workers? That's how i have it (except I use GlusterFS instead, which might make a difference if the Plex devs are using inotify or something like that). The entire /tmp is shared, and the transcode directory is a subdirectory within it.

I believe the EAE encoder is copied to that new /tmp directory by PMS before transcoding starts. You can monitor it and see if the contents get created at that location, and if it is, check within the Worker container if it is also visible and accesible with proper permissions. You might also want to ensure both the PMS and the Workers are running as the same PUID and GIDs.

To the Worker we are basically creating the illusion that it is a PMS so all these paths should match and be accesible.

shokohsc commented 3 years ago

I gave up !

I tried securityContext.fsGroup, securityContainer.runAsUser and securityContainer.runAsGroup with users I could find in pms & worker images without success (or luck) /tmp/pms-... is never writable after files in subfolders are being created and owned by the main user 'abc'.

Surprisingly, running as root did not help either. After reading and trying every solution I could find about this issue, tuning up fs.inotify.max_user_watches, deleting /tmp/pms-... folder, whatever (You can find some surprising things). I came back to using k8s at home base image defined in their chart, will see how that goes next time I'll find time to watch something.

Maybe a new version of plex will write that or those encoders someplace else where it won't cause any issue and I'll circle back to this project.

Thanks for the assist.

Just out of curiosity: In theory, you could spawn the container/pod on demand but you went with the hot container/pod instead, why ? Wouldn't it be less wasteful in terms of resources to spawn the worker instead of having only a defined set of some containers/pods, even when you're sleeping ?

pabloromeo commented 3 years ago

Ah, that's a shame. Yeah, I haven't tried running it in k8s yet. Some other user was seeing similar errors regarding the EAE encoder using that /tmp folder, which meant that some media (like trueHD, or DD I believe) wouldn't work, but other audio tracks were working fine. I was also researching a bit around this EAE timeout issue, and saw several posts about it being related to using inotify to monitor the FS, which probably would cause problems in distributed or shared filesystems. Plex might take a different route to that at some point, and avoid the issue. I mean, they seem to handle it just fine for transcoded video in detecting segments, so I'm not sure why with this encoder it's a different approach.

Regarding the spawning of containers/pods, yeah, totally doable. In fact I believe at some point I saw someone running plex on k8s with an approach similar to that for transcoders. Don't know if the project is still active though. I just went with this approach because it was the simplest for my use-case, in which it's my own hardware, while being idle it consumes nearly nothing, and transcodes would start instantly without the startup time of the workers. It also meant that the actual implementation didn't have to have knowledge of the cluster orchestrator behind it, in fact you could run it without one. I mean I run it in Swarm, but nothing in the code forces you to use any specific one. Also one could have autoscaling in place, or bring down replica counts to zero at certain times or loads.

Idle usage seemed pretty insignificant to be worth the effort of doing on-demand: image

But yeah, there are certainly different approaches to it, each with their pros and cons. In fact, UnicornTranscoder might do the on-demand pod creation you're mentioning and I believe they have Helm Charts for k8s. You can give it a try before giving up on remote transcoding :) I initially created this for personal use, before hearing about Unicorn.

shokohsc commented 3 years ago

@pabloromeo thanks for the clarification.

I have an update on the EAE timeout issue, indeed clusterplex isn't to blame: https://forums.plex.tv/t/error-transcoder-ac3_eae-0x2fd4f40-eae-timeout-eae-not-running-or-wrong-folder/189917/74

I started pms & worker pods on the same node and it did work., Worker started transcoding without issue.

It doesn't help me because I need to push transcoding process to another node to free resources for pms and its god-forsaken sqlite3 database, but the error went away because it picked the host fsevent about the /tmp directory shared and used by EasyAudioEncoder.

I still do not know how you made it work on your side. Do GlusterFS implement fsevent propagation maybe ? Were worker and pms on the same node on your side ?

I'm still using nfs-server-provisioner backed by rook (cephcluster) for RWX volumes and I may need to change to a ceph solution before trying to propagate by myself fsevents from one node to the other (I'd would really like not to have to do that).

urwrstkn8mare commented 2 years ago

Hi I'm having the same problem, I've read the thread but I'm not sure if I understand what you did to fix the problem.

pabloromeo commented 2 years ago

Have you tried with content that does't require using EAE (Plex's EasyAudioEncoder)? There's a clusterplex config setting that forces EAE transcodes to run locally instead of distributed on the workers. So if most of your content does not use EAE you can turn that on. I'm looking into getting EAE working on the Workers, it's just a bit cumbersome getting it to work both on Amd64 as well as ARM, due to how Plex versions EasyAudioEncoder and their urls for downloading versions during upgrades.

urwrstkn8mare commented 2 years ago

oh i didn't think it need EAE but i'll try it, thanks