pabloromeo / clusterplex

ClusterPlex is an extended version of Plex, which supports distributed Workers across a cluster to handle transcoding requests.
MIT License

PMS originalTranscoder does not exit; cannot delete session directory #257

Closed audiophonicz closed 1 year ago

audiophonicz commented 1 year ago

**Describe the bug**
When transcoding falls back to the clusterplex-pms container (`Distributed transcoder failed, calling local`), originalTranscoder is launched and does not exit when the video stops, resulting in:

  1. Continued generation of /transcode/Transcode/Sessions/plex-transcode-xxxxxxx/yyyy.ts files, which
  2. causes the directory cleanup to fail because the directory is not empty, and
  3. subsequent plays of the same video (e.g. Resume) start another instance of originalTranscoder that also does not exit,
  4. and so on,
  5. resulting in multiple originalTranscoder processes using 50+% CPU and filling the /transcode PVC.

This has been observed with both software and hardware transcode.

**To Reproduce**

  1. Scale down the worker pods to 0 to force the clusterplex-pms container to do the transcode:
     `k scale -n namespace statefulset clusterplex-worker --replicas=0`

  2. Start playing a video that requires transcoding (in my case HEVC to a Galaxy Tab in the Plex app). The log shows:

    
    JobPoster connected, announcing
    Orchestrator requesting pending work
    Sending request to orchestrator on: http://clusterplex-orchestrator:3500
    Distributed transcoder failed, calling local

  3. Exec into the clusterplex-pms container:
     `k exec -n namespace -it clusterplex-pms-5b4d95555b-6gkdb -- /bin/bash`

  4. Run `top` inside the container and observe the originalTranscoder process.

  5. Stop playing the transcoded video (watch the play disappear from the Plex Dashboard).

  6. Observe that originalTranscoder continues to run.

  7. Watch the session dir continue to accumulate .ts files:
     `watch ls -al /transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-xxxxx/`

  8. Subsequent Resume attempts start a new originalTranscoder process and create a new session dir.

**Expected behavior**
When transcoding falls back to the clusterplex-pms container, PlexTranscoder should be launched instead of originalTranscoder,
and/or:
regardless of which transcoder process starts, it should stop when the transcoded video stops playing.

**Screenshots**
Debug log snippet from the PMS server:

    Sep 15, 2023 17:49:47.710 [140006365915960] DEBUG - Shutting down idle session 9d599c3d6e4fb7eb-com-plexapp-android (idle time is 180 seconds)
    Sep 15, 2023 17:49:47.710 [140006365915960] DEBUG - Killing job.
    Sep 15, 2023 17:49:47.710 [140006365915960] DEBUG - Signalling job ID 1579 with 9
    Sep 15, 2023 17:49:47.710 [140006365915960] DEBUG - Job was already killed, not killing again.
    Sep 15, 2023 17:49:47.710 [140006365915960] DEBUG - Stopping transcode session 9d599c3d6e4fb7eb-com-plexapp-android
    Sep 15, 2023 17:49:47.711 [140006330055480] DEBUG - Cleaning directory for session 9d599c3d6e4fb7eb-com-plexapp-android (/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1)
    Sep 15, 2023 17:49:47.727 [140006402243384] DEBUG - Request came in with unrecognized domain / IP 'clusterplex.example.lab' in header Referer; treating as non-local
    Sep 15, 2023 17:49:47.727 [140006255430456] DEBUG - Request: [10.42.2.4:58426 (Allowed Network (WAN))] GET /status/sessions (10 live) #112f TLS GZIP Signed-in Token (PlexUser) (Chrome)
    Sep 15, 2023 17:49:47.727 [140006255430456] DEBUG - [Req#112f] [Now] Adding 0 sessions.
    Sep 15, 2023 17:49:47.729 [140006404352824] DEBUG - Completed: [10.42.2.4:58426] 200 GET /status/sessions (10 live) #112f TLS GZIP 2ms 486 bytes (pipelined: 22)
    Sep 15, 2023 17:49:47.841 [140006365915960] DEBUG - Transcoder: Cleaning old transcode directories.
    Sep 15, 2023 17:49:47.852 [140006406945592] DEBUG - Jobs: '/usr/lib/plexmediaserver/Plex Transcoder' exit code for process 1579 is -9 (signal: Killed)
    Sep 15, 2023 17:49:47.935 [140006365915960] DEBUG - Transcoder: Cleaning old transcode directory: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-1f23547b-9471-4c9d-8a7c-dd723dc556a3"
    Sep 15, 2023 17:49:48.002 [140006363020088] DEBUG - [Req#113b/Transcode/9d599c3d6e4fb7eb-com-plexapp-android/d9c1fab6-ba81-4ff4-8fcb-47de8d210d41] Throttle - Going into sloth mode.
    Sep 15, 2023 17:49:48.011 [140006263868216] DEBUG - [Req#113c/Transcode/9d599c3d6e4fb7eb-com-plexapp-android/7d8629d3-f086-48e6-b5ec-ad451c9edaa1] Throttle - Going into sloth mode.
    Sep 15, 2023 17:49:48.247 [140006255430456] DEBUG - [Req#1141/Transcode/9d599c3d6e4fb7eb-com-plexapp-android/1f23547b-9471-4c9d-8a7c-dd723dc556a3] Throttle - Going into sloth mode.
    Sep 15, 2023 17:49:48.338 [140006330055480] ERROR - Failed to delete session directory (boost::filesystem::remove: Directory not empty [system:39]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1")
    Sep 15, 2023 17:49:48.404 [140006261758776] DEBUG - [Req#1145/Transcode/9d599c3d6e4fb7eb-com-plexapp-android/6d77c795-0b97-469a-808f-4299b02e154a] Throttle - Going into sloth mode.
    Sep 15, 2023 17:49:48.408 [140006402243384] DEBUG - Request came in with unrecognized domain / IP 'clusterplex.example.lab' in header Referer; treating as non-local
    Sep 15, 2023 17:49:48.408 [140006363020088] DEBUG - Request: [10.42.2.4:58426 (Allowed Network (WAN))] GET /statistics/bandwidth?timespan=6 (10 live) #1139 TLS GZIP Signed-in Token (PlexUser) (Chrome)
    Sep 15, 2023 17:49:48.412 [140006404352824] DEBUG - Completed: [10.42.2.4:58426] 200 GET /statistics/bandwidth?timespan=6 (10 live) #1139 TLS GZIP 4ms 2058 bytes (pipelined: 23)
    Sep 15, 2023 17:49:48.481 [140006365915960] ERROR - Transcoder: Failed to delete session directory (boost::filesystem::remove: Directory not empty [system:39]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-1f23547b-9471-4c9d-8a7c-dd723dc556a3")
    Sep 15, 2023 17:49:48.481 [140006365915960] DEBUG - Transcoder: Cleaning old transcode directory: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-6d77c795-0b97-469a-808f-4299b02e154a"
    Sep 15, 2023 17:49:48.823 [140006365915960] ERROR - Transcoder: Failed to delete session directory (boost::filesystem::remove: Directory not empty [system:39]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-6d77c795-0b97-469a-808f-4299b02e154a")
    Sep 15, 2023 17:49:48.823 [140006365915960] DEBUG - Transcoder: Cleaning old transcode directory: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1"
    Sep 15, 2023 17:49:48.935 [140006365915960] ERROR - Transcoder: Failed to delete session directory (boost::filesystem::remove: Resource busy [system:16]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1/.nfs00000000000200cf000008b1")
    Sep 15, 2023 17:49:48.936 [140006365915960] DEBUG - Transcoder: Cleaning old transcode directory: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-d9c1fab6-ba81-4ff4-8fcb-47de8d210d41"
    Sep 15, 2023 17:49:49.411 [140006404352824] DEBUG - Request came in with unrecognized domain / IP 'clusterplex.example.lab' in header Referer; treating as non-local
    Sep 15, 2023 17:49:49.411 [140006363020088] DEBUG - Request: [10.42.2.4:58426 (Allowed Network (WAN))] GET /statistics/bandwidth?timespan=6 (10 live) #1147 TLS GZIP Signed-in Token (PlexUser) (Chrome)
    Sep 15, 2023 17:49:49.412 [140006365915960] ERROR - Transcoder: Failed to delete session directory (boost::filesystem::remove: Directory not empty [system:39]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-d9c1fab6-ba81-4ff4-8fcb-47de8d210d41")
    Sep 15, 2023 17:49:49.412 [140006365915960] DEBUG - Whacked session 9d599c3d6e4fb7eb-com-plexapp-android, 0 remaining.
    Sep 15, 2023 17:49:49.412 [140006365915960] DEBUG - Killed 1 idle sessions out of a total of 1.
    Sep 15, 2023 17:49:49.412 [140006365915960] DEBUG - Cleaning directory for session 9d599c3d6e4fb7eb-com-plexapp-android (/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-1f23547b-9471-4c9d-8a7c-dd723dc556a3)
    Sep 15, 2023 17:49:49.423 [140006404352824] DEBUG - Completed: [10.42.2.4:58426] 200 GET /statistics/bandwidth?timespan=6 (10 live) #1147 TLS GZIP 12ms 2065 bytes (pipelined: 24)
    Sep 15, 2023 17:49:49.456 [140006365915960] ERROR - Failed to delete session directory (boost::filesystem::remove: Resource busy [system:16]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-1f23547b-9471-4c9d-8a7c-dd723dc556a3/.nfs00000000000200d1000008b2")
    Sep 15, 2023 17:49:49.456 [140006365915960] DEBUG - Cleaning directory for session 9d599c3d6e4fb7eb-com-plexapp-android (/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1)
    Sep 15, 2023 17:49:49.458 [140006365915960] ERROR - Failed to delete session directory (boost::filesystem::remove: Resource busy [system:16]: "/transcode/Transcode/Sessions/plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-7d8629d3-f086-48e6-b5ec-ad451c9edaa1/.nfs00000000000200cf000008b1")
    Sep 15, 2023 17:49:50.364 [140006263868216] DEBUG - Request: [10.42.6.1:48680 (Allowed Network (Subnet))] GET /identity (12 live) #1154 TLS Signed-in
    Sep 15, 2023 17:49:50.364 [140006402243384] DEBUG - Completed: [10.42.6.1:48680] 200 GET /identity (12 live) #1154 TLS 0ms 369 bytes


Output of the `top` command run inside the clusterplex-pms container after 1x Play and 2x Resume of the same video file:

    top - 18:28:43 up 4 days, 4:51, 0 users, load average: 5.33, 1.74, 1.87
    Tasks: 29 total, 1 running, 28 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 38.7 us, 6.3 sy, 16.7 ni, 36.2 id, 1.0 wa, 0.0 hi, 1.0 si, 0.0 st
    MiB Mem : 15883.0 total, 5198.3 free, 3307.1 used, 7377.6 buff/cache
    MiB Swap: 977.0 total, 977.0 free, 0.0 used. 12231.0 avail Mem

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    2425 abc       20   0  536372 396444  13156 S  90.8   2.4   0:09.01 originalTransco
    2308 abc       20   0  547524 430448  13640 S  55.2   2.6   1:12.62 originalTransco
    2373 abc       20   0  540404 398936  13024 S  48.0   2.5   0:26.28 originalTransco
    1420 abc       20   0  185904  68240  39152 S   2.6   0.4   1:03.71 Plex Media Serv
    1511 abc       20   0   41992  13164  11304 S   0.3   0.1   0:01.67 Plex Tuner Serv
    1544 abc       20   0   65936  45668  11332 S   0.3   0.3   0:05.53 Plex Script Hos
       1 root      20   0     208     60     44 S   0.0   0.0   0:00.03 s6-svscan
      15 root      20   0     212     64     48 S   0.0   0.0   0:00.00 s6-supervise
      16 root      20   0     200      4      0 S   0.0   0.0   0:00.00 s6-linux-init-s
      68 root      20   0     212     64     48 S   0.0   0.0   0:00.00 s6-supervise
      69 root      20   0     212     64     48 S   0.0   0.0   0:00.00 s6-supervise
      70 root      20   0     212     64     48 S   0.0   0.0   0:00.00 s6-supervise
      78 root      20   0     188      4      0 S   0.0   0.0   0:00.00 s6-ipcserverd
    1441 abc       35  15   64460  43988  11164 S   0.0   0.3   0:06.97 Plex Script Hos
    1462 root      20   0     212     64     48 S   0.0   0.0   0:00.00 s6-supervise
    1464 root      20   0    7368   3740   3484 S   0.0   0.0   0:00.00 bash
    1470 root      20   0   55216  12224  10576 S   0.0   0.1   0:00.01 nginx
    1471 www-data  20   0   55556   3292   1336 S   0.0   0.0   0:01.23 nginx
    1472 www-data  20   0   55556   3292   1336 S   0.0   0.0   0:00.00 nginx
    1473 www-data  20   0   55556   3292   1336 S   0.0   0.0   0:00.00 nginx
    1474 www-data  20   0   55556   3292   1336 S   0.0   0.0   0:00.00 nginx
    1608 abc       20   0   57164  37036  11296 S   0.0   0.2   0:03.13 Plex Script Hos
    1655 abc       20   0   55992  35828  11296 S   0.0   0.2   0:03.94 Plex Script Hos
    1662 abc       20   0   55460  35396  11384 S   0.0   0.2   0:03.89 Plex Script Hos
    1766 abc       20   0   55492  35356  11324 S   0.0   0.2   0:02.28 Plex Script Hos
    1788 abc       20   0   63524  43488  11272 S   0.0   0.3   0:03.05 Plex Script Hos
    2346 root      20   0    7632   4280   3664 S   0.0   0.0   0:00.01 bash
    2358 root      20   0   10440   3932   3224 R   0.0   0.0   0:00.12 top
    2414 abc       20   0  897552  56616  39108 S   0.0   0.3   0:00.39 node


The session directories cannot be deleted:

    root@clusterplex-pms-5b4d95555b-6gkdb:/# ls /transcode/Transcode/Sessions/
    plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-5654bcfc-d16f-48ad-a64a-1f03afff217a
    plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-6e8aa105-f538-4ea5-bac4-99692b1ba309
    plex-transcode-9d599c3d6e4fb7eb-com-plexapp-android-d36d3db8-80b2-4535-b388-ea15f61d8a08
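As a stopgap, stuck sessions like these can be cleared manually from inside the container. This is an editorial workaround sketch, not a ClusterPlex feature; the process name and paths come from the output above, and `cleanup_sessions` is a hypothetical helper:

```shell
cleanup_sessions() {
    # $1: session root, e.g. /transcode/Transcode/Sessions
    # The [o] bracket keeps pkill's -f regex from matching this script's
    # own command line.
    pkill -9 -f '[o]riginalTranscoder' || true
    # Give NFS a moment to release its .nfsXXXX silly-rename files,
    # which are what make the directories report "Resource busy".
    sleep 1
    rm -rf "$1"/plex-transcode-*
}
```

For example, `cleanup_sessions /transcode/Transcode/Sessions` would kill any leftover transcoders and then remove the now-deletable session directories.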



**Desktop (please complete the following information):**
clusterplex:latest  (Manifest Style)
K3s v1.26.6+k3s1
NUC6i3SYK nodes using Skylake HD 520 iGPU with VAAPI

**Additional context**
The worker containers launch PlexTranscoder and correctly stop the process and clean the directory when the video stops
(at least with software transcode; I have not gotten remote HW transcode working correctly yet).
pabloromeo commented 1 year ago

Thanks!! Will check it out and try to reproduce

pabloromeo commented 1 year ago

Unfortunately I haven't been able to reproduce it yet and get dangling transcoders to remain. They always shut down 3 or 4 seconds after playback is stopped. I've tested both a distributed transcoding request that fails due to lack of workers and having PMS's Operation Mode set to "local". Both remove the child processes after a few seconds. This was running the services locally so I could debug them, but I'll soon try k3s to see if the results are any different and I can get transcoders to stick around.

audiophonicz commented 1 year ago

After a little further testing this morning, I started seeing it on the worker as well. I'm seeing some VAAPI errors in the logs too, so regardless of what's going on there (we'll get to that), it takes a good 30+ seconds for the stream to actually start after pressing play. If I stop the playback before the stream actually starts on the device, the process continues. If I wait until the stream actually plays on the device, the transcoder exits normally. That might be the difference. Try starting and stopping it immediately, before playback starts.

pabloromeo commented 1 year ago

That 30 sec delay sounds really odd. It should not take more than a second or two for the worker to load the codecs and begin outputting segments. Could it be related to the medium used for shared storage? Like very slow writes or something like that? I'm using Longhorn as well, with a RWX volume (1 replica), but it's on NVMe drives.

What happens if you change the operation mode to "local"? Can you also cause these zombie transcoders to continue processing?

pabloromeo commented 1 year ago

I think you are right. When starting and stopping the playback quickly, I've been able to create a dangling transcoder. It seems Plex sends the unfriendly SIGKILL instead of SIGTERM, so it's difficult to handle being killed and do cleanup consistently. I'm going to try a different approach to see if I can make it more consistent: an additional monitoring script on both the main process and the child, to ensure children always get killed as well. I'll keep you posted.
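The monitoring-script idea can be sketched roughly as follows. This is a hypothetical illustration of the approach (the `sentinel` name and Python form are editorial, not ClusterPlex's actual code): since SIGKILL cannot be caught by the process it targets, a separate watcher has to notice that the main process is gone and then reap the child transcoder itself.

```python
import subprocess
import time


def sentinel(parent: subprocess.Popen, child: subprocess.Popen,
             poll_interval: float = 0.2) -> None:
    """Watch parent and child; if the parent dies (e.g. SIGKILLed),
    terminate the child transcoder so it cannot keep writing segments."""
    while True:
        if parent.poll() is not None:       # parent gone, possibly killed -9
            child.terminate()               # polite SIGTERM first
            try:
                child.wait(timeout=5)
            except subprocess.TimeoutExpired:
                child.kill()                # escalate to SIGKILL
            return
        if child.poll() is not None:        # child exited on its own
            return
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Demo with two stand-in processes instead of Plex's shim and transcoder.
    parent = subprocess.Popen(["sleep", "60"])
    child = subprocess.Popen(["sleep", "60"])
    parent.kill()                           # simulate Plex's SIGKILL
    sentinel(parent, child)
    assert child.poll() is not None         # the child was reaped too
```

The key design point is that the watcher survives the parent's SIGKILL, so cleanup no longer depends on the killed process handling a signal it can never see.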

audiophonicz commented 1 year ago

My 30 sec delay is definitely HEVC 10-bit. My Skylake iGPUs can only HW-decode HEVC 8-bit, not 10-bit, and Plex doesn't do a very good job of differentiating between the two. I see it trying different workers over and over, with some VAAPI errors in the logs, until either one finally catches with fast enough SW decode or it gives up with a playback error. When playing HEVC 8-bit or H.264, playback is almost instant. I'm also using Longhorn RWX (5 replicas), but on M.2 SATA III. Ultimately that's a HW issue on my side, so you can ignore the 30 sec delay.

I'm glad you found the dangling transcoder processes; I'm not crazy :)

Thinking everything was good now, I added my full library to this instance, and it's still detecting intros/credits, so I'm going to assume it's using the worker nodes to do that. But I noticed one of my workers has 30+ EasyAudioEncode processes running, another worker has about 10, and the third worker has none. I'll let you know if they clear up after the detections finish, or if that EasyAudioEncode is also dangling.

pabloromeo commented 1 year ago

After doing some testing, I've noticed that the dangling transcoder when you start and stop playback quickly is actually not isolated to ClusterPlex; it happens with a standard Plex install as well. It appears that in that situation Plex doesn't kill the transcoder, and in the logs you can see a message indicating that it identifies the session as idle after 180 seconds and then kills it.

Aside from that, I am working on a different approach to how the child transcoder processes are monitored, which should improve the speed at which regular transcodes are cleaned up once you stop playback or transcoding finishes. This won't fix Plex's inherent 3-minute wait in the first case I mentioned, but it should improve other situations where it may have taken 5 or 10 seconds to stop.

I'll let you know once I push that to the experimental branch in case you want to test it out.

Regarding credits detection: at the moment we cannot process that on the workers, as it seems Plex has hard-coded certain paths to use /tmp. To have it processed on a worker you'd have to network-share /tmp, which may cause other undesired consequences, so for the time being those are processed locally.

If Plex gets its act together and starts using the /transcode path as it does for other features, then yeah, we could handle those too, I would assume. I believe there's another issue discussing this.

audiophonicz commented 1 year ago

OK, so given the issues we've come across with EAE and the /tmp junk, I can totally see this is a base Plex issue, but I'm not seeing any process termination.

I wiped ClusterPlex this morning, down to the namespace, and redeployed on a dedicated node with nothing on it but the 6 components for Longhorn, MetalLB, and Intel GPU. It was using 2% CPU at idle before, and 7% after. No workers, just PMS and the orchestrator.

I imported a single file, no EAC3/EAE. Intro/Credits detection worked fine, no hanging processes.

I played 1 video, 1 time. HEVC and AAC direct-play on this TV, but because Roku sucks and I had subtitles enabled, it transcodes. It buffered, played, buffered after 10 sec, played, and died with the "your server is not powerful enough" error... to do subtitles, lol. (I repeated it a few times later and it always died at the 17 sec timestamp like clockwork.)

Anyway, the point is: I have 4x originalTranscoder processes running inside PMS. They've been running for well over 3 min (~20 min), each has over 3 min of CPU time (which, as you know, is not realtime), and they are actively creating files inside /transcode/Transcode/Sessions, so those dirs cannot be deleted. This also led to the node using 50% CPU, triggering the "server not powerful enough" nonsense. (I have PMS limited to 50%, so it runs out of resources.)

(screenshot: `top` output showing the four originalTranscoder processes)

In fact, if I let them run long enough, Plex actually marks the 23 min video that played for 17 sec as Watched!

pabloromeo commented 1 year ago

Any chance you could try the experimental version that I mentioned in the other issue?

That version tracks processes differently and is able to shut down transcoders gracefully even if Plex sends SIGKILL. Now I consistently see processes shut down either instantly when playback is stopped, or after 3 min when Plex cleans up the idle session it left behind.

With the experimental version, if you run `ps -auxwf` you'll actually see an extra sentinel.sh which is monitoring both the Plex shim and originalTranscoder, but all of them should be killed once transcoding stops.

audiophonicz commented 1 year ago

@pabloromeo I have not been able to reproduce this issue after switching to experimental. Closing this issue as corrected. Thank you for all your work on this project.