christensenjairus closed this issue 6 months ago
From the logs it appears that what is failing is Plex's EasyAudioEncoder (EAE). Could you please restart one of the workers and take note of the logs while it starts? They will show the EAE paths and whether the worker was able to download that encoder from Plex. Also, given that you are using Longhorn, you might want to try a Longhorn RWX volume. That's how I run it and I haven't had issues yet (it's NFS behind the scenes, but I didn't have to deal with any mount options).
Okay, I'll do that now. I forgot to mention that I switched over to NFS after seeing this issue, and then the issue came back. At the time of this log, I was using a Longhorn RWX volume (1 replica) attached to 7 workers. I had thought that the overhead of Longhorn volumes was causing it, but I guess not. I'll go back to Longhorn for testing if it'll help narrow down the issue.
I'll edit my original post to just be Longhorn - that's really how I'd like to run it anyway.
Edit: Done
This is the startup logs from the specific worker that had the failure RawEventsFromKube5WorkerThatHadError.txt
And here are some more surrounding logs of the failure Kube5ErrorLogs.txt
EDIT: I realized that my logs from the original issue message were from all the workers together, so I updated it to be just the isolated logs from the one worker.
On Longhorn as well, this is happening pretty frequently. When it happened last, I deleted the workers' config PVs and restarted them (to download EAE again). I see these error logs resurface after about 8 hours.
So, the errors go away when the PV is recreated, and then appear again after about 8 hours? Quite odd. I'm currently unable to reproduce it. How about reducing the number of replicas, does that prevent the issue?
I ran this a second time with 3 replicas and the issue resurfaced. If it helps, this occurred (mostly) on a Windows PC watching an .mkv file.
This issue is stale because it has been open for 30 days with no activity.
I seem to be seeing the same issue on my end as well. I am running near the exact same setup with 7 replicas. Users start to complain of audio cutting in and out after about 2 days of uptime. The only resolution I have found is to restart the clusterplex container.
Please try the latest release, given that it's a fix specifically targeting EasyAudioEncoder, which was failing after recent changes by Plex.
https://github.com/pabloromeo/clusterplex/releases/tag/v1.4.6
Updated Helm Chart version that installs v1.4.6 is v1.1.1
Did a fresh install with Helm Chart v1.1.1 and I am receiving this error.
So the common situation for everyone is: Longhorn with an RWX volume for transcoding, EasyAudioEncoder showing errors in the logs, and segments of silence being added to the audio?
Has anyone attempted changing the Longhorn data-locality setting to strict-local to see if having a replica physically co-located with the worker changes anything?
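For anyone wanting to try that, data locality can be set per-volume in the Longhorn UI or via a StorageClass parameter. A sketch, assuming a dedicated StorageClass for the transcode PVC (the name is made up, and note that strict-local only works with single-replica volumes):

```yaml
# Hypothetical StorageClass for the transcode volume with strict-local
# data locality; Longhorn requires numberOfReplicas=1 for strict-local.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-strict-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "1"
  dataLocality: "strict-local"
```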
@pabloromeo I just started seeing this "adding samples of silence" myself this morning. Same Longhorn RWX setup: 5 nodes, 3 workers.
I have 5 replicas across 5 nodes, so each node has its own copy. However, Longhorn shows the Attached Node & Endpoint as node1 only, so as far as I'm aware all 3 workers are using that single endpoint on node1. Data locality is set to best-effort, by the way.
Incidentally, worker0 on node1 (the node with the endpoint) has no EasyAudioEncoder processes running, while worker1 on another node has 10 processes in top, and worker2 on a third node has 30+ EasyAudioEncoder processes in top (we're talking 63+ minutes of TIME+).
Can you provide the logs for when one of the worker containers is initializing and also right when it triggers the transcode?
Also, if you delete the entire /codecs contents, do the errors still occur?
It's currently detecting intros and credits, so it's going to be a bit before I can get you a clean log of just what you want to see, but I will.
But I can update on the data-locality question: I scaled the workers back to just worker0, which is on the same node as the /transcode and /config endpoint mounts, and I still got the "adding silence" logs (PMS, worker, /config, and /transcode all on the same node).
I quit the scan and started over, and got you logs. I'd say reinstalling from scratch counts as deleting the entire /codecs contents. These logs overlap at 'Worker connected on socket'.
Sorry for poor formatting
EasyAudioEncoder uses the hardcoded path of /tmp as the location for generated segments, which are later picked up by the actual transcoder. You haven't by any chance mapped /tmp to a shared location or anything like that, correct? You can also take a look in /tmp while transcoding content that requires EAE and see whether it has created files in there.
For example:

```
'./pms-a8728a59-283f-48aa-ae99-20766d56b3d6/EasyAudioEncoder/Convert to WAV (to 8ch or less)':
total 784K
drwxr-xr-x 2 abc abc 4.0K Sep 18 13:05 .
drwxr-xr-x 8 abc abc 4.0K Sep 18 12:53 ..
-rw-r--r-- 1 abc abc  50K Sep 18 13:05 k8m7zaotz4hv2txj7ekoq929_1137-0-611.ec3
-rw-r--r-- 1 abc abc 721K Sep 18 13:05 k8m7zaotz4hv2txj7ekoq929_1137-0-611.wav
```
No, I have not mapped /tmp to anything; it's au naturel.
I added 10 episodes yesterday and 12 today, and there are now 944K of files and counting inside that same 'Convert to WAV (to 8ch or less)' directory from today and yesterday. (It did not remove the 600K of WAV files after it finished detecting intros yesterday, it seems.)
There is a .ec3 and a .tmp file in there at times too.
I haven't been able to reproduce the errors around inserting silence, neither locally nor on k3s with Longhorn. Have you found a public domain video that causes it, so that I can try it and see if it's maybe something with a specific codec setting?
Just in case it might be related, I've released a newer version which upgrades EAE to a more recent build. Unfortunately I still haven't found a way to identify the latest version available from Plex, so we basically have to try incremental version numbers against this URL until we get a hit: https://plex.tv/api/codecs/easyaudioencoder?build=linux-x86_64-standard&deviceId=asdf9396-e49c-45cb-bc03-00b46ff03fd9&oldestPreviousVersion=1.32.5.7349-8f4248874&version={version_number}
The latest one I found is version 1983, which is the one I referenced in yesterday's release.
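For reference, that probing can be scripted. A rough sketch that just generates candidate URLs (the build, device ID, and oldestPreviousVersion parameters are copied verbatim from the URL above), which you can then feed to curl:

```shell
# Generate probe URLs for candidate EAE build numbers; parameters are
# taken from the endpoint mentioned above.
base='https://plex.tv/api/codecs/easyaudioencoder?build=linux-x86_64-standard&deviceId=asdf9396-e49c-45cb-bc03-00b46ff03fd9&oldestPreviousVersion=1.32.5.7349-8f4248874'

# Print the full probe URL for a given candidate version number.
eae_url() {
  printf '%s&version=%s\n' "$base" "$1"
}

# Example: emit URLs for a range of candidate builds.
for v in $(seq 1980 1983); do
  eae_url "$v"
done
```

For example, `eae_url 1984 | xargs -n1 curl -s -o /dev/null -w '%{http_code}\n'` shows whether that build is actually served.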
It might be a long shot, but if these are large libraries, could it be related to the inotify max-watches issue that has popped up over the years in relation to EAE and Plex?
https://reddit.com/r/PleX/s/IStFVBkdbc
My test environment has a small library so maybe that's why I can't reproduce it.
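If anyone wants to rule that out, the current limit can be checked on each node with `sysctl fs.inotify.max_user_watches` and raised via a sysctl drop-in (524288 is just a commonly used value, not an official recommendation):

```
# /etc/sysctl.d/99-inotify.conf -- apply with: sysctl --system
fs.inotify.max_user_watches = 524288
```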
> It might be a long shot, but if these are large libraries,
Sorry mate, I'm doing a single episode with EAC3 audio here. Forgive me, I'm not sure how to see which exact dockermod is running inside the container... I did delete all pods and reinstall with imagePullPolicy: Always, so I'm hoping I got your newest release. If so, no joy.
HOWEVER: just for funzies, I scaled back the workers and deleted the PMS container to get a fresh setup, and let the PMS container do the intro detection itself (repeated x2 for verification):
No Workers - New PMS
delete library, delete PMS container, scale the workers back up to 1, add library to scan:
PASS No.1: New Worker - New PMS
PASS No.2: Same Worker - Same PMS
PASS No.3 Same Worker - New PMS
I really hope this points you in the right direction so I'm not just wasting cycles here lol
@audiophonicz Yes, that info is very useful. In fact it led me to implement a different approach to launching EAE. Initially we were launching one EAE per transcode request; however, your description led to exploring whether Plex actually only needs a single process per replica to handle all transcodes, which appears to be the case.
I've also recently been testing the new approach to child-process monitoring, and it seems to be working better now regarding dangling transcoders (except for the case where it's Plex's bug and we must wait 3 minutes for it to get killed). Ultimately, after some time, all child processes get killed.
There is one exception though, and that is the EAE process. It gets started on the first transcode and remains alive until the worker is restarted. That is intentional: you will now see a single EAE child of the worker.js process that remains, is shared among N transcodes, and shouldn't die.
If you find the time and would like to test out these changes, you can switch to the experimental version.
If you're using the Helm chart, that would be specifying the clusterplexVersion: experimental setting in values.yaml.
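For example (the key name is the one just mentioned; check your chart version for the exact nesting):

```yaml
# values.yaml snippet: run the ClusterPlex components on the
# experimental image tag (nesting may differ per chart version)
clusterplexVersion: experimental
```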
@pabloromeo No, I'm not using the Helm version, I'm using the Manifest version, which brings up two things: are you keeping the Manifest/Kubernetes version link up to date with your changes, if need be? Am I missing any changes by not using the Helm chart? And is it as simple as using clusterplex_dockermod:experimental?
I am not using the Helm chart because it doesn't look like it's using the recent common templates, and therefore it doesn't accept the extraVolumes and extraVolumeMounts Helm values. That means I cannot specify my NFS mount, because your media section looks hardcoded to PVs. Using the Manifest is easier to customize at the moment.
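For context, this is the kind of common-library values block I mean (the server, export path, and mount path are made-up examples):

```yaml
# What extraVolumes / extraVolumeMounts support would allow:
# mounting an arbitrary NFS media share (hypothetical values).
extraVolumes:
  - name: media
    nfs:
      server: nfs.example.lan
      path: /export/media
extraVolumeMounts:
  - name: media
    mountPath: /media
```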
I'll give the experimental version a try when I get a chance; month end has been beating me up lately. I have a few ideas of my own to test out as well, after looking more into Longhorn and the underperformance of RWX volumes.
Yup, just changing the dockermod tag to experimental is enough in this case :).
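In the manifests that's the DOCKER_MODS environment variable on the containers; a sketch (the registry path is an assumption based on the dockermod name mentioned above, so double-check it against the published manifests):

```yaml
# Container env sketch: linuxserver.io docker-mods pointing at the
# experimental ClusterPlex dockermod (image path is an assumption)
env:
  - name: DOCKER_MODS
    value: "ghcr.io/pabloromeo/clusterplex_dockermod:experimental"
```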
@pabloromeo OK, trying again with experimental, omitting the tests that didn't generate silence logs initially.
Pass1 - No Workers - New PMS
delete library, re-add library
Pass2 - No Workers - Same PMS
delete library, spun up worker, re-add library
Pass3 - New Worker - Same PMS
Oops, my worker wasn't fully started on Pass3 - retried again
Pass3v2 - "New" Worker - Same PMS
delete library, re-add library
Pass4 - Same Worker - Same PMS
OK, so none of these tests resulted in stale WAV files left over in /tmp, nor any "adding samples of silence" logs in PMS, nor any additional EasyAudioEncoder processes in any of the containers.
So if I'm reading that correctly all those scenarios worked as expected. No dangling processes or audio cutting out.
FYI the EAE Service on the main PMS is expected to be running and idle during remote transcodes. It is started by Plex itself, but shouldn't cause any harm or consume resources.
> So if I'm reading that correctly all those scenarios worked as expected. No dangling processes or audio cutting out.
> FYI the EAE Service on the main PMS is expected to be running and idle during remote transcodes. It is started by Plex itself, but shouldn't cause any harm or consume resources.
I do believe so, yes. It seems to be working now on my installation.
Excellent. So as soon as I can, I'll merge that to main and do a new versioned release with this implementation and the newer Plex EAE distributable, so you can move off of experimental, since that's where we try weird things out and it might break your environment in some cases.
@pabloromeo I am testing with experimental today. I am running PMS on one node and 2 workers on other nodes. I haven't been able to get it working without experimental, so I'm not sure whether these issues also occur without it. I started this server from a fresh config using a slightly modified Helm template (I added env support, which was not working, for FFMPEG_HWACCEL: vaapi, set limits with amd.com/gpu: 1, and currently mount the Plex config to the workers due to issue #223).
This issue is stale because it has been open for 30 days with no activity.
Hi! I was wondering when this will be merged into main? I have the same problem, though strangely not all the time. I run the cluster on GlusterFS.
Finally got around to merging it :) Sorry for the delay
Describe the bug
My worker nodes are having issues decoding, and the logs are pretty vague. Because of the issue, they're giving up and inserting silence into the stream, making the audio cut in and out.
My transcode folder is an RWX volume in Longhorn.
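Roughly, the shared claim looks like this (the name, StorageClass, and size here are illustrative, not my exact values.yml):

```yaml
# Sketch of a Longhorn RWX PVC shared for the /transcode directory
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: plex-transcode
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
```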
To Reproduce
Use ArgoCD to deploy ClusterPlex (master branch) with the values.yml included in the Additional context section.

Expected behavior
No errors while decoding; a clear audio stream without cutting out.
Additional context
Log file:
Values.yml