ressu / kube-plex

Scalable Plex Media Server on Kubernetes -- dispatch transcode jobs as pods on your cluster!
Apache License 2.0

Delay between stream requests and streaming begins #47

Open rkbennett opened 1 year ago

rkbennett commented 1 year ago

Kubernetes: 1.27.2 Plex: 1.32.4.7195

When starting a stream, the elastic transcode pod gets spawned almost immediately. However, it takes more than a minute for the stream to actually begin.

Any thoughts on what could be causing this behavior? I don't see any errors in the logs.

rkbennett commented 1 year ago

This is partially related, but do you think it'd be feasible to disable the codec collection logic and just do a shared volume between the plex server and the elastic pods? Based on what I've been reading plex will pull the codecs on start-up if they aren't in the codecs folder. If the logic for pulling the codecs off of the plex server was removed, we could potentially just leverage the normal behavior of plex and let it manage the codec pulls. Would also have an added benefit of not having to re-pull the codecs with every spawned transcode pod.

ressu commented 1 year ago

Honestly, I don't know what could be causing the delay. The codec binary blob download should be near instant as we are talking about a handful of megabytes of data.

The reason why I don't want the native mechanism to do the work of downloading the codecs is twofold.

  1. The download would happen from outside the network. So inherently the time would be longer than downloading from the other pod.
  2. The version of the codecs might be different, which makes troubleshooting very hard. At least this way the codecs are always the same version on all transcoders.

Another issue would be the number of downloads being triggered. Since a transcode could be triggered tens of times during a regular watch session (due to a change of framerate, for example), this would also cause a lot of noise on the Plex download servers, which is something I don't want to do. I want to be a good neighbor here.

Another option, and something that kube-plex used to do, is to share the config folder. The problem here is that shared write-many block storage is significantly more complex to set up and maintain than single-writer block storage. Something like Ceph could definitely handle this, but it's a lot of overhead just to run kube-plex. There is also the problem of file locks. Those are often a lot slower than with plain old block storage, and since Plex uses sqlite internally, file locking is a bottleneck.

I guess a cached storage volume could be an option, maybe having an ephemeral local storage for that. It's not something that I've spent too much time thinking about, but in theory should be possible.

rkbennett commented 1 year ago

It shouldn't be an issue with SQLite, I just switched to mayastor for block storage and when I run regular Plex the streams start almost immediately. So I'm not sure where else the issue could be. The pods also spawn within a second, so it definitely has to be either some communication between Plex and transcode or something on the transcode itself.

ressu commented 1 year ago

One thing we could try is to add telemetry to the connection bits to make sure that the connections are happening in a timely fashion. This would also allow confirming the assumption that the codec download doesn't take too much time.

Considering that logging seems to work reliably, a quick way to do this would be to just write the time it takes to connect to the log. That way we get telemetry directly in Plex internal logging.
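As a rough illustration of the idea above, here is a minimal sketch of timing a TCP connection and logging the result. It is written in Python purely for illustration and is not kube-plex code; the function name `timed_connect` and the logger name are made up for this example.

```python
import logging
import socket
import time

# Hypothetical logger name, used only in this sketch.
log = logging.getLogger("connect-timing")


def timed_connect(host: str, port: int, timeout: float = 5.0) -> float:
    """Open a TCP connection, log how long it took, and return the elapsed seconds."""
    start = time.monotonic()
    # create_connection resolves the host and completes the TCP handshake.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.monotonic() - start
    log.info("connected to %s:%d in %.1f ms", host, port, elapsed * 1000)
    return elapsed
```

Calling something like `timed_connect("plex.example.internal", 32400)` from the transcode pod (the hostname is a placeholder; 32400 is the default Plex port) would show whether the connection itself contributes to the delay. The same measurement could be wrapped around the codec download to check that assumption as well.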

rkbennett commented 1 year ago

Okay, so I did a bit more testing and switched the transcode PVC to local, and now the TV shows start streaming within 2 seconds. The movies, however, still don't stream at all. In fact, when I try to stream them, no files are ever generated in the transcode session directory. I also don't see anything that stands out in the logs. I can see all the codecs being loaded as well, just no files are ever written. The only thing I can see is a log line that repeats: Completed: [10.244.2.1:43736] 404 GET /video/:/transcode/universal/session/zuht8tkcbnm60mc7wkc3ahs9/0/header (16 live) #cd3 GZIP 22326ms 379 bytes (pipelined: 1)
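To spot slow or failing requests like the one quoted above across a whole log file, a small parser can help. The sketch below is Python for illustration only, and its pattern is inferred from that single log line, so treat the field layout as an assumption; other Plex log lines may not match. The names `COMPLETED_RE` and `parse_completed` are invented for this example.

```python
import re
from typing import Optional

# Pattern inferred from the one "Completed:" line quoted in this thread.
COMPLETED_RE = re.compile(
    r"Completed: \[(?P<client>[^\]]+)\] (?P<status>\d{3}) (?P<method>[A-Z]+) "
    r"(?P<path>\S+) .*?(?P<ms>\d+)ms (?P<bytes>\d+) bytes"
)


def parse_completed(line: str) -> Optional[dict]:
    """Extract client, status, method, path, duration and size, or None if no match."""
    m = COMPLETED_RE.search(line)
    if m is None:
        return None
    return {
        "client": m["client"],
        "status": int(m["status"]),
        "method": m["method"],
        "path": m["path"],
        "duration_ms": int(m["ms"]),
        "bytes": int(m["bytes"]),
    }
```

For the line in this comment, the parser yields status 404 and a duration of 22326 ms, which makes it easy to filter for requests on the `/header` path that take tens of seconds before failing.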

ressu commented 1 year ago

Which codec are the movies using? I know that there are some codecs that rely on direct access to temporary files to actually work. That's the reason for the whole bypass list solution. This sounds a lot like the transcoder is waiting for some signal from Plex but just hanging because it's missing.

rkbennett commented 1 year ago

I know one of them was vc1/ac3. I did some more testing, and it also seems to work only on my TV shows with smaller file sizes. I have some that are 14GB and it won't work on those either.

ressu commented 1 year ago

I don't think I have files that big, but seems odd that the size would matter in this case as the same operation would need to be done without kube-plex too. I'm very much running out of ideas here..

rkbennett commented 1 year ago

Mhm, definitely a head-scratcher. Haven't had a chance to do any more testing this weekend, but I'll hopefully have some time to try a couple tests tinkering with the pod command here tomorrow.

rkbennett commented 3 months ago

Okay, so I know it's been a while but I finally got around to upgrading my backplane network to 10Gb/s. So after a little more testing with a faster connection to the network share that is hosting the media files what I'm seeing is:

  - 25 min episodes ~> 9 second delay
  - 45 min episodes ~> 15 second delay
  - movies ~> 1 minute delay

There's definitely something with the file size that is coming into play here. Not sure how the processing is done for the transcode but definitely something to this.
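A quick bit of arithmetic on the two episode figures above shows how close to linear the relationship is. This is a Python sketch for illustration; the movie figure is left out because its runtime is not given in the thread, and runtime is only a stand-in for file size here.

```python
# Figures reported above: (episode length in minutes, startup delay in seconds).
observations = [(25, 9), (45, 15)]


def delay_per_minute(length_min: float, delay_s: float) -> float:
    """Startup delay in seconds per minute of episode runtime."""
    return delay_s / length_min


ratios = [delay_per_minute(m, d) for m, d in observations]
for (m, d), r in zip(observations, ratios):
    print(f"{m} min episode: {d} s delay, {r:.2f} s per minute of runtime")
```

Both episodes come out at roughly a third of a second of delay per minute of runtime (0.36 and 0.33), which is consistent with the delay scaling with the amount of data rather than being a fixed startup cost.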

ressu commented 3 months ago

This sounds a lot like the transcoding process itself is taking time to scan the file before the actual transcoding starts. This means there is not much we can do about it in Kube-Plex.

What is slightly bugging me is why this doesn't happen with the native transcoding. The transcoding process needs to scan the same portions regardless. The first thing that comes to mind is that it's due to filesystem cache, but that's a whole lot of data to cache, and where is the caching hidden in the native solution?

An interesting thing to see would be to check the network traffic from the node to see how much data is transferred before the playback starts and at what rate when compared to playback time data transfers.
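One way to get those numbers on a Linux node is to sample the interface byte counters in `/proc/net/dev` before and during playback. The following is a minimal Python sketch for illustration; the helper names `parse_net_dev` and `rx_rate` are made up here, and the interface name to watch depends on the node.

```python
import time
from typing import Dict, Tuple


def parse_net_dev(text: str) -> Dict[str, Tuple[int, int]]:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rx_rate(iface: str, interval: float = 1.0, path: str = "/proc/net/dev") -> float:
    """Return received bytes per second on iface, measured over one interval."""
    with open(path) as f:
        before = parse_net_dev(f.read())[iface][0]
    time.sleep(interval)
    with open(path) as f:
        after = parse_net_dev(f.read())[iface][0]
    return (after - before) / interval
```

Running `rx_rate("eth0")` in a loop (with `eth0` replaced by the node's actual interface) from the moment play is pressed would show how much data is pulled before the first frame and whether the rate during that phase is near the link limit or well below it.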

rkbennett commented 3 months ago

I've kind of already done that, so if I try to play the same file again after it has started once, the transcode is instant. Which I'm assuming is because my NAS has the file in cache at that point. If I wait long enough and try again on the same file the delay comes back.
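The cold versus warm behavior described above can be measured independently of Plex by timing a read of the start of the file twice. This is a Python sketch for illustration only; `time_read` is a made-up helper, and the 64 MiB default is an arbitrary choice, not a value the transcoder is known to read.

```python
import time


def time_read(path: str, nbytes: int = 64 * 1024 * 1024, chunk: int = 1024 * 1024) -> float:
    """Read up to nbytes from the start of path and return the seconds taken."""
    remaining = nbytes
    start = time.monotonic()
    # buffering=0 avoids Python's own buffer; OS and NAS caches still apply.
    with open(path, "rb", buffering=0) as f:
        while remaining > 0:
            data = f.read(min(chunk, remaining))
            if not data:  # reached end of file
                break
            remaining -= len(data)
    return time.monotonic() - start
```

Calling `time_read` on the same media file twice in a row from inside a transcode pod, and again after a long idle period, would give numbers to compare against the playback delays. Note that the second read may be served from the local page cache on the node rather than from the NAS cache, so the two effects are hard to separate with this test alone.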