mediacms-io / mediacms

MediaCMS is a modern, fully featured open source video and media CMS, written in Python/Django and React, featuring a REST API.
https://mediacms.io
GNU Affero General Public License v3.0

Only finalize an encoding when all chunks have finished #938

Open KyleMaas opened 6 months ago

KyleMaas commented 6 months ago

Description

Make sure that all chunks are done before kicking off HLS creation. Fixes #929
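For illustration only (not the PR's actual diff): a minimal sketch of the intended check, assuming a Django `Encoding` model with `media`, `profile`, `chunk`, and `status` fields; the `concatenate_chunks` and `create_hls` helpers are hypothetical placeholders for the project's post-encode steps.

```python
# Hypothetical sketch, not the actual MediaCMS implementation. Field names
# (media, profile, chunk, status) and the helper functions are assumptions
# used only to illustrate the "wait for all chunks" check.
from files.models import Encoding  # assumed app/model layout


def all_chunks_done(media, profile):
    """True only when no chunk encoding for this media/profile is still pending or running."""
    chunks = Encoding.objects.filter(media=media, profile=profile, chunk=True)
    return not chunks.filter(status__in=["pending", "running"]).exists()


def maybe_start_post_encode(encoding):
    """Called when one chunk finishes; finalize only if every sibling chunk is done."""
    if all_chunks_done(encoding.media, encoding.profile):
        concatenate_chunks(encoding.media, encoding.profile)  # hypothetical helper
        create_hls(encoding.media)                            # hypothetical helper
```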

Steps

Pre-deploy

Post-deploy

KyleMaas commented 6 months ago

Hmm...actually, I've only tested this on successes. It probably doesn't handle cleanup correctly for failures. Converting this to a draft until I can get that figured out.

KyleMaas commented 6 months ago

Okay, I think this should be good now.

KyleMaas commented 5 months ago

So, after running this in production for a while, this is still a problem. It does fix the case where failures happen, but the check only runs after the chunk has already been marked as a success. That means that if the concatenate and post-encode processes take a long time, other chunks can also move into the "success" state, see that this one is in "success", and proceed to concatenate and post-encode as well. So this still needs some work. The check should really happen on the transition from "running" to "success", not after the chunk has already been marked "success".
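One way to avoid that race, sketched here purely as an illustration: make the flip to "success" and the "am I the last chunk?" decision happen in a single locked transaction, so only one worker can win. This assumes Django with a database that supports row locks (e.g. PostgreSQL); the model fields and the `run_post_encode` helper are assumptions, not the project's actual API.

```python
# Hypothetical sketch, assuming Django + a row-locking database (e.g. PostgreSQL).
# Field names and run_post_encode are illustrative placeholders.
from django.db import transaction
from files.models import Encoding  # assumed app/model layout


def mark_success_and_maybe_finalize(encoding_id):
    with transaction.atomic():
        # Lock this chunk's row, then flip it to success.
        enc = Encoding.objects.select_for_update().get(pk=encoding_id)
        enc.status = "success"
        enc.save(update_fields=["status"])

        # Lock the sibling chunk rows in a stable order so concurrent workers
        # serialize here instead of both deciding they are the last chunk.
        siblings = list(
            Encoding.objects.select_for_update()
            .filter(media=enc.media, profile=enc.profile, chunk=True)
            .exclude(pk=enc.pk)
            .order_by("pk")
        )
        is_last = all(s.status == "success" for s in siblings)

    if is_last:
        # Only the worker that completed the final chunk reaches this branch.
        run_post_encode(enc.media)  # hypothetical: concatenate chunks + build HLS
```

Because the workers serialize on the locked sibling rows, at most one of them observes every chunk in the "success" state, so concatenation and HLS creation run exactly once.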

mgogoulos commented 5 months ago

@KyleMaas thanks for submitting this PR along with the others, I'm struggling to find some time to handle all of them...

As far as this one goes, I'm wondering in what cases you encounter the problem, like what types or volumes of video are necessary for it to happen. I haven't seen it myself, but it could be that I've missed it, especially if there's not much logging. Perhaps add more log messages to help debug the case?

KyleMaas commented 5 months ago

@mgogoulos The main issues come in with long, very large videos. For example, I've got a server processing one video right now that's about 3 hours long. Fairly high resolution. Don't remember the file size, but it's in the several gigabytes range. That server's been processing that one video for about 22 hours. And currently it has one Bento operation running and another 30 concurrent instances of cp -rT copying to the same output directory. Usually there are more Bento instances running concurrently as well, but I think this one's just about done processing so now it's mostly just a bunch of copy operations stepping on each other.

KyleMaas commented 5 months ago

@mgogoulos Had some more time to track this down further, and I think this does actually fix part of the problem. But there is also a separate problem I just filed as #962 which is at least as bad if not worse. So I think this one's still valid and is actually probably correct.