mediacms-io / mediacms

MediaCMS is a modern, fully featured open source video and media CMS, written in Python/Django and React, featuring a REST API.
https://mediacms.io
GNU Affero General Public License v3.0

Chunks show an encoding Fail status, but no apparent problems #929

Status: Open. KyleMaas opened this issue 10 months ago

KyleMaas commented 10 months ago

Describe the issue: Oftentimes, when transcoding to an MP4 profile, one of the chunks of an encoding comes back as an encoding fail, but the video seems to play fine if I pull the actual encoded file. The logs don't show any errors and look the same as those of successful chunks: Retries is 0, Status is Fail, and Total run time is also 0. But if I click on the Encoding file, it plays just fine and everything looks okay. I also never see this with VP9 encoding.

To Reproduce: I'm not sure how to reproduce this. However, I did not see this problem until I moved to a server setup with much higher capacity but also higher disk latency. I've raised the timeouts for Celery to allow each task to run for considerably longer, far longer than successful chunks take to encode, but the issue persists.
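
For reference, raising the Celery task time limits typically looks something like the sketch below; the values and placement are illustrative and not the exact settings of this deployment:

```python
# Illustrative only: raising Celery's per-task time limits.
# The exact values, and where they live in a MediaCMS deployment, may differ.
from celery import Celery

app = Celery("mediacms")
app.conf.task_soft_time_limit = 4 * 3600  # soft limit: SoftTimeLimitExceeded raised in the task after 4h
app.conf.task_time_limit = 5 * 3600       # hard limit: the worker process is killed after 5h
```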

Expected behavior: I would expect that the chunk, which from everything I'm seeing did in fact succeed in encoding, would be marked as Success instead. What I find most odd is that Total run time is 0; even for a failure, it seems like that should be filled in with some kind of value.

Screenshots: Unfortunately, I can't share screenshots of this setup.


KyleMaas commented 10 months ago

Aha, caught one in the act. From the Celery log:

[2023-12-02 15:06:34,881: ERROR/ForkPoolWorker-4] Task encode_media[91a5fb7e-4692-4c6f-b872-cfbd757d8985] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
    return self.run(*args, **kwargs)
  File "(path redacted)/files/tasks.py", line 365, in encode_media
    encoding.media_file.save(content=myfile, name=output_name)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/fields/files.py", line 99, in save
    self.instance.save()
  File "(path redacted)/files/models.py", line 1171, in save
    super(Encoding, self).save(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
    post_save.send(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "(path redacted)/files/models.py", line 1569, in encoding_file_save
    encoding.save()
  File "(path redacted)/files/models.py", line 1171, in save
    super(Encoding, self).save(*args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
    self.save_base(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
    post_save.send(
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "(path redacted)/files/models.py", line 1629, in encoding_file_save
    instance.media.post_encode_actions(encoding=instance, action="add")
  File "(path redacted)/files/models.py", line 643, in post_encode_actions
    tasks.create_hls(self.friendly_token)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/local.py", line 182, in __call__
    return self._get_current_object()(*a, **kw)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 761, in __protected_call__
    return orig(self, *args, **kwargs)
  File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/task.py", line 411, in __call__
    return self.run(*args, **kwargs)
  File "(path redacted)/files/tasks.py", line 445, in create_hls
    shutil.rmtree(output_dir)
  File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '(path redacted)/media_files/hls/5033c4cda075419cb11bd712b21a88c66Di5uhLjQ'

KyleMaas commented 10 months ago

I'm wondering at this point if maybe there are somehow two concurrent create_hls() calls running at the same time, with one erasing the directory while the other one is still copying into it.

KyleMaas commented 10 months ago

It is most definitely running multiple instances of create_hls() concurrently. I added logging to the beginning and end of it, as well as catching and ignoring the FileNotFoundError when rmtree() runs and logging when that happens.
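
The instrumentation was roughly along these lines (a minimal sketch; the real create_hls task in files/tasks.py does more, and output_dir here stands in for however the task computes the HLS output directory):

```python
# Minimal sketch of the added instrumentation (illustrative, not the actual
# create_hls task body).
import logging
import shutil

logger = logging.getLogger(__name__)

def create_hls(friendly_token):
    logger.info("create_hls: Starting HLS creation for media UID %s", friendly_token)
    output_dir = ...  # the per-media HLS output directory, derived by the real task
    # ... existing create_hls logic ...
    try:
        # This is the rmtree call that was raising FileNotFoundError.
        shutil.rmtree(output_dir)
    except FileNotFoundError:
        # Another concurrent create_hls run already removed this directory.
        logger.info(
            "create_hls: rmtree failure because file not found for media UID %s",
            friendly_token,
        )
    # ... more existing create_hls logic ...
    logger.info("create_hls: Finished HLS creation for media UID %s", friendly_token)
```

Here's the result: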

[2023-12-03 12:04:42,607: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,702: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,760: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,277: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,792: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,253: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,314: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,348: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,379: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,513: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,411: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,597: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,628: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:48,376: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:49,661: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:07:50,874: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,191: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,432: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:01,998: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,885: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,974: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,536: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,520: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,076: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,947: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,011: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,980: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,138: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,490: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,638: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,336: INFO/ForkPoolWorker-13] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,929: INFO/ForkPoolWorker-14] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,995: INFO/ForkPoolWorker-16] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,944: INFO/ForkPoolWorker-21] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,599: INFO/ForkPoolWorker-22] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,920: INFO/ForkPoolWorker-13] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,015: INFO/ForkPoolWorker-13] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,362: INFO/ForkPoolWorker-16] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,424: INFO/ForkPoolWorker-16] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,393: INFO/ForkPoolWorker-14] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,515: INFO/ForkPoolWorker-14] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,611: INFO/ForkPoolWorker-21] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,697: INFO/ForkPoolWorker-21] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:09,680: INFO/ForkPoolWorker-22] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:10,411: INFO/ForkPoolWorker-22] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6

KyleMaas commented 10 months ago

I need to do some more debugging, but from reading through the code, I think this assumption is not true:

https://github.com/mediacms-io/mediacms/blob/c5047d8df8686d75100e5099489be4fd1bf5f733/files/models.py#L1457

KyleMaas commented 9 months ago

So, I think what's happening is that a chunk gets to this point:

https://github.com/mediacms-io/mediacms/blob/c5047d8df8686d75100e5099489be4fd1bf5f733/files/models.py#L1493

If the disk throughput is low or the video is very long, there is nothing preventing another encoding save() from coming along and triggering another copy of this. Basically, any time an encoding is saved, the code rechecks whether all of the chunks are complete, and if so, it triggers another recombine of all of the chunks. This isn't triggered only when a chunk goes from in-progress to successful; it fires any time a chunk is saved during or after being marked successful, so any update can set it off. After the chunks are combined, the code goes through and deletes the chunks, but if a bunch of them are saved while it's combining them, that can cause multiple concurrent combine operations, resulting in multiple concurrent HLS creations when the combines finish.
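
Schematically, the pattern I'm describing looks something like this hypothetical sketch; it is not the actual encoding_file_save code, and Encoding, combine_chunks, and tasks are used here only to illustrate the flow:

```python
# Hypothetical sketch of the race: every save of a chunk Encoding re-runs the
# "are all chunks done?" check, so late saves during or after the combine can
# fire the combine and create_hls again, concurrently.
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Encoding)
def on_encoding_saved(sender, instance, **kwargs):
    media = instance.media
    chunks = Encoding.objects.filter(media=media, chunk=True)

    # The check only asks "does every chunk have a media file?", which stays
    # true while a combine is already running, so nothing stops a second
    # (third, fourth...) trigger from a later chunk save.
    if chunks and all(c.media_file for c in chunks):
        combine_chunks(media)                    # can now run multiple times in parallel
        tasks.create_hls(media.friendly_token)   # each combine then queues its own HLS run
```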

KyleMaas commented 9 months ago

It's possible that by changing these check loops:

https://github.com/mediacms-io/mediacms/blob/c5047d8df8686d75100e5099489be4fd1bf5f733/files/models.py#L1475

https://github.com/mediacms-io/mediacms/blob/c5047d8df8686d75100e5099489be4fd1bf5f733/files/models.py#L1480

so that they check that each chunk is in a "success" state, instead of just making sure that it has media, it should no longer be possible to trigger concurrent combines.
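
A rough sketch of what that stricter check could look like (hypothetical; the actual field names, status values, and query in files/models.py may differ):

```python
def all_chunks_succeeded(media):
    """Illustrative helper: consider the encoding round complete only when every
    chunk has actually reached the "success" status, not merely when it has a
    media file attached. Names here are assumptions, not the real code."""
    chunks = Encoding.objects.filter(media=media, chunk=True)
    return chunks.exists() and all(
        c.status == "success" and c.media_file for c in chunks
    )
```

With a check like that, chunk saves that happen while a combine is already in progress should no longer re-trigger the combine and the follow-on create_hls() calls.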