Open KyleMaas opened 10 months ago
Aha, caught one in the act. From the Celery log:
[2023-12-02 15:06:34,881: ERROR/ForkPoolWorker-4] Task encode_media[91a5fb7e-4692-4c6f-b872-cfbd757d8985] raised unexpected: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 477, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 760, in __protected_call__
return self.run(*args, **kwargs)
File "(path redacted)/files/tasks.py", line 365, in encode_media
encoding.media_file.save(content=myfile, name=output_name)
File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/fields/files.py", line 99, in save
self.instance.save()
File "(path redacted)/files/models.py", line 1171, in save
super(Encoding, self).save(*args, **kwargs)
File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
self.save_base(
File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
post_save.send(
File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
return [
File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
(receiver, receiver(signal=self, sender=sender, **named))
File "(path redacted)/files/models.py", line 1569, in encoding_file_save
encoding.save()
File "(path redacted)/files/models.py", line 1171, in save
super(Encoding, self).save(*args, **kwargs)
File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 814, in save
self.save_base(
File "/home/mediacms.io/lib/python3.10/site-packages/django/db/models/base.py", line 892, in save_base
post_save.send(
File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 176, in send
return [
File "/home/mediacms.io/lib/python3.10/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
(receiver, receiver(signal=self, sender=sender, **named))
File "(path redacted)/files/models.py", line 1629, in encoding_file_save
instance.media.post_encode_actions(encoding=instance, action="add")
File "(path redacted)/files/models.py", line 643, in post_encode_actions
tasks.create_hls(self.friendly_token)
File "/home/mediacms.io/lib/python3.10/site-packages/celery/local.py", line 182, in __call__
return self._get_current_object()(*a, **kw)
File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/trace.py", line 761, in __protected_call__
return orig(self, *args, **kwargs)
File "/home/mediacms.io/lib/python3.10/site-packages/celery/app/task.py", line 411, in __call__
return self.run(*args, **kwargs)
File "(path redacted)/files/tasks.py", line 445, in create_hls
shutil.rmtree(output_dir)
File "/usr/lib/python3.10/shutil.py", line 715, in rmtree
onerror(os.lstat, path, sys.exc_info())
File "/usr/lib/python3.10/shutil.py", line 713, in rmtree
orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '(path redacted)/media_files/hls/5033c4cda075419cb11bd712b21a88c66Di5uhLjQ'
I'm wondering at this point if there are somehow two concurrent create_hls() calls running at the same time, with one erasing the directory while the other is copying it.
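Regardless of the root cause, one low-impact mitigation for the symptom is to make the cleanup tolerant of the directory already having been removed by another worker. A minimal sketch (the function name is hypothetical, not actual MediaCMS code):

```python
import shutil


def remove_tree_tolerant(path):
    """Remove a directory tree, tolerating another worker having
    already deleted it concurrently (as happens when two create_hls()
    runs race on the same output directory)."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        # Another run removed it first; nothing left to clean up.
        pass
```

Note that `shutil.rmtree(path, ignore_errors=True)` is a blunter alternative that also swallows permission errors, which may hide real problems.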
It is most definitely running multiple instances of create_hls() concurrently. I added logging to the beginning and end of it, as well as catching and ignoring the FileNotFoundError when rmtree() runs and logging when that happens. Here's the result:
[2023-12-03 12:04:42,607: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,702: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:42,760: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,277: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:43,792: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,253: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,314: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,348: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,379: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,513: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,411: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,597: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:44,628: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:48,376: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:04:49,661: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:07:50,874: INFO/ForkPoolWorker-2] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,191: INFO/ForkPoolWorker-2] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:00,432: INFO/ForkPoolWorker-9] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:01,998: INFO/ForkPoolWorker-2] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,885: INFO/ForkPoolWorker-9] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:05,974: INFO/ForkPoolWorker-9] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,536: INFO/ForkPoolWorker-7] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:06,520: INFO/ForkPoolWorker-6] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,076: INFO/ForkPoolWorker-8] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,947: INFO/ForkPoolWorker-6] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,011: INFO/ForkPoolWorker-6] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:09,980: INFO/ForkPoolWorker-7] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,138: INFO/ForkPoolWorker-7] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,490: INFO/ForkPoolWorker-8] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:08:10,638: INFO/ForkPoolWorker-8] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,336: INFO/ForkPoolWorker-13] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,929: INFO/ForkPoolWorker-14] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,995: INFO/ForkPoolWorker-16] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:04,944: INFO/ForkPoolWorker-21] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,599: INFO/ForkPoolWorker-22] create_hls[None]: Starting HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:05,920: INFO/ForkPoolWorker-13] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,015: INFO/ForkPoolWorker-13] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,362: INFO/ForkPoolWorker-16] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,424: INFO/ForkPoolWorker-16] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,393: INFO/ForkPoolWorker-14] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,515: INFO/ForkPoolWorker-14] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,611: INFO/ForkPoolWorker-21] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:06,697: INFO/ForkPoolWorker-21] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:09,680: INFO/ForkPoolWorker-22] create_hls[None]: rmtree failure because file not found for media UID 5033c4cda075419cb11bd712b21a88c6
[2023-12-03 12:13:10,411: INFO/ForkPoolWorker-22] create_hls[None]: Finished HLS creation for media UID 5033c4cda075419cb11bd712b21a88c6
Need to do some more debugging, but from reading through the code I think this assumption is not true:
So, I think what's happening is that a chunk gets to this point:
If disk throughput is low or the video is very long, nothing prevents another encoding save() from coming along and triggering another copy of this process. Any time an encoding is saved, the code rechecks whether all of the chunks are complete, and if so, it triggers another recombine of all the chunks. This isn't triggered only when a chunk transitions from in-progress to successful; it fires any time a chunk is saved during or after being marked successful, so any update can set it off. After the chunks are combined, the code deletes them, but if several chunks are saved while the combine is running, that can cause multiple concurrent combine operations, which in turn produce multiple concurrent HLS creations when the combines finish.
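One caveat for any fix here: because Celery workers are separate forked processes, an in-process lock can't serialize the combine across workers; MediaCMS would need a database-level atomic claim (e.g. a conditional queryset `.update()`) or a distributed lock. The compare-and-set idea, sketched in-process purely for illustration (all names hypothetical):

```python
import threading

# Illustration only: an in-process compare-and-set "claim" on the
# combine step. Forked Celery workers don't share memory, so a real
# fix would need a DB-level conditional update or a distributed lock.
_combine_claimed = {}
_claim_lock = threading.Lock()


def try_claim_combine(media_uid):
    """Return True for the first caller to claim the combine for a
    given media item; later callers get False and should bail out."""
    with _claim_lock:
        if _combine_claimed.get(media_uid):
            return False
        _combine_claimed[media_uid] = True
        return True
```

A worker that gets False would simply return instead of starting a second combine (and second HLS creation) for the same media.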
It's possible that changing these check loops to verify that each chunk is in a "success" state, instead of just checking that it has a media file, would prevent concurrent combines.
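The stricter check could look something like the sketch below, written against plain dicts rather than the actual MediaCMS models (the field names `media_file` and `status` are assumptions, not the project's real schema):

```python
# Hypothetical sketch of the stricter readiness check: require every
# chunk to be in a terminal "success" state, not merely to have a media
# file attached, so a save during an in-flight combine can't retrigger it.
SUCCESS = "success"


def all_chunks_ready(chunks):
    """Return True only if every chunk has both a media file and a
    success status."""
    return all(
        c.get("media_file") and c.get("status") == SUCCESS
        for c in chunks
    )
```

This alone narrows the window but doesn't fully close it, since two saves landing after the last chunk succeeds could still both pass the check.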
Describe the issue
Often, when transcoding to an MP4 profile, one of the chunks of an encoding comes back as an encoding failure, but the video plays fine if I pull the actual encoded file. The logs show no errors and look the same as for successful chunks: Retries is 0, Status is Fail, and Total run time is also 0. But if I click on Encoding file, it plays just fine and everything looks okay. I also never see this with VP9 encoding.

To Reproduce
I'm not sure how to reproduce this. However, I did not see this problem until I moved to a server setup with much higher capacity but also higher disk latency. I've raised the timeouts for Celery to allow each task to run considerably longer, far longer than successful chunks take to encode, but the issue persists.
Expected behavior
I would expect the chunk, which from everything I'm seeing did in fact succeed in encoding, to be marked as Success instead. What I find most odd is that Total run time is 0; it seems like even for a failure, that should be filled in with some kind of value.

Screenshots
Can't share screenshots of this setup, unfortunately.