We have some weird cases of tasks getting stuck mid-execution. These tasks just keep getting
re-executed, since RabbitMQ will eventually time out (after 30m) waiting on an ACK and
terminate the connection with the client (nacking all messages it had in-flight). The task is then
re-executed as if nothing happened [1].
[1] This is also another bug that we should address: the runner should just fail the task if it had
already tried running before and then disappeared, which we can already tell from the metadata in the
API. This is not the root cause though, so we still need to investigate and fix the stuck tasks.
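As a rough sketch of what that fail-fast guard could look like, assuming the runner consumes via github.com/rabbitmq/amqp091-go (hasRunBefore and the task-ID-in-MessageId convention are made up for illustration, not the actual task API):

```go
// Hypothetical guard: fail fast instead of silently re-running a task whose
// previous attempt disappeared. Assumes amqp091-go for consumption and a
// made-up hasRunBefore() lookup against the task API metadata.
package runner

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// hasRunBefore is a stand-in for querying the task API metadata to see
// whether this task already had an attempt that never reported back.
func hasRunBefore(taskID string) bool { return false /* placeholder */ }

func handleDelivery(d amqp.Delivery) {
	taskID := d.MessageId // assumption: the task ID travels in the message ID

	// Redelivered is set by RabbitMQ when the message was already handed to a
	// consumer whose channel/connection closed without an ack.
	if d.Redelivered && hasRunBefore(taskID) {
		log.Printf("task %s: previous attempt vanished, failing instead of retrying", taskID)
		d.Ack(false) // drop the message; mark the task failed via the API instead
		return
	}

	// ... normal task execution, followed by d.Ack(false) on success ...
}
```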
No logs indicate what is wrong, but I have a light suspicion on either:
- the "progress reporting" logic
- the "piping" logic in the import task, which sends a stream both to ffprobe and to storage (see the sketch after this list)
- the S3 upload client
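To make the piping suspicion concrete, here is a minimal sketch of that kind of fan-out; importAsset, the ffprobe arguments, and the storage writer are assumptions for illustration, not the real import code. The point is that with a MultiWriter, a single consumer that stops draining its end (an upload that stalls, or ffprobe blocking without exiting) stalls the whole copy with no error and no log line, which would look exactly like a task gone silent:

```go
// Fan-out sketch: one source stream copied to both ffprobe's stdin and the
// storage upload. Every byte must be accepted by BOTH writers before the next
// read happens, so one stuck consumer blocks the entire pipeline.
package runner

import (
	"context"
	"io"
	"os/exec"
)

func importAsset(ctx context.Context, source io.Reader, storage io.WriteCloser) error {
	probe := exec.CommandContext(ctx, "ffprobe", "-show_format", "-of", "json", "pipe:0")
	probeIn, err := probe.StdinPipe()
	if err != nil {
		return err
	}
	if err := probe.Start(); err != nil {
		return err
	}

	// If either probeIn or storage stops accepting writes, this io.Copy never
	// returns and nothing gets logged.
	_, copyErr := io.Copy(io.MultiWriter(probeIn, storage), source)

	probeIn.Close()
	storage.Close()

	if waitErr := probe.Wait(); copyErr == nil {
		copyErr = waitErr
	}
	return copyErr
}
```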
On the first tasks where I found this error, they were importing large stream recordings which take
12+ minutes to download even on a good connection, due to the on-demand MP4 generation bottleneck.
That was already weird, since we have a hard timeout of 10 minutes, so the task runner
should have just failed the task instead of going silent.
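One way a hard timeout can still go silent, sketched under the assumption that the 10-minute limit is enforced through a context (not necessarily how the task runner actually does it):

```go
// If the task blocks inside I/O that never observes ctx.Done() (e.g. an
// io.Copy on a hung stream), the deadline expires but nothing ever returns,
// so nobody fails the task or nacks the message; the broker's 30m ack
// timeout fires instead.
package runner

import (
	"context"
	"time"
)

func runTask(task func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Synchronous wait: if task() ignores cancellation, this never returns.
	return task(ctx)
}
```

If the runner instead waited on a completion channel with a select against ctx.Done(), it could at least report the timeout and nack the message, which is probably the behavior we expected here.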
Right now I just found an even weirder case though. It came from a regular "import" task, which is not
importing a recording but just another asset that the user was importing as a test. This is the
task:
The asset is around 5GB and takes less than a minute to download on a good connection, so there's
no clear reason why the task runner is getting stuck.