livepeer / task-runner

Background service that executes tasks from the Livepeer API. Mainly used for VOD.
MIT License

Import tasks getting stuck mid execution #19

Open victorges opened 2 years ago

victorges commented 2 years ago

We have some weird cases of tasks getting stuck mid-execution. These tasks keep getting re-executed, because RabbitMQ eventually times out (after 30m) waiting on an ACK and terminates the connection with the client, nacking all the messages it had in flight. The task is then re-executed as if nothing had happened [1].

[1] This is a separate bug that we should also address. The task should just fail if it had already started running before and then disappeared, which we can already tell from the metadata in the API. It is not the root cause though, so we still need to investigate and fix the stuck tasks.

There are no logs that indicate what is wrong, but I have a light suspicion on either:

The first tasks where I found this error were actually importing large stream recordings, which take 12+ minutes to download on a good connection due to the on-demand MP4 generation bottleneck. That was already weird, since we have a hard timeout of 10 minutes, so the task runner should have just failed the task instead of going silent.

I just found an even weirder case, though. It comes from a regular "import" task, which is not importing a recording but just another asset that the user was using as a test. This is the task:

{
    "id": "51ea2a1e-618e-452d-a024-7c5a0ace266f",
    "type": "import",
    "params": {
        "import": {
            "url": "https://livepeercdn.com/asset/REDACTED/video"
        }
    },
    "status": {
        "phase": "running",
        "progress": 0.649,
        "updatedAt": 1651269956139
    },
    "userId": "REDACTED",
    "createdAt": 1650886712179,
    "outputAssetId": "4582de3b-ead3-4ffe-8b6d-b130f61290a1"
}

The asset is around 5GB and takes less than a minute to download on a good connection, so there's no clear reason why the task-runner is getting stuck.