Implements resume of container for the Docker task runner

nidomiro commented 3 months ago

Feature description

If Kestra is restarted during a flow execution, the execution is in a kind of undefined state once Kestra is up again. To be completely clear, I am not talking about Kestra being unable to start new flow executions while it is shut down; that would simply be impossible. I'm talking about Kestra saving a reference to the Docker container it started for the flow (or flow step) execution and checking its logs, status, and so on at startup.

Also, the Docker containers of flows that were executing during the Kestra restart are not cleaned up. I think this will resolve itself once the first problem is fixed.

Steps to reproduce:

1. Open a terminal in the directory containing the docker-compose file.
2. Run docker compose up.
3. In the browser, navigate to the Kestra instance and create a new flow with the content below.
4. Execute the flow and observe its logs.
5. When you see the message starting with "Starting Job", go to the terminal running the Kestra instance and press CTRL + C to shut it down.
6. Wait at least 30 seconds. (That's what I did, but it may not be necessary.)
7. Start Kestra again and observe the execution you triggered before shutting down.

The job will appear to do something, but in fact does nothing. After about 6 minutes the job finishes successfully, but as the logs show, it was started twice. The "Attempts" counter is still at one.

The test-flow

id: selfupdate
namespace: dev
description: Update Kestra itself with docker-compose

labels:
  env: dev

tasks:
  - id: pull_images
    type: io.kestra.plugin.scripts.shell.Commands
    commands:
      - export EXECUTION_DATE=$(date) # keep the date constant, so we can track if the first or the second start finished
      - echo Starting Job $EXECUTION_DATE
      - sleep 30
      - echo Finished Job $EXECUTION_DATE

The logs

2024-06-17T08:32:31.407Z DEBUG Image pulled [ubuntu:latest]
2024-06-17T08:32:31.589Z DEBUG Starting command with container id ec04b5892a4f514645efe8f4d7e2d00d04786d8aec1047afa198e6af1892829b [/bin/sh -c set -e
export EXECUTION_DATE=$(date)
echo Starting Job $EXECUTION_DATE
sleep 30
echo Finished Job $EXECUTION_DATE]
2024-06-17T08:32:31.592Z INFO Starting Job Mon Jun 17 08:32:31 UTC 2024
2024-06-17T08:37:53.232Z DEBUG Image pulled [ubuntu:latest]
2024-06-17T08:37:53.420Z DEBUG Starting command with container id 84b294832e5e3b213c49f1dcef08baaa562a35be7524513ad055fa5b47f5c352 [/bin/sh -c set -e
export EXECUTION_DATE=$(date)
echo Starting Job $EXECUTION_DATE
sleep 30
echo Finished Job $EXECUTION_DATE]
2024-06-17T08:37:53.425Z INFO Starting Job Mon Jun 17 08:37:53 UTC 2024
2024-06-17T08:38:23.421Z INFO Finished Job Mon Jun 17 08:37:53 UTC 2024
2024-06-17T08:38:23.576Z DEBUG Command succeed with code 0

Origin: https://kestra-io.slack.com/archives/C03FEC452NQ/p1719171554720979

loicmathieu commented 3 months ago

Script tasks use task runners, and not all task runners support resuming a previous execution.

The Docker task runner does not support it because, in a typical production environment such as Kestra deployed on Kubernetes, you usually use Docker-in-Docker. The script task container is then stopped whenever the Kestra container is stopped, since both run in the same pod.

However, since Docker supports labels, we can implement resuming containers the same way we did for Kubernetes. We need to decide whether it should be enabled by default.
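
To make the label idea concrete, here is a minimal sketch, not Kestra's actual implementation. It only builds the docker CLI arguments that would tag a container with a task run id and later look it up again. The label key `kestra.io/taskrun-id` and all function names are assumptions made for illustration; `--label`, `--filter label=...`, `--all` and `--quiet` are standard docker CLI flags.

```python
from typing import List, Optional

# Hypothetical label key, chosen for illustration only.
TASKRUN_LABEL = "kestra.io/taskrun-id"


def run_args(image: str, taskrun_id: str, command: str) -> List[str]:
    """Build `docker run` arguments that tag the container with the task run id."""
    return [
        "docker", "run", "--detach",
        "--label", f"{TASKRUN_LABEL}={taskrun_id}",
        image, "/bin/sh", "-c", command,
    ]


def lookup_args(taskrun_id: str) -> List[str]:
    """Build `docker ps` arguments that list containers (running or exited)
    carrying the label for this task run, printing only their ids."""
    return [
        "docker", "ps", "--all", "--quiet",
        "--filter", f"label={TASKRUN_LABEL}={taskrun_id}",
    ]


def first_container_id(ps_output: str) -> Optional[str]:
    """Parse `docker ps --quiet` output (one id per line) and return the
    first id, or None when no labelled container was found."""
    ids = [line.strip() for line in ps_output.splitlines() if line.strip()]
    return ids[0] if ids else None
```

On restart, a worker could run the lookup command for each task run it finds in a running state and, if an id comes back, attach to that container instead of starting a second one.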

tchiotludo commented 3 months ago

Quick thought: we could add a plugin configuration option that enables this resume but is disabled by default. People running a single instance or a single worker could rely on it, but since Kestra is designed for multi-node deployments, there is no reason to enable it by default.
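
A rough sketch of what such an opt-in switch could look like in the decision logic. The option name `resume` and the helper `next_action` are hypothetical and only illustrate the "disabled by default" behaviour described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DockerRunnerConfig:
    # Hypothetical option name; off by default, as proposed in this comment.
    resume: bool = False


def next_action(config: DockerRunnerConfig, existing_container_id: Optional[str]) -> str:
    """Decide what to do for a task run after a restart.

    Returns "attach" only when resume is enabled and a container from the
    previous attempt was found; otherwise a fresh container is started.
    """
    if config.resume and existing_container_id:
        return "attach"
    return "start"
```

With the default configuration the behaviour stays as it is today (a new container is started), so multi-node setups are unaffected unless they opt in.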
