spring-cloud / spring-cloud-dataflow

A microservices-based Streaming and Batch data processing in Cloud Foundry and Kubernetes
https://dataflow.spring.io
Apache License 2.0

Stopped Tasks show up as running although they've been effectively stopped #5320

Closed juanpablo-santos closed 1 year ago

juanpablo-santos commented 1 year ago

Description: We have some tasks running on a K8s cluster through SCDF 2.10.2. When we request to stop them, the tasks stop and the associated pods are removed, but they still show up as RUNNING on the SCDF dashboard. Our tasks are Spring Batch based and we've added a listener similar to the one depicted at https://stackoverflow.com/q/66110545. Locally the listener seems to work fine, but it seems to be ignored when running on the cluster; in other words, the task still shows up as running although it has effectively been stopped.

Release versions:

{
  "versions": {
    "implementation": {
      "name": "spring-cloud-dataflow-server",
      "version": "2.10.2"
    },
    "core": {
      "name": "Spring Cloud Data Flow Core",
      "version": "2.10.2"
    },
    "dashboard": {
      "name": "Spring Cloud Dataflow UI",
      "version": "3.3.2"
    },
    "shell": {
      "name": "Spring Cloud Data Flow Shell",
      "version": "2.10.2",
      "url": "https://repo.maven.apache.org/maven2/org/springframework/cloud/spring-cloud-dataflow-shell/2.10.2/spring-cloud-dataflow-shell-2.10.2.jar"
    }
  },
  "features": {
    "streams": true,
    "tasks": true,
    "schedules": true,
    "monitoringDashboardType": "NONE"
  },
  "runtimeEnvironment": {
    "appDeployer": {
      "deployerImplementationVersion": "2.9.1",
      "deployerName": "Spring Cloud Skipper Server",
      "deployerSpiVersion": "2.9.2",
      "javaVersion": "1.8.0_362",
      "platformApiVersion": "",
      "platformClientVersion": "",
      "platformHostVersion": "",
      "platformSpecificInfo": {
        "default": "kubernetes"
      },
      "platformType": "Skipper Managed",
      "springBootVersion": "2.7.9",
      "springVersion": "5.3.25"
    },
    "taskLaunchers": [
      {
        "deployerImplementationVersion": "2.8.2",
        "deployerName": "KubernetesTaskLauncher",
        "deployerSpiVersion": "2.8.2",
        "javaVersion": "1.8.0_362",
        "platformApiVersion": "v1",
        "platformClientVersion": "unknown",
        "platformHostVersion": "unknown",
        "platformSpecificInfo": {
          "namespace": "scdf",
          "master-url": "https://10.96.0.1:443/"
        },
        "platformType": "Kubernetes",
        "springBootVersion": "2.7.9",
        "springVersion": "5.3.25"
      },
      {
        "deployerImplementationVersion": "2.8.2",
        "deployerName": "KubernetesTaskLauncher",
        "deployerSpiVersion": "2.8.2",
        "javaVersion": "1.8.0_362",
        "platformApiVersion": "v1",
        "platformClientVersion": "unknown",
        "platformHostVersion": "unknown",
        "platformSpecificInfo": {
          "namespace": "scdf",
          "master-url": "https://10.7.251.143:6443"
        },
        "platformType": "Kubernetes",
        "springBootVersion": "2.7.9",
        "springVersion": "5.3.25"
      }
    ]
  },
  "monitoringDashboardInfo": {
    "url": "",
    "source": "default-scdf-source",
    "refreshInterval": 15
  },
  "security": {
    "isAuthentication": false,
    "isAuthenticated": false,
    "username": null,
    "roles": []
  }
}

Custom apps: We're using normal Spring Batch based tasks. We try to shut them down gracefully via a listener, as shown in https://stackoverflow.com/q/66110545, in order to avoid this issue, but haven't had success with it.

Steps to reproduce:

Screenshots: N/A

Additional context: N/A

juanpablo-santos commented 1 year ago

(just to clarify, the database shows that the task is still running, so it doesn't seem to be related to the UI; as soon as we manually fix the appropriate rows, everything is fine again)

cppwfs commented 1 year ago

Hello @juanpablo-santos , Could you provide a sample app that exhibits this behavior? I could not reproduce this issue as you described with a sample app that contained a single job with 2 steps.

juanpablo-santos commented 1 year ago

Hi @cppwfs ,

will work on a sample. Did you run the Spring Batch application inside a Docker container? I suspect the root cause is https://github.com/spring-projects/spring-batch/issues/4023#issuecomment-1525701487 (our Dockerfile calls the entrypoint using exec syntax, so SIGTERM signals should be propagated, although they don't seem to reach the shutdown hook).

thanks in advance

cppwfs commented 1 year ago

@juanpablo-santos I did create an image and deploy it to my kubernetes instance. I think Mahmoud brought up a good point. What is your entrypoint that you are using for your applications?

juanpablo-santos commented 1 year ago

Apologies, badly written. What I was trying to ask was whether the app was run on a platform different from the one hosting the SCDF server; I've stumbled upon some issues with this before and wanted to rule that out.

As for the entrypoint, it is something like

ENTRYPOINT [ "./init.sh" ]

with init.sh being a script ending up in something like

java -cp ${CLASSPATH} ${JAVA_OPTIONS} ${LOGBACK_PARAMS} ${SOME_OTHER_PARAMS} ${START_CLS} $@

cppwfs commented 1 year ago

Are there any exceptions in your logs when you run the app locally? Also look forward to the sample app. Thanks!

juanpablo-santos commented 1 year ago

Hi,

No, locally all is running fine, the hook gets called, etc. I'll begin with the sample app most probably next Monday/Tuesday.

Thanks for your continued support and looking into this :-)

juanpablo-santos commented 1 year ago

Hi @cppwfs ,

happy to say that we've pinpointed the issue, and it doesn't have anything to do with SCDF, but with how the stop signal works its way from Kubernetes down to the Java app. For reference,

With all that in place, the SIGTERM signal reaches the application, our shutdownHook gets executed, etc. However, if you use tini, this signaling won't stop the pod from dying after the usual 30 seconds, possibly leaving the application in RUNNING state if your graceful shutdown takes longer than that to finish. You'll have to either use pid1 instead of tini, which allows a timeout, or use the new terminationGracePeriodSeconds parameter introduced in SCDF 2.10.3. We'll be going the latter way, so we're waiting for the 2.10.3 version of the Helm charts to be released by the Bitnami team.
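One common way a SIGTERM gets lost between Kubernetes and the JVM is a wrapper script that starts the application as a child process instead of replacing itself with it. The following is a minimal sketch of that difference, not the actual init.sh from this thread: the two script names are hypothetical, and an inner `sh -c` stands in for the java command. It shows only that with `exec` the application keeps the wrapper's PID (the PID that receives the signal), while without it the application runs under a different PID.

```shell
#!/bin/sh
# Sketch only: no_exec.sh and with_exec.sh are hypothetical stand-ins for an
# entrypoint script, and the inner "sh -c" stands in for the java command.
workdir=$(mktemp -d)

# Without exec: the wrapper shell stays alive and the "app" runs as its child,
# so a signal sent to the wrapper's PID is not delivered to the app by default.
cat > "$workdir/no_exec.sh" <<'EOF'
#!/bin/sh
echo "wrapper=$$"
sh -c 'echo "app=$$"'
echo "wrapper still running after app exit"
EOF

# With exec: the wrapper is replaced by the "app", which keeps the same PID
# and therefore receives signals sent to that PID directly.
cat > "$workdir/with_exec.sh" <<'EOF'
#!/bin/sh
echo "wrapper=$$"
exec sh -c 'echo "app=$$"'
EOF

no_exec_out=$(sh "$workdir/no_exec.sh")
with_exec_out=$(sh "$workdir/with_exec.sh")

# Helper: print the value of the "<key>=<pid>" line from captured output.
pid_of() { printf '%s\n' "$1" | sed -n "s/^$2=//p"; }

no_exec_wrapper=$(pid_of "$no_exec_out" wrapper)
no_exec_app=$(pid_of "$no_exec_out" app)
with_exec_wrapper=$(pid_of "$with_exec_out" wrapper)
with_exec_app=$(pid_of "$with_exec_out" app)

echo "without exec: wrapper=$no_exec_wrapper app=$no_exec_app"
echo "with exec:    wrapper=$with_exec_wrapper app=$with_exec_app"

rm -rf "$workdir"
```

Applied to an entrypoint script like the one quoted earlier in the thread, this would amount to prefixing the final java line with `exec`, assuming nothing needs to run in the script after the JVM exits.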

None of the above is specific to SCDF, but it would be very useful to have a small section in the documentation covering it, although I don't know where the best place for it would be. In our case, this article was a life saver and pointed us in the right direction.

Last but not least, thank you again for your continued support and for looking into this issue, I'll proceed with closing the issue.

cppwfs commented 1 year ago

I'm so glad y'all found the solution and thank you for sharing!