kestra-io / kestra

:zap: Workflow Automation Platform. Orchestrate & Schedule code in any language, run anywhere, 500+ plugins. Alternative to Zapier, Rundeck, Camunda, Airflow...
https://kestra.io
Apache License 2.0

Webserver Crashes with - Thread starvation or clock leap detected #4754

Open · johnkm516 opened this issue 3 months ago

johnkm516 commented 3 months ago

Describe the issue

Not sure if this is an infrastructure issue or an issue with the Kestra webserver, but in Portainer, which I use to manage my Docker Swarm, I am seeing several crashes of the webserver with this log:

2024-08-26 08:16:59,147 INFO  default-nioEventLoopGroup-1-30 io.kestra.webserver.access 2024-08-26T08:16:59.140Z | POST /api/v1/filaholdings-dev/stats/executions/daily HTTP/1.1 | status: 200 | ip: 10.20.20.23 | length: 332 | duration: 6
2024-08-26 08:16:59,258 INFO  default-nioEventLoopGroup-1-30 io.kestra.webserver.access 2024-08-26T08:16:59.252Z | GET /api/v1/filaholdings-dev/executions/search?size=25&page=1&sort=state.startDate:desc&namespace=erp.scm.bat&flowId=BAT_SCM_EI_CARGOTRACKING_M HTTP/1.1 | status: 200 | ip: 10.20.20.23 | length: 838 | duration: 5
2024-08-26 08:17:01,996 INFO  default-nioEventLoopGroup-1-30 io.kestra.webserver.access 2024-08-26T08:17:01.987Z | GET /api/v1/filaholdings-dev/executions/5T5w8ExoFeefTZy0dF9qml/follow HTTP/1.1 | status: 200 | ip: 10.20.20.23 | length: 664 | duration: 9
2024-08-26 08:17:03,765 INFO  default-nioEventLoopGroup-1-29 io.kestra.webserver.access 2024-08-26T08:17:03.750Z | GET /api/v1/filaholdings-dev/executions/5T5w8ExoFeefTZy0dF9qml/graph HTTP/1.1 | status: 200 | ip: 10.20.20.23 | length: 858 | duration: 15
2024-08-26 08:18:06,313 WARN  HikariPool-1 housekeeper com.zaxxer.hikari.pool.HikariPool HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=47s41ms248µs101ns).

My Docker Swarm deployment looks like this:

(screenshot: Docker Swarm deployment overview)

The webserver component has a Java heap space of 12G.

I looked around, and the most common cause of this error is excessive garbage collection on the JVM. Select queries issued by the webserver might be too large in some cases when accessing certain pages in the frontend, causing the error. The problem is that the issue is difficult to track down because the log is not very helpful.
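If excessive GC is indeed the cause, one thing I could try is enabling GC logging on the webserver container. A minimal sketch, assuming the flags are appended to the JAVA_OPTS environment variable of the webserver service in my compose file (the log path is illustrative):

    # sketch: existing heap settings plus GC logging; log path is illustrative
    environment:
      JAVA_OPTS: >-
        -Xms12g -Xmx12g
        -Xlog:gc*:file=/tmp/kestra-webserver-gc.log:time,uptime,level,tags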

Environment

loicmathieu commented 3 months ago

Hi, this is a warning that should not crash the webserver, but it may be a sign of something wrong going on ... or an artefact of some VM or OS doing clock alignment (which can occur due to virtualization or NTP adjustments). It can also be caused by the CPU being too busy for the housekeeper thread of the database connection pool to do its job.

Anyway, this should not cause the webserver to crash, so we first need to understand what caused it to stop working. Do you have any error logs at the time of the crash? Can you check CPU and memory consumption at the time of the crash? I'm not familiar with Docker Swarm, but is there a way to retrieve the container exit code and reason?

johnkm516 commented 3 months ago

Hi @loicmathieu,

My first thought was lack of resources, which is why I gave the webserver more than enough heap space (12G). The VM monitoring solution that we use shows no signs of CPU / RAM shortage, and we have alerts in place that fire on high CPU / RAM utilization.

As for the individual container, I use Portainer to manage and view the resource utilization of each container. Unfortunately, unless I'm sitting here watching the running webserver at the time of the crash, Portainer does not save a snapshot of resource utilization for containers that have already crashed, so I have no way of knowing what the utilization was. Here's the exit code of the container:

(screenshot: exit codes of the crashed webserver containers)

In terms of logs, unfortunately the single log line I showed in my OP is all that I see. I can send you the full logs, but they're all normal API call logs; the only anomaly is the single line showing the clock leap.

I'm a bit skeptical that resources are the problem, as the VM that the webserver is deployed on has plenty of resources, and I don't think the 12G of heap space I gave the webserver should be lacking. Another thing to consider is that I have lots of other containers running on the same VM, including an Executor component which I run on all 4 VMs, and the other containers have never crashed. I would see it as a VM error if all containers running on a specific VM crashed, but this is not the case. In fact, I have two instances of the webserver running on two different VMs, and as you can see from the screenshot above, both webservers crashed with the same error, on two different hosts, while all other containers running on those two hosts are fine.

I wish the log were more helpful... Is there anything I can do to make it more verbose? Here's my Kestra config for logs:

        logger:
          levels:
            io.kestra.runner: TRACE
            io.netty.handler.logging: DEBUG
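Would raising the levels on the packages that actually show up in the log help? Something like this, for example (just a guess at which packages matter):

        logger:
          levels:
            io.kestra.runner: TRACE
            io.kestra.webserver: DEBUG      # the access-log package seen above
            com.zaxxer.hikari: DEBUG        # the connection pool emitting the warning
            io.netty.handler.logging: DEBUG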

Edit: I checked the exit code, and it signifies out of memory. I'm not sure how, though? All of the hosts have 64GB of RAM and the webserver component itself has 12G of heap space, which should be plenty. What is there on the webserver that could use up so much memory?
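One candidate I can think of is off-heap memory: as far as I understand, Netty uses direct (off-heap) buffers, and by default the JVM allows roughly as much direct memory as heap, so -Xmx alone may not bound the container's total footprint. If that's the case, something like this could cap it and capture a dump when memory runs out (a sketch; sizes and paths are illustrative):

    # sketch: cap direct memory and dump the heap on OOM; values are illustrative
    environment:
      JAVA_OPTS: >-
        -Xms12g -Xmx12g
        -XX:MaxDirectMemorySize=2g
        -XX:+HeapDumpOnOutOfMemoryError
        -XX:HeapDumpPath=/tmp/kestra-webserver.hprof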

loicmathieu commented 3 months ago

Exit code 137 means that something sent a SIGKILL to the container, so the container was killed by Swarm. You should look at Swarm to understand why it killed the container (the usual suspect is the container using more memory than it reserves, but it may be caused by something else).
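For example (I'm not a Swarm expert, so treat this as a sketch with illustrative values), giving the service an explicit memory limit and reservation in its deploy section should at least make the kill condition explicit; the limit needs headroom above the 12G heap for off-heap memory:

    deploy:
      resources:
        limits:
          memory: 16G
        reservations:
          memory: 14G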

We have an ongoing issue with webserver memory pressure when using inputs; we have sporadic reports of such issues but didn't succeed in reproducing it. Nobody has reported it recently, so we thought it had been fixed by some library update, see https://github.com/kestra-io/kestra/issues/3402.

johnkm516 commented 2 months ago

Hi @loicmathieu ,

I've added

  micronaut:
    server:
      max-request-size: 4GB
      netty:
        server-type: full_content

to the webserver deployment. Just to be clear, the 4GB limit is per request, right? I don't think I'll ever have any request larger than 4GB, so it should be fine as long as this workaround stops the webserver from crashing.

I'll report back in a few days and see if this resolves the issue. This issue has crashed the webserver pretty consistently over the last few days (I still don't know the cause), so if I don't see it for a few days I can confidently say it's fixed by the above.

loicmathieu commented 2 months ago

Ok, let me know how it goes. Do you use webhooks with the POST HTTP method and a payload? We recently discovered and fixed a bug in a library we use that could cause the same behavior.

johnkm516 commented 2 months ago

Hi @loicmathieu ,

Having a bit of trouble deploying with the above settings. I tried doing

  micronaut:
    server:
      max-request-size: 4GB
      netty:
        server-type: full_content

and it wouldn't let me log in; the response for login was returning 415 request size limit exceeded. I was guessing that perhaps setting max-request-size overrides other Micronaut defaults, so I tried deploying with this:

        micronaut:
          server:
            max-request-size: 10GB
            multipart:
              max-file-size: 10GB
              disk: true
            read-idle-timeout: 60m
            write-idle-timeout: 60m
            idle-timeout: 60m
            netty:
              max-chunk-size: 10MB
              server-type: full_content

which are the defaults shown in the Kestra documentation, but this config doesn't even deploy; the webserver fails with the following error:

Caused by: io.micronaut.context.exceptions.BeanInstantiationException: Error instantiating bean of type [io.micronaut.security.oauth2.client.DefaultOpenIdProviderMetadata]
Message: maxContentLength : -2147483648 (expected: >= 0)
Path Taken: OpenIdClientFactory.openIdConfiguration(OauthClientConfiguration oauthClientConfiguration,OpenIdClientConfiguration openIdClientConfiguration,OpenIdProviderMetadataFetcher openIdProviderMetadataFetcher)
    at io.micronaut.context.DefaultBeanContext.resolveByBeanFactory(DefaultBeanContext.java:2343)
    at io.micronaut.context.DefaultBeanContext.doCreateBean(DefaultBeanContext.java:2298)
    at io.micronaut.context.DefaultBeanContext.doCreateBean(DefaultBeanContext.java:2310)
    at io.micronaut.context.DefaultBeanContext.createRegistration(DefaultBeanContext.java:3120)
    at io.micronaut.context.SingletonScope.getOrCreate(SingletonScope.java:80)
    at io.micronaut.context.DefaultBeanContext.findOrCreateSingletonBeanRegistration(DefaultBeanContext.java:3022)
    at io.micronaut.context.DefaultBeanContext.initializeEagerBean(DefaultBeanContext.java:2686)
    at io.micronaut.context.DefaultBeanContext.initializeContext(DefaultBeanContext.java:1993)
    ... 7 more
Caused by: java.lang.IllegalArgumentException: maxContentLength : -2147483648 (expected: >= 0)
    at io.netty.util.internal.ObjectUtil.checkPositiveOrZero(ObjectUtil.java:144)
    at io.netty.handler.codec.MessageAggregator.validateMaxContentLength(MessageAggregator.java:88)
    at io.netty.handler.codec.MessageAggregator.<init>(MessageAggregator.java:77)
    at io.netty.handler.codec.http.HttpObjectAggregator.<init>(HttpObjectAggregator.java:128)
    at io.netty.handler.codec.http.HttpObjectAggregator.<init>(HttpObjectAggregator.java:115)
    at io.micronaut.http.client.netty.DefaultHttpClient$3.<init>(DefaultHttpClient.java:1111)
    at io.micronaut.http.client.netty.DefaultHttpClient.lambda$exchangeImpl$27(DefaultHttpClient.java:1111)
    at reactor.core.publisher.MonoFlatMapMany$FlatMapManyMain.onNext(MonoFlatMapMany.java:163)
    at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
    at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
    at io.micronaut.http.client.netty.CancellableMonoSink.tryForward(CancellableMonoSink.java:69)
    at io.micronaut.http.client.netty.CancellableMonoSink.tryEmitValue(CancellableMonoSink.java:86)
    at io.micronaut.http.client.netty.ConnectionManager$Pool$ConnectionHolder.emitPoolHandle(ConnectionManager.java:1215)
    at io.micronaut.http.client.netty.ConnectionManager$Pool$Http1ConnectionHolder.dispatch0(Connec

So I tried modifying the config to this:

        micronaut:
          http:
            client:
               max-content-length: 2GB
          server:
            max-request-size: 10GB
            multipart:
              max-file-size: 10GB
              disk: true
            read-idle-timeout: 60m
            write-idle-timeout: 60m
            idle-timeout: 60m
            netty:
              max-chunk-size: 10MB
              server-type: full_content

and I still get the same error logs. Do you have any recommendations for the config?

loicmathieu commented 2 months ago

When using server-type: full_content, the request is fully loaded into memory, including all multipart parts if any, so it cannot be more than 2GB; I strongly advise setting it to less.
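Also, the maxContentLength : -2147483648 error is the typical signature of a size that doesn't fit in a 32-bit integer: 10GB (and even 2GB, if parsed as 2 × 1024³ bytes) wraps around to exactly -2147483648, so all of these limits need to stay strictly below 2GB. A sketch that stays under it (the values are illustrative, adjust to your needs):

        micronaut:
          http:
            client:
              max-content-length: 1GB     # anything >= 2GB overflows the 32-bit limit
          server:
            max-request-size: 1GB
            multipart:
              max-file-size: 1GB
              disk: true
            netty:
              max-chunk-size: 10MB
              server-type: full_content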

Can you respond to my question about webhook? It may be the cause:

> Do you use webhooks with the POST HTTP method and a payload?

johnkm516 commented 2 months ago

> When using server-type: full_content, the request is fully loaded into memory, including all multipart parts if any, so it cannot be more than 2GB; I strongly advise setting it to less.
>
> Can you respond to my question about webhook? It may be the cause:
>
> Do you use webhooks with the POST HTTP method and a payload?

No, I do not have any webhook triggers currently.

This isn't a guaranteed reproduction, but from what I've noticed, the webserver seems to crash most often when loading the Executions page, particularly when there are dozens of executions being triggered at once (while on the Executions page?).

I have a lot of DB-to-DB syncs which have SQL triggers, polling every 3 seconds. When the flow is triggered, it invokes subflows for each row. This is fine when the flow executes properly and changes the trigger flag so the execution doesn't trigger again, but if the flow fails, the trigger will keep triggering and failing every 3 seconds (concurrency is set to 1), which can lead to dozens of executions being added every 3 seconds. The clock leap seems to occur quite consistently in this case.

In short:

1) Create any flow that triggers 100+ flows every 3 seconds; preferably the trigger should contain some inputs.
2) Go to the Executions page, it'll load very slowly.
3) Spam refresh on the Executions page, to simulate multiple users accessing the page.
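For reference, the flows described above look roughly like this (a simplified sketch, not one of my actual flows; the JDBC plugin type, connection details, and query are placeholders):

id: db-sync-example
namespace: company.team

# only one execution at a time; while one is running or failing, new triggers pile up behind it
concurrency:
  limit: 1

triggers:
  - id: poll
    type: io.kestra.plugin.jdbc.postgresql.Trigger
    url: jdbc:postgresql://db-host:5432/erp
    username: "*****"
    password: "*****"
    sql: select * from sync_queue where sync_flag = 'PENDING'
    fetch: true
    interval: PT3S

tasks:
  - id: each_row
    type: io.kestra.plugin.core.flow.ForEach
    values: "{{ trigger.rows }}"
    tasks:
      - id: call_subflow
        type: io.kestra.plugin.core.flow.Subflow
        namespace: company.team
        flowId: row-handler
        inputs:
          row: "{{ taskrun.value }}"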

I'll try to redeploy with the limit set to 2GB.

loicmathieu commented 2 months ago

@johnkm516 thanks for the additional information, it'll be helpful

loicmathieu commented 2 months ago

@johnkm516 I tried to reproduce the issue by adding multiple triggers that, every 3s, generate 100 flows with an input each. I'm using this flow, which I replicated 5 times, so every 3s, 500 subflows are triggered:

id: subflow-spammer
namespace: company.team

triggers:
  - id: jdbc
    type: io.kestra.plugin.jdbc.postgresql.Trigger
    url: jdbc:postgresql://localhost:5432/kestra_unit
    username: *****
    password: *****
    sql: select namespace, flow_id, state_current, state_duration, start_date, end_date from executions limit 100
    fetch: true
    interval: PT3S

tasks:
  - id: each
    type: io.kestra.plugin.core.flow.ForEach
    values: "{{ trigger.rows }}"
    tasks:
    - id: hello
      type: io.kestra.plugin.core.flow.Subflow
      namespace: company.team
      flowId: subbflow
      wait: false
      inputs:
        value: "{{taskrun.value}}"

I then refreshed the executions page a lot without any issues. So unless you're able to provide a reproducer with a set of flows, I don't know how we can make any progress.

johnkm516 commented 2 months ago

It's happening pretty consistently for me:

(screenshot: repeated webserver crashes)

I'll try to get a more reliable reproducer for you.

loicmathieu commented 2 months ago

How many executions are you creating per hour? How much CPU and memory does the webserver have (and how many replicas, if you start more than one)?

johnkm516 commented 2 months ago

Taking the number of executions in the last 7 days and dividing to get the average:

398746 executions / 7 days / 24 hours = 2373.488 executions per hour. The average runtime of these flows is about 5~10 seconds; batch jobs run longer, but they are simply remote executions of JAR files via SSH.

There are no memory or CPU limits in the docker compose YAML of my deployment. The limit is set by the JAVA_OPTS variable instead. Here is the deploy portion of my docker compose for the webserver component:

    deploy:
      mode: replicated
      replicas: 2
      placement:
        max_replicas_per_node: 1
        constraints:
          - node.labels.kestra!=worker

Here is the relevant env variable:

    environment:
      JAVA_OPTS: "-Xms12g -Xmx12g"

AFAIK, if I don't set any CPU or memory limits in the deploy section, the container is free to use whatever resources are available. As per the OP, each host has 64GB of RAM and 16 cores to use. I set the Java heap space well below the 64GB in case my MinIO uses a lot of memory, though the MinIO deployment itself is used exclusively by Kestra.

I see no issues in terms of performance, errors, or crashes with any other container on the same host when the webserver crashes. It's exclusively the webserver that crashes with exit code 137; other crashes on the executor and scheduler are for different reasons (exit code 0 or 1), such as DB deadlock errors, which I reported separately and are not relevant.

loicmathieu commented 2 months ago

@johnkm516 thanks a lot for all this information.

johnkm516 commented 2 months ago

@loicmathieu @tchiotludo I FINALLY FOUND A REPRODUCER!!!

I spent like an hour recording myself messing around in the UI, and tracked it down to the following steps:

1) Go to the Executions page.
2) Click on any Running execution.
3) Go to the Topology page.
4) If the execution has not finished by the time you load the Topology page, the Topology page will consistently show Axios timeout errors.
5) Reload the Topology page. Either step 4 or step 5 will cause Axios timeout errors.
6) Go to the Executions page.

If the webserver does not crash right away, repeat steps 1-6 until it does. It only takes one or two tries for me to consistently get it to crash with exit code 137.

It might not be only the Topology page that causes the crash. From my experience, I think any time an Axios timeout error occurs it causes a crash on the webserver, and in general the error occurs when navigating to a tab of any running execution. I found the Topology tab to be the most consistent in causing the crash, but I've seen the Overview, Gantt, and Logs tabs also cause a crash, just not as consistently.

Here is a video:

(video: screen recording of the crash)

loicmathieu commented 2 months ago

@johnkm516 very interesting! Can you check if you have any logs on the server at the time of the timeout (or shortly after)? You may search for a broken pipe message or something like that.

It may be caused by some resources not being correctly cleaned up on the webserver when the UI breaks the connection.

johnkm516 commented 2 months ago

No helpful server logs, unfortunately. Same as the OP, all I see is an exit code 137 along with one relevant log line saying

Thread starvation or clock leap detected

johnkm516 commented 1 month ago

@loicmathieu Were you able to reproduce?

loicmathieu commented 1 month ago

@johnkm516 no, we cannot reproduce it on our side.

loicmathieu commented 18 hours ago

Hi @johnkm516, we made a bunch of fixes and improvements that may impact memory usage in the latest 0.19 version. Can you give feedback on whether or not you still see these issues?