kestra-io / kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
https://kestra.io
Apache License 2.0

Webserver crashes after a few API calls from an external server after upgrade to 0.15.8 #3402

Open aballiet opened 5 months ago

aballiet commented 5 months ago

Describe the issue

The webserver was working fine in 0.14.4. After upgrading to 0.15.8, the webserver systematically crashes after a few API calls (2 calls).

We tried a number of things and noticed that the server struggles overall.

Here is an example of a flow that creates a lot of scaling events on the Kestra server and can make it crash:

id: kestra-webserver-crasher
namespace: dev
tasks:
- id: spike-webserver-memory
  type: io.kestra.plugin.scripts.python.Script
  runner: PROCESS
  beforeCommands:
  - "pip install requests > /dev/null"
  warningOnStdErr: false
  docker:
    pullPolicy: ALWAYS
  script: |
    import requests

    BASE_SERVICE_URL = "http://kestra-service.kestra:8080"  # no trailing slash; the paths below add their own

    # Search executions (first 20 results)
    search = {"size":20}
    response = requests.get(f"{BASE_SERVICE_URL}/api/v1/executions/search", params=search)
    total = response.json()["total"]
    if total == 0:
        print("No executions currently running")
    else:
        print(f"{total} executions to fetch logs for")
        executions = [
            {"id": result["id"], "flow": result["flowId"], "namespace": result["namespace"]}
            for result in response.json()["results"]
        ]
        for execution in executions:
            id = execution["id"]
            log_search = {"q": ""}
            logs = requests.get(f"{BASE_SERVICE_URL}/api/v1/logs/{id}", params=log_search)
            print(logs)

Environment

loicmathieu commented 5 months ago

Hi, do you have any logs from the Kestra webserver side?

aballiet commented 5 months ago

Updated the issue @loicmathieu

loicmathieu commented 5 months ago

There is a workaround with the following configuration:

  micronaut:
    server:
      max-request-size: 1GB
      netty:
        server-type: full_content

However, it limits the request size to 1GB and silently ignores bigger requests, which can be an issue if you start a flow with a total input size of more than 1GB.

We're still working with the Micronaut team to find a proper fix, see https://github.com/micronaut-projects/micronaut-core/issues/10677
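
For reference, here is a minimal sketch of where this workaround could sit in a standalone Kestra configuration file (application.yaml), assuming the usual layout where the micronaut block lives at the top level next to the kestra block; the kestra keys below are illustrative placeholders for an existing setup:

kestra:
  repository:
    type: postgres   # illustrative placeholder
  queue:
    type: postgres   # illustrative placeholder
  storage:
    type: local      # illustrative placeholder
micronaut:
  server:
    max-request-size: 1GB
    netty:
      server-type: full_content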

tchiotludo commented 3 months ago

@aballiet do you still have this issue?

yuri1969 commented 1 month ago

Just leaving a note here - v0.16.10 is also affected.

loicmathieu commented 3 days ago

@aballiet @yuri1969 I just discovered a bug in Micronaut (not easily reproducible) with POST requests that include a payload: when a 404 Not Found is returned, the request hangs, leading to memory/CPU issues. I wonder if it could be the cause.

So, do you use any webhook that you call with a POST method and a payload? Do you sometimes, inadvertently, call it with a non-existent flow or webhook key?

yuri1969 commented 3 days ago

@loicmathieu As far as our deployment is concerned, it starts executions exclusively by POSTing to /api/v1/executions/<namespace>/<flow_id>; no webhooks are used. Nearly every flow defines two required inputs of the FILE type.

Occasional misconfiguration incidents lead to POSTing to a nonexistent <namespace>/<flow_id> flow, which might trigger HTTP 404 responses.
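
If it helps reproduction, here is a minimal sketch of a flow that hammers the create-execution endpoint for a nonexistent flow with a multipart payload, to check whether repeated 404 responses make the webserver's memory/CPU grow. The flow id, the in-cluster service URL (copied from the example above), and the multipart field name are assumptions to adapt to your deployment:

id: post-404-repro
namespace: dev
tasks:
- id: post-to-missing-flow
  type: io.kestra.plugin.scripts.python.Script
  runner: PROCESS
  beforeCommands:
  - "pip install requests > /dev/null"
  warningOnStdErr: false
  script: |
    import requests

    # Same in-cluster service URL as the example above; adjust to your deployment
    BASE_SERVICE_URL = "http://kestra-service.kestra:8080"

    # POST a non-trivial multipart payload to a flow that does not exist so the
    # webserver answers 404; watch its memory/CPU while this loop runs.
    # The field name "files" and the 1 MiB body are illustrative only.
    payload = {"files": ("input.bin", b"x" * 1024 * 1024)}
    for i in range(50):
        response = requests.post(
            f"{BASE_SERVICE_URL}/api/v1/executions/does-not-exist/missing-flow",
            files=payload,
        )
        print(i, response.status_code)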

loicmathieu commented 2 days ago

I stumbled upon a bug that occurs when an exception is returned while creating the execution; see https://github.com/kestra-io/kestra/pull/4836.

So if you encounter this issue after multiple calls to the create-execution endpoint for a disabled flow, or for a flow that cannot be read from the database, this will be fixed by the PR. If that's not the case, we will need further investigation.