kestra-io / kestra

:zap: Workflow Automation Platform. Orchestrate & Schedule code in any language, run anywhere, 500+ plugins. Alternative to Zapier, Rundeck, Camunda, Airflow...
https://kestra.io
Apache License 2.0

Large number of triggers and executions leads to performance issues #6024

Open jgoguette opened 1 day ago

jgoguette commented 1 day ago

Describe the issue

When a large number of triggers/executions occurs, the application slows down immensely. Executions are created, but no tasks actually start even when the execution is in the Running state. And if tasks do start, they take a very long time to actually execute.

I've tried setting up concurrency; with that I just get a huge backlog of executions to process.
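
For illustration, the concurrency setup on each flow looks roughly like this (a sketch with placeholder values; type names are those of recent Kestra versions, and the limit of about 125 per flow is detailed further down):

concurrency:
  limit: 125        # roughly the per-flow limit used here
  behavior: QUEUE   # queued executions accumulate in the execution_queued table

triggers:
  - id: webhook
    type: io.kestra.plugin.core.trigger.Webhook
    key: "***"      # placeholder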

This is a screenshot showing the current situation:

Screenshot 2024-11-20 at 10 42 15 AM

Examples showing an execution created hours ago that is currently in the Running state, yet no tasks are created:

Screenshot 2024-11-20 at 11 14 13 AM Screenshot 2024-11-20 at 11 14 25 AM

My current setup for the triggers includes:

Using the pg_stat_statements PostgreSQL extension with the following query:

SELECT
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;
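
(pg_stat_statements accumulates since the last reset, so running SELECT pg_stat_statements_reset(); before a test window and re-running the query afterwards gives cleaner per-load numbers.)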

I was able to identify the slowest queries, which you can find in the Google Sheet linked HERE

Screenshot for quick glance:

Screenshot 2024-11-20 at 1 42 06 PM

Just in case this is needed, I've included the database details:

kestra=> SELECT
    n.nspname || '.' || c.relname AS full_table_name,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
    pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
    pg_size_pretty(pg_total_relation_size(c.oid) - pg_relation_size(c.oid)) AS indexes_size
FROM
    pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE
    c.relkind = 'r'  -- Only base tables
    AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY
    pg_total_relation_size(c.oid) DESC
LIMIT 10;
      full_table_name      | total_size | table_size | indexes_size
---------------------------+------------+------------+--------------
 public.queues             | 28 GB      | 7288 MB    | 21 GB
 public.executions         | 4191 MB    | 1050 MB    | 3141 MB
 public.execution_queued   | 1460 MB    | 702 MB     | 758 MB
 public.logs               | 570 MB     | 412 MB     | 158 MB
 public.metrics            | 294 MB     | 233 MB     | 60 MB
 public.executorstate      | 265 MB     | 214 MB     | 51 MB
 public.service_instance   | 1032 kB    | 976 kB     | 56 kB
 public.worker_job_running | 432 kB     | 8192 bytes | 424 kB
 public.flows              | 336 kB     | 48 kB      | 288 kB
 public.triggers           | 160 kB     | 80 kB      | 80 kB
(10 rows)

Environment

I'm using AWS:

services:
  kestra-webserver:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: &env

networks:
  shared-overlay-network:
    external: true
    name: shared-overlay-network

loicmathieu commented 1 day ago

Hi, thanks for the detailed report. Some of the query performance is surprising, so I need to dig in to understand what's going on.

As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?

All services are started once on one big node, and you have a big database; regarding the database spec, it should be enough, so I need to look at the query performance.

I also see that there are queries on the execution_queued table, so you must have a flow with a concurrency limit. Is it the one triggered by the webhook?

loicmathieu commented 21 hours ago

Four of the slow queries, the ones with an execution time of more than 1s, come from dashboards (home, flow, execution). If you have a lot of executions, it is normal that those queries are slow; today the only option is to purge executions from the database more often.

Reducing the dashboard period would also make things better, but we don't offer a global setting for that yet.
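
For illustration, a recurring purge can be set up as a flow along these lines (a sketch only; the exact task type and properties depend on your Kestra version, so check the plugin documentation):

id: purge-executions              # placeholder id
namespace: system                 # placeholder namespace

tasks:
  - id: purge
    # task type and properties assumed for recent Kestra versions; verify in the plugin docs
    type: io.kestra.plugin.core.execution.PurgeExecutions
    endDate: "{{ now() | dateAdd(-30, 'DAYS') }}"   # keep only the last 30 days
    purgeLog: true

triggers:
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 3 * * *"             # every day at 03:00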

The first query is the one that pops a queued execution from the list of queued executions; we use a lock to be able to select and then delete the record from the database. I checked and it correctly uses the index. I suspect you have a lot of concurrent executions, so they contend on the database lock; we'll discuss internally, but we may be able to improve that a bit.
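
Schematically, the pattern is something like this (a simplified illustration, not the exact statement Kestra runs; the key/value column names are assumptions):

BEGIN;

SELECT "key", "value"
FROM execution_queued
WHERE tenant_id = :tenant
  AND namespace = :namespace
  AND flow_id = :flow
ORDER BY "date"
LIMIT 1
FOR UPDATE;          -- concurrent "pops" contend on this row lock

DELETE FROM execution_queued WHERE "key" = :popped_key;

COMMIT;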

Anyway, moving to a database with more resources would also help, as it seems the current one is not able to cope with the high number of executions you are creating.

jgoguette commented 21 hours ago

As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?

Each flow has only 2 tasks: one to log which topic, and another to produce a Kafka message.
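
The shape of each flow is roughly this (a sketch with placeholder values; the Kafka connection details and exact plugin properties are assumptions):

tasks:
  - id: log_topic
    type: io.kestra.plugin.core.log.Log
    message: "Producing to topic {{ trigger.body.topic }}"   # placeholder expression

  - id: produce
    type: io.kestra.plugin.kafka.Produce
    topic: "{{ trigger.body.topic }}"
    properties:
      bootstrap.servers: "broker:9092"    # placeholder broker address
    valueSerializer: JSON
    from:
      value: "{{ trigger.body }}"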

I also see that there are queries on the execution_queued table, so you must have a flow with a concurrency limit. Is it the one triggered by the webhook?

All 4 of my flows had a concurrency limit set in the screenshot above.

I suspect you have a lot of concurrent executions, so they contend on the database lock; we'll discuss internally, but we may be able to improve that a bit.

I set the concurrency limit on each of the 4 flows to about 125 (500 total)

Anyway, moving to a database with more resources would also help, as it seems the current one is not able to cope with the high number of executions you are creating.

So is the bottleneck here only the database, and the fix is simply to increase its specs? I'd imagine a db.r6g.4xlarge would be sufficient for the work.

loicmathieu commented 21 hours ago

Can you try to create an index to see if it provides any improvements?

create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");
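
Once it's created, you can check that the planner actually picks it up with something like this (placeholder values):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM execution_queued
WHERE tenant_id = 'main'                 -- placeholder tenant
  AND namespace = 'company.team'         -- placeholder namespace
  AND flow_id = 'my-webhook-flow'        -- placeholder flow id
ORDER BY "date"
LIMIT 1;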

I'd imagine a db.r6g.4xlarge would be sufficient for the work.

It depends on how many executions you keep in the database. We have nice dashboards that show an execution overview for the last 30 days; those can consume a lot of database resources if they are displayed frequently. That's why I talked about purging executions. If you haven't purged executions yet and have a high number of them in the database, this may explain some slowness.

I set the concurrency limit on each of the 4 flows to about 125 (500 total)

I would not expect that! The concurrency limit is not there to limit the number of tasks processed in parallel; that is limited by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect a target system, or when you want only one execution at a time. So usually you wouldn't add it to all your flows, and you would use a lower number. It uses a database lock under the covers, so here you have a lot of executions concurrently locking the database, which can explain a lot!

jgoguette commented 17 hours ago

This is what I'm working with right now without the concurrency.

Screenshot 2024-11-21 at 2 24 31 PM

Once I removed the concurrency limit on the flows, it seemed to perform a lot better.

Since I cleared my db:

create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");

I won't be able to test this out with my previous load from the screenshot :(

I would not expect that! The concurrency limit is not there to limit the number of tasks processed in parallel; that is limited by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect a target system, or when you want only one execution at a time. So usually you wouldn't add it to all your flows, and you would use a lower number. It uses a database lock under the covers, so here you have a lot of executions concurrently locking the database, which can explain a lot!

This makes it clear to me. Thanks for the explanation.

Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?

I'm currently using:

EC2: 
- Type: m6i.4xlarge
- CPU: 16
- Memory: 64GB

What would be the appropriate number of worker threads for this setup?

loicmathieu commented 4 hours ago

Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?

By default, it will use 4 times the number of CPU cores, which is a sane default. As CPU utilization depends on the type of tasks you run, we cannot do anything smarter. For example, if you run a lot of I/O-bound tasks you may use more threads, as those will most probably be waiting for I/O; but if you run more CPU-intensive tasks (calculations or mappings), 4x may already be a lot.

If you see low CPU utilization during load, you can increase the number of threads, but the default configuration should be a good compromise.
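
For the m6i.4xlarge above, that default works out to 16 × 4 = 64 worker threads. If you want to set it explicitly, a dedicated worker accepts a thread count on the command line (a sketch; double-check the flag name for your version with kestra server worker --help):

# default on an m6i.4xlarge: 4 × 16 vCPUs = 64 threads
# explicit override (verify the flag for your Kestra version):
kestra server worker --thread 64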

loicmathieu commented 4 hours ago

Anyway, as you validated that the concurrency limit is what caused the issue, it will help us find some performance improvements in this area! I will also try to improve the documentation.

Thanks for your detailed feedback, it helps us ;)