Open · jgoguette opened this issue 1 day ago
Hi, thanks for the detailed report. Some of the query performance numbers are surprising, so I need to dig in to understand what's going on.
As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?
All services are started once on a big node, and you have a big database; regarding the database specs, that should be enough, so I need to look at the query performance.
I also see that there are queries on the executions_queued table, so you must have at least one flow with a concurrency limit; is it the one triggered by the webhook?
Four of the slow queries, the ones with an execution time of more than 1s, come from dashboards (home, flow, execution). If you have a lot of executions, it is normal that those queries are slow; today the only option is to purge executions from the database more often.
Reducing the dashboard period would also make things better, but we don't offer a global setting for that yet.
The first query is to pop a queued execution from the list of queued executions; we use a lock to be able to select and then delete the record from the database. I checked, and it correctly uses the index. I suspect you have a lot of concurrent executions, so they contend on the database lock. We'll discuss internally, but we may be able to improve that a bit.
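To make this concrete, here is a minimal sketch of that lock/select/delete pattern. It is not the exact query Kestra runs; the table and column names are borrowed from the index suggested further down in this thread, and the literal values are placeholders.

```sql
BEGIN;

-- Lock the oldest queued execution for a given flow so only one popper gets it.
SELECT *
FROM execution_queued
WHERE tenant_id = 'main'          -- placeholder values
  AND namespace = 'company.team'
  AND flow_id   = 'my_flow'
ORDER BY "date"
LIMIT 1
FOR UPDATE;

-- The application then deletes the locked row (by its primary key) and emits
-- the execution before committing. Concurrent poppers for the same flow wait
-- on this row lock, which is the contention suspected above.

COMMIT;
```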
Anyway, giving the database more resources would also help, as it seems the database is not able to cope with the high number of executions you are creating.
As I understand it, you have almost 2k executions created per minute, which is a lot, but still something Kestra should be able to handle. Do you know how many tasks you have per execution?
Each flow has only 2 tasks: one to log which topic, and another to produce a Kafka message.
I also see that there are queries on the executions_queued table, so you must have at least one flow with a concurrency limit; is it the one triggered by the webhook?
All 4 of my flows had a concurrency limit set in the screenshot above.
I suspect you have a lot of concurrent executions, so they contend on the database lock. We'll discuss internally, but we may be able to improve that a bit.
I set the concurrency limit on each of the 4 flows to about 125 (500 total)
Anyway, giving the database more resources would also help, as it seems the database is not able to cope with the high number of executions you are creating.
So is the bottleneck here only the database? And the fix is simply to increase the specs? I'd imagine db.r6g.4xlarge would be sufficient for the work.
Can you try to create an index to see if it provides any improvements?
create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");
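You can verify whether the planner picks it up with an EXPLAIN on one of the queued-execution lookups; a sketch with placeholder values (adapt the WHERE clause to match one of the slow queries from pg_stat_statements):

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM execution_queued
WHERE tenant_id = 'main'
  AND namespace = 'company.team'
  AND flow_id   = 'my_flow'
ORDER BY "date"
LIMIT 1;
-- Look for "Index Scan using execution_queued__flow_date" in the plan output
-- instead of a sequential scan.
```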
I'd imagine db.r6g.4xlarge would be sufficient for the work.
It depends on how many executions you keep in the database. We have nice dashboards that show an execution overview for the last 30 days; those can consume a lot of database resources if they are displayed frequently. That's why I talked about purging executions. If you haven't already purged executions and have a high number of them in the database, this may explain some of the slowness.
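For illustration, a rough sketch of a scheduled purge flow; the task type, properties, and expressions below are assumptions to check against the core plugin documentation for your Kestra version, and the ids and retention period are placeholders.

```yaml
id: purge_old_executions
namespace: system

tasks:
  # Assumption: the exact purge task type and properties may differ by version;
  # verify against the "Purge" task in the core plugin documentation.
  - id: purge
    type: io.kestra.plugin.core.storage.Purge
    endDate: "{{ now() | dateAdd(-30, 'DAYS') }}"  # keep only the last 30 days

triggers:
  - id: daily
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 3 * * *"   # run every night at 03:00
```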
I set the concurrency limit on each of the 4 flows to about 125 (500 total)
I would not expect that! The concurrency limit is not meant to limit parallel task processing; that is limited by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect the target system, or when you want only one execution at a time. So usually you wouldn't add it to all your flows, and you'd use a lower number. It uses a database lock under the cover, so here you have a lot of executions concurrently locking the database, which can explain a lot!
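For illustration, a minimal sketch of what a flow-level concurrency limit looks like; the flow id, namespace, and values are placeholders, not taken from this issue.

```yaml
id: webhook_producer
namespace: company.team

# Limits how many executions of THIS flow run at the same time; extra
# executions are queued (behavior: QUEUE) and end up in the queued-executions
# table discussed above, which is where the lock contention comes from.
# Only add this where the target system really needs protecting.
concurrency:
  limit: 5
  behavior: QUEUE

tasks:
  - id: log_topic
    type: io.kestra.plugin.core.log.Log
    message: "received a webhook call"
```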
This is what I'm working with right now without the concurrency limit.
Once I removed the concurrency limit on the flows, it seemed to perform a lot better.
Since I cleared my db:
create index execution_queued__flow_date on execution_queued(tenant_id, namespace, flow_id, "date");
I won't be able to test this out with my previous load from the screenshot :(
I would not expect that! The concurrency limit is not meant to limit parallel task processing; that is limited by the worker thread pool. The concurrency limit is there to avoid running too many executions of a flow in parallel, to protect the target system, or when you want only one execution at a time. So usually you wouldn't add it to all your flows, and you'd use a lower number. It uses a database lock under the cover, so here you have a lot of executions concurrently locking the database, which can explain a lot!
This makes it clear to me. Thanks for the explanation.
Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?
I'm currently using:
EC2:
- Type: m6i.4xlarge
- CPU: 16
- Memory: 64GB
What would be the appropriate amount of worker threads to set for this setup?
Does Kestra have any recommendations for the number of worker threads to set depending on which EC2 instance is used?
By default, it will use 4 times the number of CPU cores, which is a sane default. As CPU utilization depends on the type of tasks you run, we cannot do anything smarter. For example, if you run a lot of I/O tasks, you may use more threads, as those will most probably be waiting for I/O; but if you run more CPU-intensive tasks (calculation or mapping), the default may already be a lot.
If you see low CPU utilization under load, you can increase the number of threads, but the default configuration should be a good compromise.
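As a concrete example for the m6i.4xlarge you mentioned (16 vCPUs), the rule of thumb gives 16 × 4 = 64 threads, so you could either drop the --thread override to get the default or set it explicitly; a compose fragment as a sketch:

```yaml
# Sketch of the worker service command for a 16-vCPU node:
# 16 vCPUs x 4 = 64 threads (instead of the current --thread=256).
command: server worker --thread=64
```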
Anyway, since you validated that the concurrency limit is what caused the issue, it helps us target some performance improvements in this area! I will also try to improve the documentation.
Thanks for your detailed feedback, it helps us ;)
Describe the issue
When a large number of triggers/executions occurs, the application starts to slow down immensely. Executions are created, but no tasks actually start, even when the execution is in the Running state, and if tasks do start they take a very long time to execute. I tried setting up concurrency; with that I just get a huge backlog of executions to process.
This is a screenshot showing the current situation:
Examples showing an execution created hours ago that is still in the Running state, yet no tasks have been created:

My current setup for the triggers includes:
Using the pg_stat_statements PostgreSQL extension with the following query, I was able to identify the top slowest queries, which you can find in the Google Sheet linked HERE.
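For reference, a query along these lines; this is an illustrative sketch, assuming PostgreSQL 13+ column names, rather than the exact query used.

```sql
-- Top statements by mean execution time from pg_stat_statements.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```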
Screenshot for quick glance:
Just in case this is needed, I've included the database details:
Environment
I'm using AWS:
services:
  kestra-webserver:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: &env
    networks:
      - shared-overlay-network

  kestra-worker-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server worker --thread=256
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-executor-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server executor
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-worker-2:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server worker --thread=256
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-executor-2:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server executor
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-worker-3:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server worker --thread=256
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-executor-3:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server executor
    volumes: volumes
    environment: *configuration
    depends_on:
    networks:
      - shared-overlay-network

  kestra-scheduler-1:
    image: kestra/kestra:v0.19.9
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
        max_attempts: 30
    env_file: env
    user: "root"
    command: server scheduler
    volumes: volumes
    environment: *configuration
    depends_on:

networks:
  shared-overlay-network:
    external: true
    name: shared-overlay-network