kestra-io / kestra

Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
https://kestra.io
Apache License 2.0

Provide metrics providing better overview of cluster processing ability #3421

Open yuri1969 opened 3 months ago

yuri1969 commented 3 months ago

Feature description

The main motivation is to get the status of the cluster from an execution-processing point of view.

This could possibly be achieved via a set of metrics providing info like:


Enabling such metrics is not straightforward as per discussion with @loicmathieu.

loicmathieu commented 3 months ago

You can already access the thread pool sizes. For example, with the JDBC runner you get the following information:

# HELP executor_active_threads The approximate number of threads that are actively executing tasks
# TYPE executor_active_threads gauge
executor_active_threads{name="jdbc-queue-LogEntry",} 0.0
executor_active_threads{name="io",} 0.0
executor_active_threads{name="scheduled",} 0.0
executor_active_threads{name="jdbc-queue-WorkerTriggerResult",} 1.0
executor_active_threads{name="jdbc-queue-ExecutionKilled",} 2.0
executor_active_threads{name="jdbc-queue-Trigger",} 0.0
executor_active_threads{name="jdbc-queue-Execution",} 2.0
executor_active_threads{name="jdbc-queue-WorkerJob",} 1.0
executor_active_threads{name="jdbc-queue-WorkerTaskResult",} 1.0
executor_active_threads{name="blocking",} 0.0
executor_active_threads{name="jdbc-queue-MetricEntry",} 0.0
executor_active_threads{name="jdbc-queue-Flow",} 2.0
executor_active_threads{name="jdbc-queue-SubflowExecutionResult",} 1.0
executor_active_threads{name="standalone-runner",} 0.0
executor_active_threads{name="worker",} 0.0
# HELP executor_pool_max_threads The maximum allowed number of threads in the pool
# TYPE executor_pool_max_threads gauge
executor_pool_max_threads{name="jdbc-queue-LogEntry",} 2.147483647E9
executor_pool_max_threads{name="io",} 2.147483647E9
executor_pool_max_threads{name="scheduled",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-WorkerTriggerResult",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-ExecutionKilled",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-Trigger",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-Execution",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-WorkerJob",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-WorkerTaskResult",} 2.147483647E9
executor_pool_max_threads{name="blocking",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-MetricEntry",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-Flow",} 2.147483647E9
executor_pool_max_threads{name="jdbc-queue-SubflowExecutionResult",} 2.147483647E9
executor_pool_max_threads{name="standalone-runner",} 2.147483647E9
executor_pool_max_threads{name="worker",} 128.0
# HELP executor_pool_core_threads The core number of threads for the pool
# TYPE executor_pool_core_threads gauge
executor_pool_core_threads{name="jdbc-queue-LogEntry",} 0.0
executor_pool_core_threads{name="io",} 0.0
executor_pool_core_threads{name="scheduled",} 1.0
executor_pool_core_threads{name="jdbc-queue-WorkerTriggerResult",} 0.0
executor_pool_core_threads{name="jdbc-queue-ExecutionKilled",} 0.0
executor_pool_core_threads{name="jdbc-queue-Trigger",} 0.0
executor_pool_core_threads{name="jdbc-queue-Execution",} 0.0
executor_pool_core_threads{name="jdbc-queue-WorkerJob",} 0.0
executor_pool_core_threads{name="jdbc-queue-WorkerTaskResult",} 0.0
executor_pool_core_threads{name="blocking",} 0.0
executor_pool_core_threads{name="jdbc-queue-MetricEntry",} 0.0
executor_pool_core_threads{name="jdbc-queue-Flow",} 0.0
executor_pool_core_threads{name="jdbc-queue-SubflowExecutionResult",} 0.0
executor_pool_core_threads{name="standalone-runner",} 0.0
executor_pool_core_threads{name="worker",} 128.0
# HELP executor_pool_size_threads The current number of threads in the pool
# TYPE executor_pool_size_threads gauge
executor_pool_size_threads{name="jdbc-queue-LogEntry",} 0.0
executor_pool_size_threads{name="io",} 0.0
executor_pool_size_threads{name="scheduled",} 1.0
executor_pool_size_threads{name="jdbc-queue-WorkerTriggerResult",} 1.0
executor_pool_size_threads{name="jdbc-queue-ExecutionKilled",} 2.0
executor_pool_size_threads{name="jdbc-queue-Trigger",} 0.0
executor_pool_size_threads{name="jdbc-queue-Execution",} 2.0
executor_pool_size_threads{name="jdbc-queue-WorkerJob",} 1.0
executor_pool_size_threads{name="jdbc-queue-WorkerTaskResult",} 1.0
executor_pool_size_threads{name="blocking",} 0.0
executor_pool_size_threads{name="jdbc-queue-MetricEntry",} 0.0
executor_pool_size_threads{name="jdbc-queue-Flow",} 2.0
executor_pool_size_threads{name="jdbc-queue-SubflowExecutionResult",} 1.0
executor_pool_size_threads{name="standalone-runner",} 2.0
executor_pool_size_threads{name="worker",} 0.0

So, to monitor worker usage and decide whether to scale up or down, you can use the following metrics (and configure Kubernetes horizontal pod autoscaling on them, for example):

executor_active_threads{name="worker",} 0.0
executor_pool_max_threads{name="worker",} 128.0
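
As a rough illustration of how these two gauges could feed a scaling decision, the sketch below parses exposition lines of the shape shown above and computes the fraction of worker threads in use. The function names and the 0.8 / 0.2 thresholds are assumptions made for the example, not part of Kestra or of any autoscaler, and the parser only handles samples with a single `name` label.

```python
import re

# Matches samples such as: executor_active_threads{name="worker",} 0.0
# Only the single-label shape shown in this thread is supported.
SAMPLE = re.compile(
    r'^(?P<metric>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'\{name="(?P<name>[^"]*)",?\}\s+(?P<value>\S+)$'
)


def parse_samples(text):
    """Return {(metric, pool_name): value}, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            key = (match.group("metric"), match.group("name"))
            samples[key] = float(match.group("value"))
    return samples


def pool_utilization(samples, pool="worker"):
    """Fraction of the pool's maximum threads that are currently active."""
    active = samples[("executor_active_threads", pool)]
    maximum = samples[("executor_pool_max_threads", pool)]
    return active / maximum if maximum > 0 else 0.0


def scaling_hint(utilization, high=0.8, low=0.2):
    """Hypothetical thresholds: tune them for your own workload."""
    if utilization >= high:
        return "scale_up"
    if utilization <= low:
        return "scale_down"
    return "hold"


if __name__ == "__main__":
    scraped = (
        'executor_active_threads{name="worker",} 0.0\n'
        'executor_pool_max_threads{name="worker",} 128.0\n'
    )
    usage = pool_utilization(parse_samples(scraped))
    print(usage, scaling_hint(usage))
```

In a real deployment the ratio would more likely be computed by the monitoring system itself and exposed to the autoscaler as a custom metric; the script only makes the arithmetic explicit.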

loicmathieu commented 3 months ago

To provide metrics for RUNNING/CREATED executions, we would need to query the database each time the external metrics system calls the metrics endpoint. Depending on how the metrics system is configured and on the database load, this can be costly.

I'm not against providing it but in this case I would disable it by default.
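
One possible way to bound that cost, sketched here under stated assumptions, is to put the database count behind a small time-based cache so that repeated scrapes within a window reuse the last value, and to keep the gauge off unless it is explicitly enabled. `CachedGauge`, its parameters, and `count_running_executions` are hypothetical names for the example; this is not how Kestra implements its metrics.

```python
import time


class CachedGauge:
    """Wraps an expensive query so it runs at most once per TTL window."""

    def __init__(self, query, ttl_seconds=30.0, enabled=False, clock=time.monotonic):
        self._query = query          # callable performing the costly lookup
        self._ttl = ttl_seconds      # how long a fetched value stays fresh
        self._enabled = enabled      # disabled by default, as suggested above
        self._clock = clock          # injectable clock, useful for tests
        self._value = None
        self._expires_at = None

    def read(self):
        """Return the cached value, refreshing it only when it has expired."""
        if not self._enabled:
            return None
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._query()
            self._expires_at = now + self._ttl
        return self._value


def count_running_executions():
    # Stand-in for a real database query (hypothetical).
    return 42


if __name__ == "__main__":
    gauge = CachedGauge(count_running_executions, ttl_seconds=30.0, enabled=True)
    gauge.read()  # first call runs the query
    gauge.read()  # second call within 30 s is served from the cache
```

The trade-off is staleness: the reported count can lag the database by up to the TTL, which is usually acceptable for a scaling signal but should be chosen with the scrape interval in mind.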

loicmathieu commented 1 week ago

New worker metrics to allow autoscaling have been provided here: https://github.com/kestra-io/kestra/pull/4165

For autoscaling of executors, we may need to think a little more: the only way I can see to count CREATED executions is to query the database, which is not a great idea for metrics.