terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

PromMetrics exports execution and controller metrics for old/stopped executions #3743

Closed briend closed 2 weeks ago

briend commented 2 months ago

After configuring teraslice v2.1.0 to use the new internal prom metrics, the number of exported series can (seemingly) grow without bound. Old executions for the same job_id accumulate, whereas previously only executions in (apparently) non-terminal statuses such as running were exported. If old executions are eventually removed automatically, this new behavior may make sense, but I'm not sure whether they are or whether there are plans for that.

terafoundation:
  prom_metrics_enabled: true

Only one of the executions below is in state running and the other two are stopped; previously only 1 series was exported and now there are 3. The same applies to teraslice_execution_status, teraslice_controller_.*, etc.

teraslice_execution_info{assignment="master", container="teraslice", ex_id="99d401f2-9d11-4ce9-8882-fbfc246ba2f3", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="92f53103-7de6-4bb3-929b-e2bb13598502", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"}  1

teraslice_execution_info{assignment="master", container="teraslice", ex_id="a837cda0-7801-43e0-96e5-47ebfd3f1303", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="92f53103-7de6-4bb3-929b-e2bb13598502", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"} 1

teraslice_execution_info{assignment="master", container="teraslice", ex_id="undefined", image="terascope/teraslice:v2.1.0-nodev18.19.1", instance="node1", job="teraslice-ops2", job_id="undefined", name="teraslice-ops2", namespace="ts-ops2", pod="teraslice-ops2-master-84d7d64f7b-5n9sj", prometheus="ops/mon-prometheus", service="teraslice-ops2", version="2.1.0"}  1

Also note that one of these has job_id="undefined" and ex_id="undefined", which may be an additional bug, since the executions listed by the normal /txt/ex endpoint look fine:

curl -Ss ts-ops2/txt/ex
name             lifecycle   slicers  workers  _status  ex_id                                 job_id                                _created                  _updated                
---------------  ----------  -------  -------  -------  ------------------------------------  ------------------------------------  ------------------------  ------------------------
datagen-noop-v1  persistent  1        1        running  a837cda0-7801-43e0-96e5-47ebfd3f1303  92f53103-7de6-4bb3-929b-e2bb13598502  2024-09-06T22:33:13.407Z  2024-09-06T22:33:35.682Z
datagen-noop-v1  persistent  1        1        stopped  99d401f2-9d11-4ce9-8882-fbfc246ba2f3  92f53103-7de6-4bb3-929b-e2bb13598502  2024-07-25T20:54:42.083Z  2024-09-06T22:33:04.632Z
datagen-noop-v1  persistent  1        1        stopped  c2faa2f4-9148-4cf8-b97f-7bda77db8e9b  92f53103-7de6-4bb3-929b-e2bb13598502  2024-07-25T20:40:27.951Z  2024-07-25T20:54:29.218Z
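
For illustration, the earlier behavior described above (exporting only non-terminal executions) could be sketched roughly as follows. This is a minimal sketch, not the exporter's actual code: the `ExecutionRecord` interface, the `exportableExecutions` helper, and the list of terminal statuses are assumptions made for this example.

```typescript
// Hypothetical execution record; field names mirror the /txt/ex columns above.
interface ExecutionRecord {
    ex_id: string;
    job_id: string;
    _status: string;
}

// Assumption: these statuses are terminal. The real exporter's list may differ.
const TERMINAL_STATUSES = new Set([
    'completed', 'stopped', 'rejected', 'failed', 'terminated'
]);

// Keep only executions that are still active, so the number of
// exported series stays bounded by the number of live executions.
function exportableExecutions(executions: ExecutionRecord[]): ExecutionRecord[] {
    return executions.filter((ex) => !TERMINAL_STATUSES.has(ex._status));
}

// The three executions from the /txt/ex listing above.
const listed: ExecutionRecord[] = [
    { ex_id: 'a837cda0-7801-43e0-96e5-47ebfd3f1303', job_id: '92f53103-7de6-4bb3-929b-e2bb13598502', _status: 'running' },
    { ex_id: '99d401f2-9d11-4ce9-8882-fbfc246ba2f3', job_id: '92f53103-7de6-4bb3-929b-e2bb13598502', _status: 'stopped' },
    { ex_id: 'c2faa2f4-9148-4cf8-b97f-7bda77db8e9b', job_id: '92f53103-7de6-4bb3-929b-e2bb13598502', _status: 'stopped' }
];

// Only the running execution would be exported.
console.log(exportableExecutions(listed).map((ex) => ex.ex_id));
```

With a filter like this, a job that has been restarted many times would still contribute only one set of series per scrape.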

godber commented 2 months ago

This should be fixed in the v2.3.2 release.

godber commented 1 month ago

@briend if this is resolved to your satisfaction please report back and close this issue.

busma13 commented 1 month ago

I looked into the teraslice_execution_info metric with undefined job_id and ex_id. The problem was that cluster/state included the master pod as well as the execution controller and worker pods. I am making a change to filter out pods that have no ex_id.
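
A rough sketch of the kind of filter described here, under stated assumptions: the `PodEntry` shape and the `podsWithExecution` helper are hypothetical and are not taken from the Teraslice codebase. The idea is simply to drop entries without an ex_id before any labels are built, so that no series is emitted with ex_id="undefined".

```typescript
// Hypothetical pod entry as it might appear in cluster state.
// This is an illustrative type, not the real Teraslice one.
interface PodEntry {
    pod: string;
    assignment: string;
    ex_id?: string;
    job_id?: string;
}

// Keep only pods tied to an execution. The master pod has no ex_id,
// so it is dropped before any metric labels are built from it.
function podsWithExecution(pods: PodEntry[]): PodEntry[] {
    return pods.filter((p) => typeof p.ex_id === 'string' && p.ex_id.length > 0);
}

const clusterPods: PodEntry[] = [
    { pod: 'teraslice-ops2-master-84d7d64f7b-5n9sj', assignment: 'master' },
    {
        pod: 'example-execution-controller-pod', // made-up pod name
        assignment: 'execution_controller',
        ex_id: 'a837cda0-7801-43e0-96e5-47ebfd3f1303',
        job_id: '92f53103-7de6-4bb3-929b-e2bb13598502'
    }
];

// Only the execution controller entry remains.
console.log(podsWithExecution(clusterPods).map((p) => p.assignment));
```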

busma13 commented 1 month ago

Here are the ts-dev1 prom metrics after starting and stopping a job multiple times. There are no duplicates anymore, and the master pod is no longer shown. I have also fixed the url label and removed the assignment label.

# HELP teraslice_master_info Information about Teraslice cluster master
# TYPE teraslice_master_info gauge
teraslice_master_info{arch="x64",clustering_type="kubernetesV2",name="teraslice-dev1",node_version="v22.9.0",platform="linux",teraslice_version="2.6.0",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_slices_processed Total slices processed across the cluster
# TYPE teraslice_slices_processed gauge
teraslice_slices_processed{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1457

# HELP teraslice_slices_failed Total slices failed across the cluster
# TYPE teraslice_slices_failed gauge
teraslice_slices_failed{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_slices_queued Total slices queued across the cluster
# TYPE teraslice_slices_queued gauge
teraslice_slices_queued{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_workers_joined Total workers joined across the cluster
# TYPE teraslice_workers_joined gauge
teraslice_workers_joined{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_workers_disconnected Total workers disconnected across the cluster
# TYPE teraslice_workers_disconnected gauge
teraslice_workers_disconnected{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_workers_reconnected Total workers reconnected across the cluster
# TYPE teraslice_workers_reconnected gauge
teraslice_workers_reconnected{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_controller_workers_active Number of Teraslice workers actively processing slices.
# TYPE teraslice_controller_workers_active gauge
teraslice_controller_workers_active{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_controller_workers_available Number of Teraslice workers running and waiting for work.
# TYPE teraslice_controller_workers_available gauge
teraslice_controller_workers_available{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_controller_workers_joined Total number of Teraslice workers that have joined the execution controller for this job.
# TYPE teraslice_controller_workers_joined gauge
teraslice_controller_workers_joined{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_controller_workers_reconnected Total number of Teraslice workers that have reconnected to the execution controller for this job.
# TYPE teraslice_controller_workers_reconnected gauge
teraslice_controller_workers_reconnected{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_controller_workers_disconnected Total number of Teraslice workers that have disconnected from execution controller for this job.
# TYPE teraslice_controller_workers_disconnected gauge
teraslice_controller_workers_disconnected{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_execution_info Information about Teraslice execution.
# TYPE teraslice_execution_info gauge
teraslice_execution_info{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",image="harbor.tera4.lan/dev/terascope/teraslice:2.6.0-node22.9.0-testMetrics1",version="2.6.0",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_controller_slicers_count Number of execution controllers (slicers) running for this execution.
# TYPE teraslice_controller_slicers_count gauge
teraslice_controller_slicers_count{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_execution_cpu_limit CPU core limit for a Teraslice worker container.
# TYPE teraslice_execution_cpu_limit gauge

# HELP teraslice_execution_cpu_request Requested number of CPU cores for a Teraslice worker container.
# TYPE teraslice_execution_cpu_request gauge

# HELP teraslice_execution_memory_limit Memory limit for Teraslice a worker container.
# TYPE teraslice_execution_memory_limit gauge

# HELP teraslice_execution_memory_request Requested amount of memory for a Teraslice worker container.
# TYPE teraslice_execution_memory_request gauge

# HELP teraslice_execution_status Current status of the Teraslice execution.
# TYPE teraslice_execution_status gauge
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="pending",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="scheduling",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="initializing",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="running",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="recovering",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="failing",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="paused",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="stopping",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="completed",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="stopped",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="rejected",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="failed",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="terminated",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_controller_slices_processed Number of slices processed.
# TYPE teraslice_controller_slices_processed gauge
teraslice_controller_slices_processed{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 35

# HELP teraslice_controller_slices_failed Number of slices failed.
# TYPE teraslice_controller_slices_failed gauge
teraslice_controller_slices_failed{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0

# HELP teraslice_controller_slices_queued Number of slices queued for processing.
# TYPE teraslice_controller_slices_queued gauge
teraslice_controller_slices_queued{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 3

# HELP teraslice_execution_created_timestamp_seconds Execution creation time.
# TYPE teraslice_execution_created_timestamp_seconds gauge
teraslice_execution_created_timestamp_seconds{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1729202268.65

# HELP teraslice_execution_updated_timestamp_seconds Execution update time.
# TYPE teraslice_execution_updated_timestamp_seconds gauge
teraslice_execution_updated_timestamp_seconds{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1729202307.951

# HELP teraslice_execution_slicers Number of slicers defined on the execution.
# TYPE teraslice_execution_slicers gauge
teraslice_execution_slicers{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1

# HELP teraslice_execution_workers Number of workers defined on the execution.  Note that the number of actual workers can differ from this value.
# TYPE teraslice_execution_workers gauge
teraslice_execution_workers{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
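
One way to check output like the above for leftover series from old executions is to count distinct ex_id values per job_id for a given metric. The following is a minimal sketch that assumes the scrape text is already available as a string; `parseSeriesLine` and `executionsPerJob` are hypothetical helpers written for this example, and the parser only handles simple labeled lines of the form shown above.

```typescript
// Parse one exposition line of the form: name{k="v",...} value
// Returns null for comments, blank lines, and lines without labels.
function parseSeriesLine(line: string): { name: string; labels: Record<string, string> } | null {
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+\S.*$/.exec(line.trim());
    if (!match) return null;
    const labels: Record<string, string> = {};
    const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
    let m: RegExpExecArray | null;
    while ((m = labelPattern.exec(match[2])) !== null) {
        labels[m[1]] = m[2];
    }
    return { name: match[1], labels };
}

// Map each job_id to the set of ex_ids that appear for the given metric.
function executionsPerJob(text: string, metric: string): Map<string, Set<string>> {
    const result = new Map<string, Set<string>>();
    for (const line of text.split('\n')) {
        const parsed = parseSeriesLine(line);
        if (!parsed || parsed.name !== metric) continue;
        const jobId = parsed.labels['job_id'];
        const exId = parsed.labels['ex_id'];
        if (jobId === undefined || exId === undefined) continue;
        const seen = result.get(jobId) ?? new Set<string>();
        seen.add(exId);
        result.set(jobId, seen);
    }
    return result;
}

// Shortened sample in the same shape as the output above (made-up IDs).
const sample = [
    '# TYPE teraslice_execution_info gauge',
    'teraslice_execution_info{ex_id="ex-1",job_id="job-1",version="2.6.0"} 1',
    'teraslice_execution_info{ex_id="ex-2",job_id="job-1",version="2.6.0"} 1'
].join('\n');

// Report any job that has more than one execution exported.
for (const [jobId, exIds] of executionsPerJob(sample, 'teraslice_execution_info')) {
    if (exIds.size > 1) {
        console.log(`job ${jobId} has ${exIds.size} executions exported`);
    }
}
```

Whether more than one ex_id per job is actually a problem depends on the deployment; the check only makes accumulation visible.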

busma13 commented 1 month ago

@godber We added cluster-wide metrics like teraslice_slices_processed. Do these need any labels beyond the defaults?

briend commented 2 weeks ago

This looks good to me, so I think with v2.6.4 we will start disabling the old exporter and using the internal metrics server. The only thing to mention is that labels like version are slightly different: they no longer use the v prefix or include the node version. I think that's probably intentional, or at least cosmetic.