This should be fixed in the `v2.3.2` release.

@briend if this is resolved to your satisfaction, please report back and close this issue.
I looked into the `teraslice_execution_info` metric with undefined `job_id` and `ex_id`. The problem was that `cluster/state` included the master pod as well as the execution controller and worker pods. I am making a change to filter out pods without an `ex_id`.
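For reference, the fix amounts to dropping cluster-state entries that have no `ex_id` before building per-execution metrics. A minimal sketch in TypeScript (the pod shape and function name here are illustrative, not the actual Teraslice internals):

```typescript
// Illustrative subset of a cluster/state pod entry; the real state
// objects carry more fields (assignment, pod_ip, etc.).
interface PodState {
    assignment: string;
    ex_id?: string;
    job_id?: string;
}

// Keep only pods that belong to an execution, i.e. that have an ex_id.
// The master pod has no ex_id, so it is dropped here and never produces
// series with job_id="undefined" / ex_id="undefined".
function filterExecutionPods(pods: PodState[]): PodState[] {
    return pods.filter((pod) => pod.ex_id != null && pod.ex_id !== 'undefined');
}
```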
`ts-dev1` prom metrics after starting and stopping a job multiple times. There are no duplicates anymore, and the master pod is no longer being shown. I have also fixed the `url` label and removed the `assignment` label.
```
# HELP teraslice_master_info Information about Teraslice cluster master
# TYPE teraslice_master_info gauge
teraslice_master_info{arch="x64",clustering_type="kubernetesV2",name="teraslice-dev1",node_version="v22.9.0",platform="linux",teraslice_version="2.6.0",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_slices_processed Total slices processed across the cluster
# TYPE teraslice_slices_processed gauge
teraslice_slices_processed{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1457
# HELP teraslice_slices_failed Total slices failed across the cluster
# TYPE teraslice_slices_failed gauge
teraslice_slices_failed{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_slices_queued Total slices queued across the cluster
# TYPE teraslice_slices_queued gauge
teraslice_slices_queued{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_workers_joined Total workers joined across the cluster
# TYPE teraslice_workers_joined gauge
teraslice_workers_joined{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_workers_disconnected Total workers disconnected across the cluster
# TYPE teraslice_workers_disconnected gauge
teraslice_workers_disconnected{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_workers_reconnected Total workers reconnected across the cluster
# TYPE teraslice_workers_reconnected gauge
teraslice_workers_reconnected{name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_controller_workers_active Number of Teraslice workers actively processing slices.
# TYPE teraslice_controller_workers_active gauge
teraslice_controller_workers_active{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_controller_workers_available Number of Teraslice workers running and waiting for work.
# TYPE teraslice_controller_workers_available gauge
teraslice_controller_workers_available{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_controller_workers_joined Total number of Teraslice workers that have joined the execution controller for this job.
# TYPE teraslice_controller_workers_joined gauge
teraslice_controller_workers_joined{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_controller_workers_reconnected Total number of Teraslice workers that have reconnected to the execution controller for this job.
# TYPE teraslice_controller_workers_reconnected gauge
teraslice_controller_workers_reconnected{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_controller_workers_disconnected Total number of Teraslice workers that have disconnected from execution controller for this job.
# TYPE teraslice_controller_workers_disconnected gauge
teraslice_controller_workers_disconnected{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_execution_info Information about Teraslice execution.
# TYPE teraslice_execution_info gauge
teraslice_execution_info{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",image="harbor.tera4.lan/dev/terascope/teraslice:2.6.0-node22.9.0-testMetrics1",version="2.6.0",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_controller_slicers_count Number of execution controllers (slicers) running for this execution.
# TYPE teraslice_controller_slicers_count gauge
teraslice_controller_slicers_count{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_execution_cpu_limit CPU core limit for a Teraslice worker container.
# TYPE teraslice_execution_cpu_limit gauge
# HELP teraslice_execution_cpu_request Requested number of CPU cores for a Teraslice worker container.
# TYPE teraslice_execution_cpu_request gauge
# HELP teraslice_execution_memory_limit Memory limit for Teraslice a worker container.
# TYPE teraslice_execution_memory_limit gauge
# HELP teraslice_execution_memory_request Requested amount of memory for a Teraslice worker container.
# TYPE teraslice_execution_memory_request gauge
# HELP teraslice_execution_status Current status of the Teraslice execution.
# TYPE teraslice_execution_status gauge
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="pending",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="scheduling",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="initializing",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="running",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="recovering",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="failing",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="paused",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="stopping",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="completed",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="stopped",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="rejected",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="failed",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
teraslice_execution_status{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",status="terminated",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_controller_slices_processed Number of slices processed.
# TYPE teraslice_controller_slices_processed gauge
teraslice_controller_slices_processed{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 35
# HELP teraslice_controller_slices_failed Number of slices failed.
# TYPE teraslice_controller_slices_failed gauge
teraslice_controller_slices_failed{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 0
# HELP teraslice_controller_slices_queued Number of slices queued for processing.
# TYPE teraslice_controller_slices_queued gauge
teraslice_controller_slices_queued{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 3
# HELP teraslice_execution_created_timestamp_seconds Execution creation time.
# TYPE teraslice_execution_created_timestamp_seconds gauge
teraslice_execution_created_timestamp_seconds{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1729202268.65
# HELP teraslice_execution_updated_timestamp_seconds Execution update time.
# TYPE teraslice_execution_updated_timestamp_seconds gauge
teraslice_execution_updated_timestamp_seconds{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1729202307.951
# HELP teraslice_execution_slicers Number of slicers defined on the execution.
# TYPE teraslice_execution_slicers gauge
teraslice_execution_slicers{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
# HELP teraslice_execution_workers Number of workers defined on the execution. Note that the number of actual workers can differ from this value.
# TYPE teraslice_execution_workers gauge
teraslice_execution_workers{ex_id="7b274f71-df08-45c9-8bd6-5e6a2f657aa4",job_id="da2d0a46-da70-450a-af21-d9120f145701",job_name="peter-datagen-to-noop",name="teraslice-dev1",url="https://ts-dev1.tera4.lan"} 1
```
@godber We added cluster-wide metrics like `teraslice_slices_processed`. Do these need any labels beyond the defaults?
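For context, the cluster-wide gauges in the scrape above only carry the default `name` and `url` labels. If they were registered with a `prom-client` style API, the shape would be roughly the following (a sketch under that assumption, not the actual exporter code):

```typescript
import { Gauge, Registry } from 'prom-client';

const registry = new Registry();

// Cluster-wide totals only need the cluster identity labels; the
// per-execution labels (ex_id, job_id, job_name) don't apply here.
const slicesProcessed = new Gauge({
    name: 'teraslice_slices_processed',
    help: 'Total slices processed across the cluster',
    labelNames: ['name', 'url'],
    registers: [registry],
});

// Example update using the default labels seen in the scrape above.
slicesProcessed.set(
    { name: 'teraslice-dev1', url: 'https://ts-dev1.tera4.lan' },
    1457
);
```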
This looks good to me, so with `v2.6.4` I think we will start disabling the old exporter and using the internal metrics server. The only thing to mention is that I noticed labels like `version` are slightly different: they no longer use the `v` prefix or include the node version, but I think that's probably intentional, or at least cosmetic.
After setting teraslice `v2.1.0` to use the new internal prom metrics, the number of metrics exported can (seemingly) grow without bounds. Old executions for the same `job_id` will accumulate, whereas previously only (apparently) non-terminal statuses like `running` were exported. If old executions are eventually removed automatically, maybe this new behavior makes sense, but I'm not sure if they are or if there are plans for that.
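If pruning were desired, one option (assuming a `prom-client` style exporter; this is an illustrative sketch, not a description of current Teraslice behavior) would be to remove labeled series once an execution reaches a terminal status:

```typescript
import { Gauge } from 'prom-client';

// Terminal statuses taken from the teraslice_execution_status series above.
const TERMINAL_STATUSES = new Set(['completed', 'stopped', 'rejected', 'failed', 'terminated']);

// Drop labeled series for executions that are no longer active so that
// old ex_ids stop accumulating in the /metrics output.
function pruneTerminalExecutions(
    gauge: Gauge,
    executions: Array<{ ex_id: string; job_id: string; job_name: string; status: string }>
): void {
    for (const ex of executions) {
        if (TERMINAL_STATUSES.has(ex.status)) {
            // prom-client removes the series matching these label values.
            gauge.remove({ ex_id: ex.ex_id, job_id: ex.job_id, job_name: ex.job_name });
        }
    }
}
```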
Only one of these is in state `running` and the other two are `stopped`; previously only 1 series was exported and now there are 3. Same for `teraslice_execution_status`, `teraslice_controller_.*`, etc.

Also note one of these has `job_id="undefined"` and `ex_id="undefined"`, which may be an additional bug, since the executions that show up with the normal `/txt/ex` endpoint look normal: