Closed kfox1111 closed 3 weeks ago
Server connectivity metrics might be hard to come by. For example, in HA deployments the agent maintains a subconn to each server. You might have connectivity to some, but not others.
Is the ultimate goal here to know whether or not JWTs can be minted? That might require a different approach since server connectivity is just one piece of the puzzle.
I think knowing if JWTs can be minted is a big one, yeah. So is knowing that certs won't be able to refresh soon because upstream connectivity is offline. So, if it's easier to implement those things as different metrics, that would work too.
In the general category of:
As a sysadmin, when do I need to act, when the system is "broken" enough that it can't fix itself? Hopefully before it breaks rather than after.
Is the agent sync metric a good approximation of connectivity? We attempt a sync every 5 seconds.
This one:
| Call Counter | `manager`, `sync`, `fetch_entries_updates` | | The Sync Manager is fetching entries updates. |
I think so. But I'm not seeing that in the prom metrics:
$ curl 10.244.0.10:9988 -s | grep fetch
# HELP spire_server_datastore_bundle_fetch spire_server_datastore_bundle_fetch
# TYPE spire_server_datastore_bundle_fetch counter
spire_server_datastore_bundle_fetch{host="spire-server-0",status="OK"} 29
# HELP spire_server_datastore_bundle_fetch_elapsed_time spire_server_datastore_bundle_fetch_elapsed_time
# TYPE spire_server_datastore_bundle_fetch_elapsed_time summary
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} 0.13499599695205688
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} 0.23601099848747253
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} 0.23601099848747253
spire_server_datastore_bundle_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 5.876314967870712
spire_server_datastore_bundle_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 29
# HELP spire_server_datastore_ca_journal_fetch spire_server_datastore_ca_journal_fetch
# TYPE spire_server_datastore_ca_journal_fetch counter
spire_server_datastore_ca_journal_fetch{host="spire-server-0",status="OK"} 2
# HELP spire_server_datastore_ca_journal_fetch_elapsed_time spire_server_datastore_ca_journal_fetch_elapsed_time
# TYPE spire_server_datastore_ca_journal_fetch_elapsed_time summary
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 0.2950599938631058
spire_server_datastore_ca_journal_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 2
# HELP spire_server_datastore_node_fetch spire_server_datastore_node_fetch
# TYPE spire_server_datastore_node_fetch counter
spire_server_datastore_node_fetch{host="spire-server-0",status="OK"} 15
# HELP spire_server_datastore_node_fetch_elapsed_time spire_server_datastore_node_fetch_elapsed_time
# TYPE spire_server_datastore_node_fetch_elapsed_time summary
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} 0.17950400710105896
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} 0.18874600529670715
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} 0.18874600529670715
spire_server_datastore_node_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 2.5586300119757652
spire_server_datastore_node_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 15
@kfox1111 those look like server metrics instead of agent metrics?
🤦
Ok, I see it:
$ curl 192.168.39.87:9988 -s | grep fetch_entries
# HELP spire_agent_manager_sync_fetch_entries_updates spire_agent_manager_sync_fetch_entries_updates
# TYPE spire_agent_manager_sync_fetch_entries_updates counter
spire_agent_manager_sync_fetch_entries_updates{host="minikube",status="OK"} 19
The idea being that number should change every minute or so?
Or does status change?
I think the status="OK" label is what tells you that connectivity to the server is OK, and that could be used in this case.
Are there other statuses that show up? Do they increment during failures?
There can be other status values. I'm pretty sure you'll see a unique counter per status.
Not really sure how to use the metric then. What kind of query would I run to determine lack of connectivity?
@kfox1111 and I talked offline and think this metric is likely sufficient but Kevin is going to confirm.
There was some concern that when status goes to some other state, it might drop the OK state records.
I've run it for a while under various conditions, and it doesn't seem to drop status types after they are added unless the agent is restarted. Prometheus has special handling for that case, so I think this is a good solution.
Thanks!
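For anyone landing here with the same "what would I actually check" question, here is a rough Python sketch of the idea discussed above: scrape the agent's metrics endpoint twice, read the per-status counters for spire_agent_manager_sync_fetch_entries_updates, and treat "the OK counter did not advance" as a sign of lost connectivity. The function names are made up for illustration, the "Unavailable" status in the usage example is a hypothetical non-OK value, and the parser only handles the simple label format shown in the curl output above.

```python
import re

METRIC = "spire_agent_manager_sync_fetch_entries_updates"

# One sample line, e.g.: name{host="minikube",status="OK"} 19
# Simplification: assumes label values never contain '}' or escaped quotes.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def counts_by_status(exposition_text, metric=METRIC):
    """Return {status: counter value} for one metric in a metrics scrape."""
    counts = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(m.group("labels")))
        status = labels.get("status", "")
        counts[status] = counts.get(status, 0.0) + float(m.group("value"))
    return counts


def ok_syncs_since(previous, current):
    """Successful syncs between two scrapes.

    If the OK counter went down, the agent was probably restarted, so the
    current value is taken as the increase.
    """
    prev_ok = previous.get("OK", 0.0)
    curr_ok = current.get("OK", 0.0)
    return curr_ok if curr_ok < prev_ok else curr_ok - prev_ok


def looks_disconnected(previous, current):
    """True when no successful sync was recorded between the two scrapes."""
    return ok_syncs_since(previous, current) == 0
```

Usage, with two scrapes taken well over 5 seconds apart ("Unavailable" is a hypothetical failure status):

```python
before = counts_by_status(
    'spire_agent_manager_sync_fetch_entries_updates{host="minikube",status="OK"} 19\n'
)
after = counts_by_status(
    'spire_agent_manager_sync_fetch_entries_updates{host="minikube",status="OK"} 19\n'
    'spire_agent_manager_sync_fetch_entries_updates{host="minikube",status="Unavailable"} 3\n'
)
print(looks_disconnected(before, after))
```

Inside Prometheus the same check would presumably be written with increase() on the status="OK" series over a window of your choosing, which already accounts for counter resets after an agent restart.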
There should be a metric available that states whether the agent currently has a connection to the server.
This way, alerts can be generated if the connection drops for too long. This state affects the ability to mint JWTs.