spiffe / spire

The SPIFFE Runtime Environment

https://spiffe.io

Apache License 2.0

agent metric for server connectivity #5435

Closed kfox1111 closed 3 weeks ago

kfox1111 commented 2 months ago

There should be a metric available that states whether the agent currently has a connection to the server.

This way, alerts can be generated if the connection drops for too long. This state affects the ability to mint JWTs.

azdagron commented 2 months ago

Server connectivity metrics might be hard to come by. For example, in HA deployments the agent maintains a subconn to each server, so you might have connectivity to some servers but not others.

Is the ultimate goal here to know whether or not JWTs can be minted? That might require a different approach since server connectivity is just one piece of the puzzle.

kfox1111 commented 2 months ago

I think knowing whether JWTs can be minted is a big one, yeah. So is knowing that certs won't be refreshable soon because upstream connectivity is offline. So, if it's easier to implement those things as different metrics, that would work too.

In the general category of:

As a sysadmin, when do I need to act because the system is "broken" enough that it can't fix itself? Hopefully before it breaks rather than after.

azdagron commented 1 month ago

Is the agent sync metric a good approximation of connectivity? We attempt a sync every 5 seconds.

azdagron commented 1 month ago

This one:

| Type | Keys | Labels | Description |
|------|------|--------|-------------|
| Call Counter | `manager`, `sync`, `fetch_entries_updates` | | The Sync Manager is fetching entries updates. |

kfox1111 commented 1 month ago

I think so. But I'm not seeing that in the prom metrics:

$ curl 10.244.0.10:9988 -s | grep fetch   
# HELP spire_server_datastore_bundle_fetch spire_server_datastore_bundle_fetch
# TYPE spire_server_datastore_bundle_fetch counter
spire_server_datastore_bundle_fetch{host="spire-server-0",status="OK"} 29
# HELP spire_server_datastore_bundle_fetch_elapsed_time spire_server_datastore_bundle_fetch_elapsed_time
# TYPE spire_server_datastore_bundle_fetch_elapsed_time summary
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} 0.13499599695205688
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} 0.23601099848747253
spire_server_datastore_bundle_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} 0.23601099848747253
spire_server_datastore_bundle_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 5.876314967870712
spire_server_datastore_bundle_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 29
# HELP spire_server_datastore_ca_journal_fetch spire_server_datastore_ca_journal_fetch
# TYPE spire_server_datastore_ca_journal_fetch counter
spire_server_datastore_ca_journal_fetch{host="spire-server-0",status="OK"} 2
# HELP spire_server_datastore_ca_journal_fetch_elapsed_time spire_server_datastore_ca_journal_fetch_elapsed_time
# TYPE spire_server_datastore_ca_journal_fetch_elapsed_time summary
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} NaN
spire_server_datastore_ca_journal_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 0.2950599938631058
spire_server_datastore_ca_journal_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 2
# HELP spire_server_datastore_node_fetch spire_server_datastore_node_fetch
# TYPE spire_server_datastore_node_fetch counter
spire_server_datastore_node_fetch{host="spire-server-0",status="OK"} 15
# HELP spire_server_datastore_node_fetch_elapsed_time spire_server_datastore_node_fetch_elapsed_time
# TYPE spire_server_datastore_node_fetch_elapsed_time summary
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.5"} 0.17950400710105896
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.9"} 0.18874600529670715
spire_server_datastore_node_fetch_elapsed_time{host="spire-server-0",status="OK",quantile="0.99"} 0.18874600529670715
spire_server_datastore_node_fetch_elapsed_time_sum{host="spire-server-0",status="OK"} 2.5586300119757652
spire_server_datastore_node_fetch_elapsed_time_count{host="spire-server-0",status="OK"} 15

amartinezfayo commented 1 month ago

@kfox1111 those look like server metrics instead of agent metrics?

kfox1111 commented 1 month ago

🤦

Ok, I see it:

$ curl 192.168.39.87:9988 -s | grep fetch_entries
# HELP spire_agent_manager_sync_fetch_entries_updates spire_agent_manager_sync_fetch_entries_updates
# TYPE spire_agent_manager_sync_fetch_entries_updates counter
spire_agent_manager_sync_fetch_entries_updates{host="minikube",status="OK"} 19

The idea being that number should change every minute or so?

kfox1111 commented 1 month ago

Or does status change?

amartinezfayo commented 1 month ago

I think the status="OK" label is what indicates that connectivity to the server is fine, so that could be used in this case.

kfox1111 commented 1 month ago

Are there other statuses that show up? Do they increment during failures?

azdagron commented 1 month ago

There can be other status values. I'm pretty sure you'll see a unique counter per status.

kfox1111 commented 1 month ago

Not really sure how to use the metric then. What kind of query would I run to determine lack of connectivity?
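One way to read the metric, sketched here in Python rather than as an alerting rule: scrape the agent twice, pull out the per-status counters for `spire_agent_manager_sync_fetch_entries_updates`, and check whether the `status="OK"` counter advanced. The function names and the "OK must increase between scrapes" rule are assumptions made for illustration, not something the SPIRE project prescribes.

```python
import re

METRIC = "spire_agent_manager_sync_fetch_entries_updates"

# Matches sample lines such as: name{host="minikube",status="OK"} 19
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def counts_by_status(exposition_text, metric=METRIC):
    """Return {status: counter value} for one scrape of the agent endpoint."""
    counts = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        status = labels.get("status", "")
        counts[status] = counts.get(status, 0.0) + float(match.group("value"))
    return counts


def sync_is_healthy(previous, current):
    """True if the OK counter advanced between two scrapes.

    Hypothetical rule: an agent restart resets the counter, so a scrape
    taken right after a restart may be reported as unhealthy once.
    """
    return current.get("OK", 0.0) > previous.get("OK", 0.0)
```

With the agent attempting a sync every 5 seconds, two scrapes taken a minute apart should show a higher OK value when the server is reachable. In Prometheus itself, a roughly comparable check would be an expression along the lines of `increase(spire_agent_manager_sync_fetch_entries_updates{status="OK"}[5m]) == 0`, with the window chosen to suit the alerting tolerance.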

azdagron commented 1 month ago

@kfox1111 and I talked offline and think this metric is likely sufficient but Kevin is going to confirm.

kfox1111 commented 3 weeks ago

There was some concern that when the status changes to some other value, the agent might stop reporting the OK counter.

I've run it for a while under various conditions, and it doesn't seem to drop status types after they are added unless the agent is restarted. Prometheus has special handling for that case, so I think this is a good solution.
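For anyone wondering what that restart handling amounts to, here is a simplified Python sketch of reset-aware counter arithmetic: when a sample is lower than the one before it, the counter is assumed to have restarted from zero, and the new value is counted as the increase. The function name is hypothetical, and this is only an approximation of the idea, not Prometheus's actual implementation (which also extrapolates over the query window).

```python
def increase_with_resets(samples):
    """Total increase across successive counter samples, tolerating resets."""
    total = 0.0
    previous = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                total += value - previous
            else:
                # The counter went backwards, so assume the agent restarted
                # and the counter began again from zero.
                total += value
        previous = value
    return total
```

For example, the samples `[17, 19, 2, 5]` describe an agent that restarted after reaching 19; the sketch counts 2 + 2 + 3 successful syncs instead of reporting a negative change.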

Thanks!