solo-io / solo-cop

Solo Communities of Practice

Revisit mgmt plane metrics in v2 #35

Open willowmck opened 2 years ago

willowmck commented 2 years ago

This is being tracked in issue 1959.

willowmck commented 2 years ago

Could we use the controller runtime graph as a performance metric?

willowmck commented 2 years ago

I just grabbed all the metrics from the management plane in 2.0.9 and grep'd for the HELP lines:

# HELP cluster_manager_active_clusters 
# HELP cluster_manager_cds_update_attempt 
# HELP cluster_manager_cds_update_duration 
# HELP cluster_manager_cds_update_failure 
# HELP cluster_manager_cds_update_success 
# HELP cluster_manager_cds_update_time 
# HELP cluster_manager_cds_version 
# HELP cluster_manager_cluster_added 
# HELP cluster_manager_cluster_modified 
# HELP cluster_manager_cluster_removed 
# HELP cluster_manager_cluster_updated 
# HELP cluster_manager_update_out_of_merge_window 
# HELP cluster_manager_warming_clusters 
# HELP cluster_xds_grpc_circuit_breakers_default_cx_open 
# HELP cluster_xds_grpc_circuit_breakers_default_cx_pool_open 
# HELP cluster_xds_grpc_circuit_breakers_default_rq_open 
# HELP cluster_xds_grpc_circuit_breakers_default_rq_pending_open 
# HELP cluster_xds_grpc_circuit_breakers_high_cx_pool_open 
# HELP cluster_xds_grpc_default_total_match_count 
# HELP cluster_xds_grpc_http2_pending_send_bytes 
# HELP cluster_xds_grpc_http2_streams_active 
# HELP cluster_xds_grpc_internal_upstream_rq_200 
# HELP cluster_xds_grpc_internal_upstream_rq_2xx 
# HELP cluster_xds_grpc_internal_upstream_rq_completed 
# HELP cluster_xds_grpc_membership_change 
# HELP cluster_xds_grpc_membership_degraded 
# HELP cluster_xds_grpc_membership_excluded 
# HELP cluster_xds_grpc_membership_healthy 
# HELP cluster_xds_grpc_membership_total 
# HELP cluster_xds_grpc_upstream_cx_active 
# HELP cluster_xds_grpc_upstream_cx_connect_ms 
# HELP cluster_xds_grpc_upstream_cx_destroy 
# HELP cluster_xds_grpc_upstream_cx_destroy_local 
# HELP cluster_xds_grpc_upstream_cx_http2_total 
# HELP cluster_xds_grpc_upstream_cx_length_ms 
# HELP cluster_xds_grpc_upstream_cx_max_requests 
# HELP cluster_xds_grpc_upstream_cx_protocol_error 
# HELP cluster_xds_grpc_upstream_cx_rx_bytes_buffered 
# HELP cluster_xds_grpc_upstream_cx_rx_bytes_total 
# HELP cluster_xds_grpc_upstream_cx_total 
# HELP cluster_xds_grpc_upstream_cx_tx_bytes_total 
# HELP cluster_xds_grpc_upstream_rq_200 
# HELP cluster_xds_grpc_upstream_rq_2xx 
# HELP cluster_xds_grpc_upstream_rq_active 
# HELP cluster_xds_grpc_upstream_rq_completed 
# HELP cluster_xds_grpc_upstream_rq_pending_active 
# HELP cluster_xds_grpc_upstream_rq_pending_total 
# HELP cluster_xds_grpc_upstream_rq_total 
# HELP component_proxy_tag_1_13_4_solo__istio_build 
# HELP controller_runtime_active_workers Number of currently used workers per controller
# HELP controller_runtime_max_concurrent_reconciles Maximum number of concurrent reconciles per controller
# HELP controller_runtime_reconcile_errors_total Total number of reconciliation errors per controller
# HELP controller_runtime_reconcile_time_seconds Length of time per reconciliation per controller
# HELP controller_runtime_reconcile_total Total number of reconciliations per controller
# HELP gloo_mesh_reconciler_time_sec how long the reconciler takes in seconds
# HELP gloo_mesh_redis_sync_err Number of times redis has failed to read
# HELP gloo_mesh_snapshot_upserter_op_time_sec how long a snapshot upserter operation takes to upsert in seconds
# HELP gloo_mesh_translation_time_sec how long a context translation takes in seconds
# HELP gloo_mesh_translator_concurrency The number of concurrent translations being run by Gloo Mesh
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# HELP go_goroutines Number of goroutines that currently exist.
# HELP go_info Information about the Go environment.
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# HELP go_memstats_frees_total Total number of frees.
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# HELP go_memstats_heap_objects Number of allocated objects.
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# HELP go_memstats_lookups_total Total number of pointer lookups.
# HELP go_memstats_mallocs_total Total number of mallocs.
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# HELP go_threads Number of OS threads created.
# HELP http_inbound_0_0_0_0_9080_rbac_allowed 
# HELP http_inbound_0_0_0_0_9080_rbac_denied 
# HELP istio_request_bytes 
# HELP istio_request_duration_milliseconds 
# HELP istio_requests_total 
# HELP istio_response_bytes 
# HELP istio_tcp_connections_closed_total 
# HELP istio_tcp_connections_opened_total 
# HELP istio_tcp_received_bytes_total 
# HELP istio_tcp_sent_bytes_total 
# HELP listener_manager_lds_update_attempt 
# HELP listener_manager_lds_update_duration 
# HELP listener_manager_lds_update_failure 
# HELP listener_manager_lds_update_success 
# HELP listener_manager_lds_update_time 
# HELP listener_manager_lds_version 
# HELP listener_manager_listener_added 
# HELP listener_manager_listener_create_success 
# HELP listener_manager_listener_in_place_updated 
# HELP listener_manager_listener_modified 
# HELP listener_manager_listener_removed 
# HELP listener_manager_total_filter_chains_draining 
# HELP listener_manager_total_listeners_active 
# HELP listener_manager_total_listeners_draining 
# HELP listener_manager_total_listeners_warming 
# HELP listener_manager_workers_started 
# HELP objects_synced_total Total number of successful object writes to storage. result indicates the result of the write, i.e. created, updated, unchanged
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# HELP process_max_fds Maximum number of open file descriptors.
# HELP process_open_fds Number of open file descriptors.
# HELP process_resident_memory_bytes Resident memory size in bytes.
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# HELP relay_pull_clients_connected Current number of connected Relay pull clients (Relay Agents).
# HELP relay_push_clients_connected Current number of connected Relay push clients (Relay Agents).
# HELP relay_push_clients_warmed Current number of warmed Relay push clients (Relay Agents).
# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# HELP server_concurrency 
# HELP server_days_until_first_cert_expiring 
# HELP server_dynamic_unknown_fields 
# HELP server_hot_restart_epoch 
# HELP server_initialization_time_ms 
# HELP server_live 
# HELP server_main_thread_watchdog_mega_miss 
# HELP server_main_thread_watchdog_miss 
# HELP server_memory_allocated 
# HELP server_memory_heap_size 
# HELP server_memory_physical_size 
# HELP server_parent_connections 
# HELP server_state 
# HELP server_static_unknown_fields 
# HELP server_stats_recent_lookups 
# HELP server_total_connections 
# HELP server_uptime 
# HELP server_version 
# HELP server_wip_protos 
# HELP server_worker_0_watchdog_mega_miss 
# HELP server_worker_0_watchdog_miss 
# HELP server_worker_1_watchdog_mega_miss 
# HELP server_worker_1_watchdog_miss 
# HELP server_worker_2_watchdog_mega_miss 
# HELP server_worker_2_watchdog_miss 
# HELP server_worker_3_watchdog_miss 
# HELP server_worker_4_watchdog_mega_miss 
# HELP server_worker_4_watchdog_miss 
# HELP server_worker_5_watchdog_mega_miss 
# HELP server_worker_5_watchdog_miss 
# HELP server_worker_6_watchdog_mega_miss 
# HELP server_worker_6_watchdog_miss 
# HELP server_worker_7_watchdog_mega_miss 
# HELP server_worker_7_watchdog_miss 
# HELP wasm_envoy_wasm_runtime_null_active 
# HELP wasm_envoy_wasm_runtime_null_created 
# HELP wasm_filter_stats_filter_cache_hit_metric_cache_count 
# HELP wasm_filter_stats_filter_cache_miss_metric_cache_count 
# HELP workqueue_adds_total Total number of adds handled by workqueue
# HELP workqueue_depth Current depth of workqueue
# HELP workqueue_longest_running_processor_seconds How many seconds has the longest running processor for workqueue been running.
# HELP workqueue_queue_duration_seconds How long in seconds an item stays in workqueue before being requested
# HELP workqueue_retries_total Total number of retries handled by workqueue
# HELP workqueue_unfinished_work_seconds How many seconds of work has been done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# HELP workqueue_work_duration_seconds How long in seconds processing an item from workqueue takes.
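
For anyone who wants to repeat this against another version, here is a minimal sketch of the same "grep on HELP" step in Python. It assumes the raw metrics output has already been saved as text in the Prometheus exposition format. The helper names (`parse_help_lines`, `undocumented`, `count_by_prefix`) are made up for illustration and are not part of any Gloo Mesh tooling. It also flags metrics with an empty HELP string, since many entries in the list above have none.

```python
import re
from collections import Counter

# Matches "# HELP <metric_name> [optional help text]".
HELP_RE = re.compile(r"^# HELP (\S+)(?: (.*))?$")


def parse_help_lines(text):
    """Return {metric_name: help_text} for every '# HELP' line in the text."""
    metrics = {}
    for line in text.splitlines():
        match = HELP_RE.match(line.rstrip())
        if match:
            metrics[match.group(1)] = (match.group(2) or "").strip()
    return metrics


def undocumented(metrics):
    """Return the sorted names of metrics whose HELP text is empty."""
    return sorted(name for name, help_text in metrics.items() if not help_text)


def count_by_prefix(metrics):
    """Count metrics by the part of the name before the first underscore."""
    return Counter(name.split("_", 1)[0] for name in metrics)


if __name__ == "__main__":
    # The HELP lines are copied from the list above; the TYPE line and the
    # sample value are illustrative assumptions, not real output.
    sample = (
        "# HELP cluster_manager_active_clusters \n"
        "# TYPE cluster_manager_active_clusters gauge\n"
        "cluster_manager_active_clusters 3\n"
        "# HELP gloo_mesh_translation_time_sec how long a context translation takes in seconds\n"
        "# HELP workqueue_depth Current depth of workqueue\n"
    )
    found = parse_help_lines(sample)
    print("metrics found:", len(found))
    print("no help text:", undocumented(found))
    print("by prefix:", dict(count_by_prefix(found)))
```

To run it on a real dump, read the saved file and pass its contents to `parse_help_lines`, for example `parse_help_lines(open("metrics.txt").read())`, where `metrics.txt` is a hypothetical file name. Grouping by prefix is only a rough way to see how much of the output comes from each family of metrics (for example `go_`, `workqueue_`, `gloo_mesh_`).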