grafana / agent

Vendor-neutral programmable observability pipelines.
https://grafana.com/docs/agent/
Apache License 2.0

error: skipping update of position for a file which does not currently exist #3885

Closed · amseager closed this 1 year ago

amseager commented 1 year ago

What's wrong?

I have an agent+loki+grafana setup in which the agent is configured to tail the log files of my apps and push them to loki. After starting all of this, the following message begins appearing in the agent logs after a while (roughly every 10 seconds):

ts=2023-05-15T14:07:33.321533823Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1

Eventually you get more and more of these messages at once, so it becomes almost impossible to read the agent logs.
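As a stopgap, raising the agent's own log level hides these info-level messages (a minimal static-mode sketch, assuming the standard server block; note it hides every other info-level log too):

```yaml
# Sketch only: raise the agent's global log level in static mode so that
# info-level tailer messages like the one above are no longer emitted.
server:
  log_level: warn
```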

I found some issues that seem to be related: https://github.com/grafana/loki/issues/3108 https://github.com/grafana/loki/issues/3985

My services do roll their log files after reaching the max file size, although I don't see any problems with "losing" logs or anything like that (loki stores everything in its storage). I'm also not sure about performance problems; maybe I'll run into them later.
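For context, the agent tracks how far it has read each file in the positions file, so a rolled-away file leaves behind an entry that can no longer be refreshed. A sketch of what that file might contain (paths and offsets are illustrative, not taken from my setup):

```yaml
# Hypothetical contents of /tmp/agent/positions.yaml. Each entry maps a tailed
# file to the last byte offset read; once my-service-1 is rolled away, its
# entry can no longer be updated, which is when the "skipping update of
# position" message appears.
positions:
  /tmp/logs/my-service/my-service: "209715"
  /tmp/logs/my-service/my-service-1: "1445682"
```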

Is it critical? I see that there was this PR; maybe it needs to be merged here as well?

Steps to reproduce

  1. Use any app that has log rotation (e.g. a Java app with Spring Boot and a logback config; a minimal rolling setup is sketched below)
  2. Deploy agent+loki
  3. Check the agent logs after some time (once the logs have been rolled)

I'm not 100% sure that log rotation is necessary to reproduce this, though.
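For anyone trying to reproduce this, a minimal Spring Boot application.yaml that logs to a file under the tailed directory and rolls it aggressively might look like the following (the path, size, and history values are just examples, not my actual settings):

```yaml
# Hypothetical application.yaml for a Spring Boot (2.4+) service: write logs
# to a file the agent tails and roll it after a small max size so that files
# disappear from disk quickly.
logging:
  file:
    name: /tmp/logs/my-service/my-service
  logback:
    rollingpolicy:
      max-file-size: 1MB  # roll early so the agent sees files vanish
      max-history: 3      # keep only a few rolled files around
```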

System information

No response

Software version

0.31.3

Configuration

No response

Logs

ts=2023-05-15T14:07:33.297587958Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.309127591Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.309148734Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.309322049Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.309168115Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.31414347Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314134356Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314196232Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.31421379Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314342427Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314356711Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314508679Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.31452438Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314451658Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314467108Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314539858Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314563993Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314608935Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314641607Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.314369259Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.316897402Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.31710958Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.318238841Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.319364763Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
ts=2023-05-15T14:07:33.321533823Z caller=tailer.go:202 level=info component=logs logs_config=default component=tailer msg="skipping update of position for a file which does not currently exist" path=/tmp/logs/my-service/my-service-1
rfratto commented 1 year ago

@thampiotr is this something you'd be able to look into?

thampiotr commented 1 year ago

Sure, will take a look!

thampiotr commented 1 year ago

Hi @amseager - I've tried to reproduce this issue, but so far no luck. I can still try a few options a bit later.

In the meantime, could you share the configuration that you are using for this log file (either the .yaml in static mode or the .river in flow mode)?

Also, could you share the metrics that the agent reports (e.g. curl localhost:12345/metrics), if that's possible? I'm particularly interested in the metrics in the promtail_ or loki_source_file_ namespaces.

amseager commented 1 year ago

@thampiotr Hi, sure.

The logs (Loki) part of the grafana-agent config:

logs:
  configs:
  - name: default
    clients:
      - url: ${LOKI_URL}/loki/api/v1/push
    positions:
      filename: /tmp/agent/positions.yaml
    scrape_configs:
    - job_name: default
      static_configs:
      - targets: [localhost]
        labels:
          job: default
          __path__: /tmp/logs/*/*
      pipeline_stages:
      - regex:
          expression: '^\d+ \[.+?\] \[(?P<application_name>.+?),.+?\] .*? \[.+?\] (?P<log_level>\w+?) .*$'
      - labels:
          application_name:
          log_level:

Agent metrics:

```
# HELP agent_build_info A metric with a constant '1' value labeled by version, revision, branch, goversion from which agent was built, and the goos and goarch for the build. # TYPE agent_build_info gauge agent_build_info{branch="HEAD",goarch="amd64",goos="linux",goversion="go1.19.4",revision="996155fd",version="v0.31.3"} 1 # HELP agent_config_hash Hash of the currently active config file. # TYPE agent_config_hash gauge agent_config_hash{sha256="1e4064282c315864737a8a8c3d0a7826d0e57415e3d73062b004cfcc0c49cf54"} 1 # HELP agent_config_last_load_success_timestamp_seconds Timestamp of the last successful configuration load. # TYPE agent_config_last_load_success_timestamp_seconds gauge agent_config_last_load_success_timestamp_seconds 1.6843076034632854e+09 # HELP agent_config_last_load_successful Config loaded successfully. # TYPE agent_config_last_load_successful gauge agent_config_last_load_successful 1 # HELP agent_config_load_failures_total Configuration load failures. # TYPE agent_config_load_failures_total counter agent_config_load_failures_total 0 # HELP agent_inflight_requests Current number of inflight requests. # TYPE agent_inflight_requests gauge agent_inflight_requests{method="GET",route="metrics"} 1 # HELP agent_metrics_active_configs Current number of active configs being used by the agent. # TYPE agent_metrics_active_configs gauge agent_metrics_active_configs 1 # HELP agent_metrics_active_instances Current number of active instances being used by the agent. # TYPE agent_metrics_active_instances gauge agent_metrics_active_instances 1 # HELP agent_metrics_cleaner_abandoned_storage Number of storage directories not associated with any managed instance # TYPE agent_metrics_cleaner_abandoned_storage gauge agent_metrics_cleaner_abandoned_storage 0 # HELP agent_metrics_cleaner_cleanup_seconds Time spent performing each periodic WAL cleanup # TYPE agent_metrics_cleaner_cleanup_seconds histogram agent_metrics_cleaner_cleanup_seconds_bucket{le="0.005"} 0 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.01"} 0 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.025"} 13 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.05"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.1"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.25"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="0.5"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="1"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="2.5"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="5"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="10"} 14 agent_metrics_cleaner_cleanup_seconds_bucket{le="+Inf"} 14 agent_metrics_cleaner_cleanup_seconds_sum 0.241731946 agent_metrics_cleaner_cleanup_seconds_count 14 # HELP agent_metrics_cleaner_errors_total Number of errors removing abandoned WALs # TYPE agent_metrics_cleaner_errors_total counter agent_metrics_cleaner_errors_total 0 # HELP agent_metrics_cleaner_managed_storage Number of storage directories associated with managed instances # TYPE agent_metrics_cleaner_managed_storage gauge agent_metrics_cleaner_managed_storage 1 # HELP agent_metrics_cleaner_success_total Number of successfully removed abandoned WALs # TYPE agent_metrics_cleaner_success_total counter agent_metrics_cleaner_success_total 0 # HELP agent_metrics_configs_changed_total Total number of dynamically updated configs # TYPE agent_metrics_configs_changed_total gauge agent_metrics_configs_changed_total{event="created"} 1 # HELP agent_metrics_ha_configs_created_total Total number of created
scraping service configs # TYPE agent_metrics_ha_configs_created_total counter agent_metrics_ha_configs_created_total 0 # HELP agent_metrics_ha_configs_deleted_total Total number of deleted scraping service configs # TYPE agent_metrics_ha_configs_deleted_total counter agent_metrics_ha_configs_deleted_total 0 # HELP agent_metrics_ha_configs_updated_total Total number of updated scraping service configs # TYPE agent_metrics_ha_configs_updated_total counter agent_metrics_ha_configs_updated_total 0 # HELP agent_tcp_connections Current number of accepted TCP connections. # TYPE agent_tcp_connections gauge agent_tcp_connections{protocol="grpc"} 0 agent_tcp_connections{protocol="http"} 1 # HELP agent_tcp_connections_limit The maximum number of TCP connections that can be accepted (0 = unlimited) # TYPE agent_tcp_connections_limit gauge agent_tcp_connections_limit{protocol="grpc"} 0 agent_tcp_connections_limit{protocol="http"} 0 # HELP agent_wal_exemplars_appended_total Total number of exemplars appended to the WAL # TYPE agent_wal_exemplars_appended_total counter agent_wal_exemplars_appended_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0 # HELP agent_wal_samples_appended_total Total number of samples appended to the WAL # TYPE agent_wal_samples_appended_total counter agent_wal_samples_appended_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 1.2496445e+07 # HELP agent_wal_storage_active_series Current number of active series being tracked by the WAL storage # TYPE agent_wal_storage_active_series gauge agent_wal_storage_active_series{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 2368 # HELP agent_wal_storage_created_series_total Total number of created series appended to the WAL # TYPE agent_wal_storage_created_series_total counter agent_wal_storage_created_series_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 2545 # HELP agent_wal_storage_deleted_series Current number of series marked for deletion from memory # TYPE agent_wal_storage_deleted_series gauge agent_wal_storage_deleted_series{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0 # HELP agent_wal_storage_removed_series_total Total number of created series removed from the WAL # TYPE agent_wal_storage_removed_series_total counter agent_wal_storage_removed_series_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 177 # HELP blackbox_exporter_config_last_reload_success_timestamp_seconds Timestamp of the last successful configuration reload. # TYPE blackbox_exporter_config_last_reload_success_timestamp_seconds gauge blackbox_exporter_config_last_reload_success_timestamp_seconds 0 # HELP blackbox_exporter_config_last_reload_successful Blackbox exporter config loaded successfully. # TYPE blackbox_exporter_config_last_reload_successful gauge blackbox_exporter_config_last_reload_successful 0 # HELP blackbox_module_unknown_total Count of unknown modules requested by probes # TYPE blackbox_module_unknown_total counter blackbox_module_unknown_total 0 # HELP cortex_experimental_features_in_use_total The number of experimental features in use. # TYPE cortex_experimental_features_in_use_total counter cortex_experimental_features_in_use_total 0 # HELP deprecated_flags_inuse_total The number of deprecated flags currently set. # TYPE deprecated_flags_inuse_total counter deprecated_flags_inuse_total 0 # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles. 
# TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 5.9496e-05 go_gc_duration_seconds{quantile="0.25"} 8.1555e-05 go_gc_duration_seconds{quantile="0.5"} 9.8929e-05 go_gc_duration_seconds{quantile="0.75"} 0.000135314 go_gc_duration_seconds{quantile="1"} 0.002775509 go_gc_duration_seconds_sum 0.083830092 go_gc_duration_seconds_count 443 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 550 # HELP go_info Information about the Go environment. # TYPE go_info gauge go_info{version="go1.19.4"} 1 # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. # TYPE go_memstats_alloc_bytes gauge go_memstats_alloc_bytes 5.1052112e+07 # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. # TYPE go_memstats_alloc_bytes_total counter go_memstats_alloc_bytes_total 1.4858568448e+10 # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. # TYPE go_memstats_buck_hash_sys_bytes gauge go_memstats_buck_hash_sys_bytes 1.943415e+06 # HELP go_memstats_frees_total Total number of frees. # TYPE go_memstats_frees_total counter go_memstats_frees_total 1.22698148e+08 # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. # TYPE go_memstats_gc_sys_bytes gauge go_memstats_gc_sys_bytes 1.3220784e+07 # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. # TYPE go_memstats_heap_alloc_bytes gauge go_memstats_heap_alloc_bytes 5.1052112e+07 # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. # TYPE go_memstats_heap_idle_bytes gauge go_memstats_heap_idle_bytes 6.6199552e+07 # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. # TYPE go_memstats_heap_inuse_bytes gauge go_memstats_heap_inuse_bytes 6.3496192e+07 # HELP go_memstats_heap_objects Number of allocated objects. # TYPE go_memstats_heap_objects gauge go_memstats_heap_objects 212344 # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. # TYPE go_memstats_heap_released_bytes gauge go_memstats_heap_released_bytes 4.6686208e+07 # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. # TYPE go_memstats_heap_sys_bytes gauge go_memstats_heap_sys_bytes 1.29695744e+08 # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. # TYPE go_memstats_last_gc_time_seconds gauge go_memstats_last_gc_time_seconds 1.6843343121716583e+09 # HELP go_memstats_lookups_total Total number of pointer lookups. # TYPE go_memstats_lookups_total counter go_memstats_lookups_total 0 # HELP go_memstats_mallocs_total Total number of mallocs. # TYPE go_memstats_mallocs_total counter go_memstats_mallocs_total 1.22910492e+08 # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. # TYPE go_memstats_mcache_inuse_bytes gauge go_memstats_mcache_inuse_bytes 2400 # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. # TYPE go_memstats_mcache_sys_bytes gauge go_memstats_mcache_sys_bytes 15600 # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. # TYPE go_memstats_mspan_inuse_bytes gauge go_memstats_mspan_inuse_bytes 625824 # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. 
# TYPE go_memstats_mspan_sys_bytes gauge go_memstats_mspan_sys_bytes 1.2204e+06 # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. # TYPE go_memstats_next_gc_bytes gauge go_memstats_next_gc_bytes 8.098272e+07 # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. # TYPE go_memstats_other_sys_bytes gauge go_memstats_other_sys_bytes 721361 # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. # TYPE go_memstats_stack_inuse_bytes gauge go_memstats_stack_inuse_bytes 4.521984e+06 # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. # TYPE go_memstats_stack_sys_bytes gauge go_memstats_stack_sys_bytes 4.521984e+06 # HELP go_memstats_sys_bytes Number of bytes obtained from system. # TYPE go_memstats_sys_bytes gauge go_memstats_sys_bytes 1.51339288e+08 # HELP go_threads Number of OS threads created. # TYPE go_threads gauge go_threads 11 # HELP log_messages_total Total number of log messages. # TYPE log_messages_total counter log_messages_total{level="debug"} 7925 log_messages_total{level="error"} 126 log_messages_total{level="info"} 184279 log_messages_total{level="warn"} 1 # HELP loki_experimental_features_in_use_total The number of experimental features in use. # TYPE loki_experimental_features_in_use_total counter loki_experimental_features_in_use_total 0 # HELP loki_logql_querystats_duplicates_total Total count of duplicates found while executing LogQL queries. # TYPE loki_logql_querystats_duplicates_total counter loki_logql_querystats_duplicates_total 0 # HELP loki_logql_querystats_ingester_sent_lines_total Total count of lines sent from ingesters while executing LogQL queries. # TYPE loki_logql_querystats_ingester_sent_lines_total counter loki_logql_querystats_ingester_sent_lines_total 0 # HELP loki_querier_index_cache_corruptions_total The number of cache corruptions for the index cache. # TYPE loki_querier_index_cache_corruptions_total counter loki_querier_index_cache_corruptions_total 0 # HELP loki_querier_index_cache_encode_errors_total The number of errors for the index cache while encoding the body. # TYPE loki_querier_index_cache_encode_errors_total counter loki_querier_index_cache_encode_errors_total 0 # HELP loki_querier_index_cache_gets_total The number of gets for the index cache. # TYPE loki_querier_index_cache_gets_total counter loki_querier_index_cache_gets_total 0 # HELP loki_querier_index_cache_hits_total The number of cache hits for the index cache. # TYPE loki_querier_index_cache_hits_total counter loki_querier_index_cache_hits_total 0 # HELP loki_querier_index_cache_puts_total The number of puts for the index cache. # TYPE loki_querier_index_cache_puts_total counter loki_querier_index_cache_puts_total 0 # HELP net_conntrack_dialer_conn_attempted_total Total number of connections attempted by the given dialer a given name. # TYPE net_conntrack_dialer_conn_attempted_total counter net_conntrack_dialer_conn_attempted_total{dialer_name="apps"} 953 net_conntrack_dialer_conn_attempted_total{dialer_name="promtail"} 1 net_conntrack_dialer_conn_attempted_total{dialer_name="remote_storage_write_client"} 1 # HELP net_conntrack_dialer_conn_closed_total Total number of connections closed which originated from the dialer of a given name. 
# TYPE net_conntrack_dialer_conn_closed_total counter net_conntrack_dialer_conn_closed_total{dialer_name="apps"} 803 net_conntrack_dialer_conn_closed_total{dialer_name="promtail"} 0 net_conntrack_dialer_conn_closed_total{dialer_name="remote_storage_write_client"} 0 # HELP net_conntrack_dialer_conn_established_total Total number of connections successfully established by the given dialer a given name. # TYPE net_conntrack_dialer_conn_established_total counter net_conntrack_dialer_conn_established_total{dialer_name="apps"} 818 net_conntrack_dialer_conn_established_total{dialer_name="promtail"} 1 net_conntrack_dialer_conn_established_total{dialer_name="remote_storage_write_client"} 1 # HELP net_conntrack_dialer_conn_failed_total Total number of connections failed to dial by the dialer a given name. # TYPE net_conntrack_dialer_conn_failed_total counter net_conntrack_dialer_conn_failed_total{dialer_name="apps",reason="refused"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="apps",reason="resolution"} 135 net_conntrack_dialer_conn_failed_total{dialer_name="apps",reason="timeout"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="apps",reason="unknown"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="promtail",reason="refused"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="promtail",reason="resolution"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="promtail",reason="timeout"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="promtail",reason="unknown"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="remote_storage_write_client",reason="refused"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="remote_storage_write_client",reason="resolution"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="remote_storage_write_client",reason="timeout"} 0 net_conntrack_dialer_conn_failed_total{dialer_name="remote_storage_write_client",reason="unknown"} 0 # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. # TYPE process_cpu_seconds_total counter process_cpu_seconds_total 874.63 # HELP process_max_fds Maximum number of open file descriptors. # TYPE process_max_fds gauge process_max_fds 1.048576e+06 # HELP process_open_fds Number of open file descriptors. # TYPE process_open_fds gauge process_open_fds 73 # HELP process_resident_memory_bytes Resident memory size in bytes. # TYPE process_resident_memory_bytes gauge process_resident_memory_bytes 1.92233472e+08 # HELP process_start_time_seconds Start time of the process since unix epoch in seconds. # TYPE process_start_time_seconds gauge process_start_time_seconds 1.68430760222e+09 # HELP process_virtual_memory_bytes Virtual memory size in bytes. # TYPE process_virtual_memory_bytes gauge process_virtual_memory_bytes 1.666707456e+09 # HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes. # TYPE process_virtual_memory_max_bytes gauge process_virtual_memory_max_bytes 1.8446744073709552e+19 # HELP prometheus_interner_num_strings The current number of interned strings # TYPE prometheus_interner_num_strings gauge prometheus_interner_num_strings 402 # HELP prometheus_interner_string_interner_zero_reference_releases_total The number of times release has been called for strings that are not interned. 
# TYPE prometheus_interner_string_interner_zero_reference_releases_total counter prometheus_interner_string_interner_zero_reference_releases_total 0 # HELP prometheus_remote_storage_bytes_total The total number of bytes of data (not metadata) sent by the queue after compression. Note that when exemplars over remote write is enabled the exemplars included in a remote write request count towards this metric. # TYPE prometheus_remote_storage_bytes_total counter prometheus_remote_storage_bytes_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 3.58517531e+08 # HELP prometheus_remote_storage_enqueue_retries_total Total number of times enqueue has failed because a shards queue was full. # TYPE prometheus_remote_storage_enqueue_retries_total counter prometheus_remote_storage_enqueue_retries_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_exemplars_dropped_total Total number of exemplars which were dropped after being read from the WAL before being sent via remote write, either via relabelling or unintentionally because of an unknown reference ID. # TYPE prometheus_remote_storage_exemplars_dropped_total counter prometheus_remote_storage_exemplars_dropped_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_exemplars_failed_total Total number of exemplars which failed on send to remote storage, non-recoverable errors. # TYPE prometheus_remote_storage_exemplars_failed_total counter prometheus_remote_storage_exemplars_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_exemplars_in_total Exemplars in to remote storage, compare to exemplars out for queue managers. # TYPE prometheus_remote_storage_exemplars_in_total counter prometheus_remote_storage_exemplars_in_total 0 # HELP prometheus_remote_storage_exemplars_pending The number of exemplars pending in the queues shards to be sent to the remote storage. # TYPE prometheus_remote_storage_exemplars_pending gauge prometheus_remote_storage_exemplars_pending{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_exemplars_retried_total Total number of exemplars which failed on send to remote storage but were retried because the send error was recoverable. # TYPE prometheus_remote_storage_exemplars_retried_total counter prometheus_remote_storage_exemplars_retried_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_exemplars_total Total number of exemplars sent to remote storage. # TYPE prometheus_remote_storage_exemplars_total counter prometheus_remote_storage_exemplars_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_highest_timestamp_in_seconds Highest timestamp that has come into the remote storage via the Appender interface, in seconds since epoch. 
# TYPE prometheus_remote_storage_highest_timestamp_in_seconds gauge prometheus_remote_storage_highest_timestamp_in_seconds{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 1.684334329e+09 # HELP prometheus_remote_storage_histograms_dropped_total Total number of histograms which were dropped after being read from the WAL before being sent via remote write, either via relabelling or unintentionally because of an unknown reference ID. # TYPE prometheus_remote_storage_histograms_dropped_total counter prometheus_remote_storage_histograms_dropped_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_histograms_failed_total Total number of histograms which failed on send to remote storage, non-recoverable errors. # TYPE prometheus_remote_storage_histograms_failed_total counter prometheus_remote_storage_histograms_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_histograms_in_total HistogramSamples in to remote storage, compare to histograms out for queue managers. # TYPE prometheus_remote_storage_histograms_in_total counter prometheus_remote_storage_histograms_in_total 0 # HELP prometheus_remote_storage_histograms_pending The number of histograms pending in the queues shards to be sent to the remote storage. # TYPE prometheus_remote_storage_histograms_pending gauge prometheus_remote_storage_histograms_pending{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_histograms_retried_total Total number of histograms which failed on send to remote storage but were retried because the send error was recoverable. # TYPE prometheus_remote_storage_histograms_retried_total counter prometheus_remote_storage_histograms_retried_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_histograms_total Total number of histograms sent to remote storage. # TYPE prometheus_remote_storage_histograms_total counter prometheus_remote_storage_histograms_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_max_samples_per_send The maximum number of samples to be sent, in a single request, to the remote storage. Note that, when sending of exemplars over remote write is enabled, exemplars count towards this limt. # TYPE prometheus_remote_storage_max_samples_per_send gauge prometheus_remote_storage_max_samples_per_send{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 500 # HELP prometheus_remote_storage_metadata_bytes_total The total number of bytes of metadata sent by the queue after compression. # TYPE prometheus_remote_storage_metadata_bytes_total counter prometheus_remote_storage_metadata_bytes_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 1.327263e+06 # HELP prometheus_remote_storage_metadata_failed_total Total number of metadata entries which failed on send to remote storage, non-recoverable errors. 
# TYPE prometheus_remote_storage_metadata_failed_total counter prometheus_remote_storage_metadata_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_metadata_retried_total Total number of metadata entries which failed on send to remote storage but were retried because the send error was recoverable. # TYPE prometheus_remote_storage_metadata_retried_total counter prometheus_remote_storage_metadata_retried_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_metadata_total Total number of metadata entries sent to remote storage. # TYPE prometheus_remote_storage_metadata_total counter prometheus_remote_storage_metadata_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 32040 # HELP prometheus_remote_storage_queue_highest_sent_timestamp_seconds Timestamp from a WAL sample, the highest timestamp successfully sent by this queue, in seconds since epoch. # TYPE prometheus_remote_storage_queue_highest_sent_timestamp_seconds gauge prometheus_remote_storage_queue_highest_sent_timestamp_seconds{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 1.684334329e+09 # HELP prometheus_remote_storage_samples_dropped_total Total number of samples which were dropped after being read from the WAL before being sent via remote write, either via relabelling or unintentionally because of an unknown reference ID. # TYPE prometheus_remote_storage_samples_dropped_total counter prometheus_remote_storage_samples_dropped_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_samples_failed_total Total number of samples which failed on send to remote storage, non-recoverable errors. # TYPE prometheus_remote_storage_samples_failed_total counter prometheus_remote_storage_samples_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_samples_in_total Samples in to remote storage, compare to samples out for queue managers. # TYPE prometheus_remote_storage_samples_in_total counter prometheus_remote_storage_samples_in_total 1.2496445e+07 # HELP prometheus_remote_storage_samples_pending The number of samples pending in the queues shards to be sent to the remote storage. # TYPE prometheus_remote_storage_samples_pending gauge prometheus_remote_storage_samples_pending{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 445 # HELP prometheus_remote_storage_samples_retried_total Total number of samples which failed on send to remote storage but were retried because the send error was recoverable. # TYPE prometheus_remote_storage_samples_retried_total counter prometheus_remote_storage_samples_retried_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0 # HELP prometheus_remote_storage_samples_total Total number of samples sent to remote storage. 
# TYPE prometheus_remote_storage_samples_total counter prometheus_remote_storage_samples_total{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 1.2496e+07 # HELP prometheus_remote_storage_sent_batch_duration_seconds Duration of send calls to the remote storage. # TYPE prometheus_remote_storage_sent_batch_duration_seconds histogram prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.005"} 18216 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.01"} 23424 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.025"} 25224 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.05"} 25424 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.1"} 25435 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.25"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="0.5"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="1"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="2.5"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="5"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="10"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="25"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="60"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="120"} 25437 
prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="300"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_bucket{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write",le="+Inf"} 25437 prometheus_remote_storage_sent_batch_duration_seconds_sum{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 133.43102419299976 prometheus_remote_storage_sent_batch_duration_seconds_count{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 25437 # HELP prometheus_remote_storage_shard_capacity The capacity of each shard of the queue used for parallel sending to the remote storage. # TYPE prometheus_remote_storage_shard_capacity gauge prometheus_remote_storage_shard_capacity{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 2500 # HELP prometheus_remote_storage_shards The number of shards used for parallel sending to the remote storage. # TYPE prometheus_remote_storage_shards gauge prometheus_remote_storage_shards{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 1 # HELP prometheus_remote_storage_shards_desired The number of shards that the queues shard calculation wants to run based on the rate of samples in vs. samples out. # TYPE prometheus_remote_storage_shards_desired gauge prometheus_remote_storage_shards_desired{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 0.0052842736266835505 # HELP prometheus_remote_storage_shards_max The maximum number of shards that the queue is allowed to run. # TYPE prometheus_remote_storage_shards_max gauge prometheus_remote_storage_shards_max{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 200 # HELP prometheus_remote_storage_shards_min The minimum number of shards that the queue is allowed to run. # TYPE prometheus_remote_storage_shards_min gauge prometheus_remote_storage_shards_min{instance_group_name="0928abe60ab577f09352172478cfb1eb",remote_name="0928ab-4a5a87",url="http://monitoring_prometheus:9090/api/v1/write"} 1 # HELP prometheus_sd_azure_failures_total Number of Azure service discovery refresh failures. # TYPE prometheus_sd_azure_failures_total counter prometheus_sd_azure_failures_total 0 # HELP prometheus_sd_consul_rpc_duration_seconds The duration of a Consul RPC call in seconds. 
# TYPE prometheus_sd_consul_rpc_duration_seconds summary prometheus_sd_consul_rpc_duration_seconds{call="service",endpoint="catalog",quantile="0.5"} NaN prometheus_sd_consul_rpc_duration_seconds{call="service",endpoint="catalog",quantile="0.9"} NaN prometheus_sd_consul_rpc_duration_seconds{call="service",endpoint="catalog",quantile="0.99"} NaN prometheus_sd_consul_rpc_duration_seconds_sum{call="service",endpoint="catalog"} 0 prometheus_sd_consul_rpc_duration_seconds_count{call="service",endpoint="catalog"} 0 prometheus_sd_consul_rpc_duration_seconds{call="services",endpoint="catalog",quantile="0.5"} NaN prometheus_sd_consul_rpc_duration_seconds{call="services",endpoint="catalog",quantile="0.9"} NaN prometheus_sd_consul_rpc_duration_seconds{call="services",endpoint="catalog",quantile="0.99"} NaN prometheus_sd_consul_rpc_duration_seconds_sum{call="services",endpoint="catalog"} 0 prometheus_sd_consul_rpc_duration_seconds_count{call="services",endpoint="catalog"} 0 # HELP prometheus_sd_consul_rpc_failures_total The number of Consul RPC call failures. # TYPE prometheus_sd_consul_rpc_failures_total counter prometheus_sd_consul_rpc_failures_total 0 # HELP prometheus_sd_consulagent_rpc_duration_seconds The duration of a Consul Agent RPC call in seconds. # TYPE prometheus_sd_consulagent_rpc_duration_seconds summary prometheus_sd_consulagent_rpc_duration_seconds{call="service",endpoint="agent",quantile="0.5"} NaN prometheus_sd_consulagent_rpc_duration_seconds{call="service",endpoint="agent",quantile="0.9"} NaN prometheus_sd_consulagent_rpc_duration_seconds{call="service",endpoint="agent",quantile="0.99"} NaN prometheus_sd_consulagent_rpc_duration_seconds_sum{call="service",endpoint="agent"} 0 prometheus_sd_consulagent_rpc_duration_seconds_count{call="service",endpoint="agent"} 0 prometheus_sd_consulagent_rpc_duration_seconds{call="services",endpoint="agent",quantile="0.5"} NaN prometheus_sd_consulagent_rpc_duration_seconds{call="services",endpoint="agent",quantile="0.9"} NaN prometheus_sd_consulagent_rpc_duration_seconds{call="services",endpoint="agent",quantile="0.99"} NaN prometheus_sd_consulagent_rpc_duration_seconds_sum{call="services",endpoint="agent"} 0 prometheus_sd_consulagent_rpc_duration_seconds_count{call="services",endpoint="agent"} 0 # HELP prometheus_sd_consulagent_rpc_failures_total The number of Consul Agent RPC call failures. # TYPE prometheus_sd_consulagent_rpc_failures_total counter prometheus_sd_consulagent_rpc_failures_total 0 # HELP prometheus_sd_discovered_targets Current number of discovered targets. # TYPE prometheus_sd_discovered_targets gauge prometheus_sd_discovered_targets{config="apps",name="scrape"} 15 prometheus_sd_discovered_targets{config="default",name=""} 1 # HELP prometheus_sd_dns_lookup_failures_total The number of DNS-SD lookup failures. # TYPE prometheus_sd_dns_lookup_failures_total counter prometheus_sd_dns_lookup_failures_total 0 # HELP prometheus_sd_dns_lookups_total The number of DNS-SD lookups. # TYPE prometheus_sd_dns_lookups_total counter prometheus_sd_dns_lookups_total 0 # HELP prometheus_sd_failed_configs Current number of service discovery configurations that failed to load. # TYPE prometheus_sd_failed_configs gauge prometheus_sd_failed_configs{name=""} 0 prometheus_sd_failed_configs{name="scrape"} 0 # HELP prometheus_sd_file_read_errors_total The number of File-SD read errors. 
# TYPE prometheus_sd_file_read_errors_total counter prometheus_sd_file_read_errors_total 0 # HELP prometheus_sd_file_scan_duration_seconds The duration of the File-SD scan in seconds. # TYPE prometheus_sd_file_scan_duration_seconds summary prometheus_sd_file_scan_duration_seconds{quantile="0.5"} NaN prometheus_sd_file_scan_duration_seconds{quantile="0.9"} NaN prometheus_sd_file_scan_duration_seconds{quantile="0.99"} NaN prometheus_sd_file_scan_duration_seconds_sum 0 prometheus_sd_file_scan_duration_seconds_count 0 # HELP prometheus_sd_http_failures_total Number of HTTP service discovery refresh failures. # TYPE prometheus_sd_http_failures_total counter prometheus_sd_http_failures_total 0 # HELP prometheus_sd_kubernetes_events_total The number of Kubernetes events handled. # TYPE prometheus_sd_kubernetes_events_total counter prometheus_sd_kubernetes_events_total{event="add",role="endpoints"} 0 prometheus_sd_kubernetes_events_total{event="add",role="endpointslice"} 0 prometheus_sd_kubernetes_events_total{event="add",role="ingress"} 0 prometheus_sd_kubernetes_events_total{event="add",role="node"} 0 prometheus_sd_kubernetes_events_total{event="add",role="pod"} 0 prometheus_sd_kubernetes_events_total{event="add",role="service"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="endpoints"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="endpointslice"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="ingress"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="node"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="pod"} 0 prometheus_sd_kubernetes_events_total{event="delete",role="service"} 0 prometheus_sd_kubernetes_events_total{event="update",role="endpoints"} 0 prometheus_sd_kubernetes_events_total{event="update",role="endpointslice"} 0 prometheus_sd_kubernetes_events_total{event="update",role="ingress"} 0 prometheus_sd_kubernetes_events_total{event="update",role="node"} 0 prometheus_sd_kubernetes_events_total{event="update",role="pod"} 0 prometheus_sd_kubernetes_events_total{event="update",role="service"} 0 # HELP prometheus_sd_kuma_fetch_duration_seconds The duration of a Kuma MADS fetch call. # TYPE prometheus_sd_kuma_fetch_duration_seconds summary prometheus_sd_kuma_fetch_duration_seconds{quantile="0.5"} NaN prometheus_sd_kuma_fetch_duration_seconds{quantile="0.9"} NaN prometheus_sd_kuma_fetch_duration_seconds{quantile="0.99"} NaN prometheus_sd_kuma_fetch_duration_seconds_sum 0 prometheus_sd_kuma_fetch_duration_seconds_count 0 # HELP prometheus_sd_kuma_fetch_failures_total The number of Kuma MADS fetch call failures. # TYPE prometheus_sd_kuma_fetch_failures_total counter prometheus_sd_kuma_fetch_failures_total 0 # HELP prometheus_sd_kuma_fetch_skipped_updates_total The number of Kuma MADS fetch calls that result in no updates to the targets. # TYPE prometheus_sd_kuma_fetch_skipped_updates_total counter prometheus_sd_kuma_fetch_skipped_updates_total 0 # HELP prometheus_sd_linode_failures_total Number of Linode service discovery refresh failures. # TYPE prometheus_sd_linode_failures_total counter prometheus_sd_linode_failures_total 0 # HELP prometheus_sd_nomad_failures_total Number of nomad service discovery refresh failures. # TYPE prometheus_sd_nomad_failures_total counter prometheus_sd_nomad_failures_total 0 # HELP prometheus_sd_received_updates_total Total number of update events received from the SD providers. 
# TYPE prometheus_sd_received_updates_total counter prometheus_sd_received_updates_total{name=""} 2 prometheus_sd_received_updates_total{name="scrape"} 2 # HELP prometheus_sd_updates_total Total number of update events sent to the SD consumers. # TYPE prometheus_sd_updates_total counter prometheus_sd_updates_total{name=""} 1 prometheus_sd_updates_total{name="scrape"} 1 # HELP prometheus_target_interval_length_seconds Actual intervals between scrapes. # TYPE prometheus_target_interval_length_seconds summary prometheus_target_interval_length_seconds{interval="5s",quantile="0.01"} 4.996479052 prometheus_target_interval_length_seconds{interval="5s",quantile="0.05"} 4.999216524 prometheus_target_interval_length_seconds{interval="5s",quantile="0.5"} 5.000034306 prometheus_target_interval_length_seconds{interval="5s",quantile="0.9"} 5.000669233 prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.003766467 prometheus_target_interval_length_seconds_sum{interval="5s"} 400745.76176225476 prometheus_target_interval_length_seconds_count{interval="5s"} 80149 # HELP prometheus_target_metadata_cache_bytes The number of bytes that are currently used for storing metric metadata in the cache # TYPE prometheus_target_metadata_cache_bytes gauge prometheus_target_metadata_cache_bytes{scrape_job="apps"} 50947 # HELP prometheus_target_metadata_cache_entries Total number of metric metadata entries in the cache # TYPE prometheus_target_metadata_cache_entries gauge prometheus_target_metadata_cache_entries{scrape_job="apps"} 1016 # HELP prometheus_target_scrape_pool_exceeded_label_limits_total Total number of times scrape pools hit the label limits, during sync or config reload. # TYPE prometheus_target_scrape_pool_exceeded_label_limits_total counter prometheus_target_scrape_pool_exceeded_label_limits_total 0 # HELP prometheus_target_scrape_pool_exceeded_target_limit_total Total number of times scrape pools hit the target limit, during sync or config reload. # TYPE prometheus_target_scrape_pool_exceeded_target_limit_total counter prometheus_target_scrape_pool_exceeded_target_limit_total 0 # HELP prometheus_target_scrape_pool_reloads_failed_total Total number of failed scrape pool reloads. # TYPE prometheus_target_scrape_pool_reloads_failed_total counter prometheus_target_scrape_pool_reloads_failed_total 0 # HELP prometheus_target_scrape_pool_reloads_total Total number of scrape pool reloads. # TYPE prometheus_target_scrape_pool_reloads_total counter prometheus_target_scrape_pool_reloads_total 0 # HELP prometheus_target_scrape_pool_sync_total Total number of syncs that were executed on a scrape pool. # TYPE prometheus_target_scrape_pool_sync_total counter prometheus_target_scrape_pool_sync_total{scrape_job="apps"} 1 # HELP prometheus_target_scrape_pool_targets Current number of targets in this scrape pool. # TYPE prometheus_target_scrape_pool_targets gauge prometheus_target_scrape_pool_targets{scrape_job="apps"} 15 # HELP prometheus_target_scrape_pools_failed_total Total number of scrape pool creations that failed. # TYPE prometheus_target_scrape_pools_failed_total counter prometheus_target_scrape_pools_failed_total 0 # HELP prometheus_target_scrape_pools_total Total number of scrape pool creation attempts. # TYPE prometheus_target_scrape_pools_total counter prometheus_target_scrape_pools_total 1 # HELP prometheus_target_scrapes_cache_flush_forced_total How many times a scrape cache was flushed due to getting big while scrapes are failing. 
# TYPE prometheus_target_scrapes_cache_flush_forced_total counter
prometheus_target_scrapes_cache_flush_forced_total 0
# HELP prometheus_target_scrapes_exceeded_body_size_limit_total Total number of scrapes that hit the body size limit
# TYPE prometheus_target_scrapes_exceeded_body_size_limit_total counter
prometheus_target_scrapes_exceeded_body_size_limit_total 0
# HELP prometheus_target_scrapes_exceeded_sample_limit_total Total number of scrapes that hit the sample limit and were rejected.
# TYPE prometheus_target_scrapes_exceeded_sample_limit_total counter
prometheus_target_scrapes_exceeded_sample_limit_total 0
# HELP prometheus_target_scrapes_exemplar_out_of_order_total Total number of exemplar rejected due to not being out of the expected order.
# TYPE prometheus_target_scrapes_exemplar_out_of_order_total counter
prometheus_target_scrapes_exemplar_out_of_order_total 0
# HELP prometheus_target_scrapes_sample_duplicate_timestamp_total Total number of samples rejected due to duplicate timestamps but different values.
# TYPE prometheus_target_scrapes_sample_duplicate_timestamp_total counter
prometheus_target_scrapes_sample_duplicate_timestamp_total 0
# HELP prometheus_target_scrapes_sample_out_of_bounds_total Total number of samples rejected due to timestamp falling outside of the time bounds.
# TYPE prometheus_target_scrapes_sample_out_of_bounds_total counter
prometheus_target_scrapes_sample_out_of_bounds_total 0
# HELP prometheus_target_scrapes_sample_out_of_order_total Total number of samples rejected due to not being out of the expected order.
# TYPE prometheus_target_scrapes_sample_out_of_order_total counter
prometheus_target_scrapes_sample_out_of_order_total 0
# HELP prometheus_target_sync_failed_total Total number of target sync failures.
# TYPE prometheus_target_sync_failed_total counter
prometheus_target_sync_failed_total{scrape_job="apps"} 0
# HELP prometheus_target_sync_length_seconds Actual interval to sync the scrape pool.
# TYPE prometheus_target_sync_length_seconds summary
prometheus_target_sync_length_seconds{scrape_job="apps",quantile="0.01"} NaN
prometheus_target_sync_length_seconds{scrape_job="apps",quantile="0.05"} NaN
prometheus_target_sync_length_seconds{scrape_job="apps",quantile="0.5"} NaN
prometheus_target_sync_length_seconds{scrape_job="apps",quantile="0.9"} NaN
prometheus_target_sync_length_seconds{scrape_job="apps",quantile="0.99"} NaN
prometheus_target_sync_length_seconds_sum{scrape_job="apps"} 0.00056469
prometheus_target_sync_length_seconds_count{scrape_job="apps"} 1
# HELP prometheus_template_text_expansion_failures_total The total number of template text expansion failures.
# TYPE prometheus_template_text_expansion_failures_total counter
prometheus_template_text_expansion_failures_total 0
# HELP prometheus_template_text_expansions_total The total number of template text expansions.
# TYPE prometheus_template_text_expansions_total counter
prometheus_template_text_expansions_total 0
# HELP prometheus_treecache_watcher_goroutines The current number of watcher goroutines.
# TYPE prometheus_treecache_watcher_goroutines gauge
prometheus_treecache_watcher_goroutines 0
# HELP prometheus_treecache_zookeeper_failures_total The total number of ZooKeeper failures.
# TYPE prometheus_treecache_zookeeper_failures_total counter
prometheus_treecache_zookeeper_failures_total 0
# HELP prometheus_tsdb_wal_completed_pages_total Total number of completed pages.
# TYPE prometheus_tsdb_wal_completed_pages_total counter
prometheus_tsdb_wal_completed_pages_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 2944
# HELP prometheus_tsdb_wal_fsync_duration_seconds Duration of write log fsync.
# TYPE prometheus_tsdb_wal_fsync_duration_seconds summary
prometheus_tsdb_wal_fsync_duration_seconds{instance_group_name="0928abe60ab577f09352172478cfb1eb",quantile="0.5"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{instance_group_name="0928abe60ab577f09352172478cfb1eb",quantile="0.9"} NaN
prometheus_tsdb_wal_fsync_duration_seconds{instance_group_name="0928abe60ab577f09352172478cfb1eb",quantile="0.99"} NaN
prometheus_tsdb_wal_fsync_duration_seconds_sum{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0.05767714799999999
prometheus_tsdb_wal_fsync_duration_seconds_count{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 7
# HELP prometheus_tsdb_wal_page_flushes_total Total number of page flushes.
# TYPE prometheus_tsdb_wal_page_flushes_total counter
prometheus_tsdb_wal_page_flushes_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 83115
# HELP prometheus_tsdb_wal_segment_current Write log segment index that TSDB is currently writing to.
# TYPE prometheus_tsdb_wal_segment_current gauge
prometheus_tsdb_wal_segment_current{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 124
# HELP prometheus_tsdb_wal_truncations_failed_total Total number of write log truncations that failed.
# TYPE prometheus_tsdb_wal_truncations_failed_total counter
prometheus_tsdb_wal_truncations_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0
# HELP prometheus_tsdb_wal_truncations_total Total number of write log truncations attempted.
# TYPE prometheus_tsdb_wal_truncations_total counter
prometheus_tsdb_wal_truncations_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 4
# HELP prometheus_tsdb_wal_writes_failed_total Total number of write log writes that failed.
# TYPE prometheus_tsdb_wal_writes_failed_total counter
prometheus_tsdb_wal_writes_failed_total{instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0
# HELP prometheus_wal_watcher_current_segment Current segment the WAL watcher is reading records from.
# TYPE prometheus_wal_watcher_current_segment gauge
prometheus_wal_watcher_current_segment{consumer="0928ab-4a5a87",instance_group_name="0928abe60ab577f09352172478cfb1eb"} 124
# HELP prometheus_wal_watcher_record_decode_failures_total Number of records read by the WAL watcher that resulted in an error when decoding.
# TYPE prometheus_wal_watcher_record_decode_failures_total counter
prometheus_wal_watcher_record_decode_failures_total{consumer="0928ab-4a5a87",instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0
# HELP prometheus_wal_watcher_records_read_total Number of records read by the WAL watcher from the WAL.
# TYPE prometheus_wal_watcher_records_read_total counter
prometheus_wal_watcher_records_read_total{consumer="0928ab-4a5a87",instance_group_name="0928abe60ab577f09352172478cfb1eb",type="samples"} 112168
prometheus_wal_watcher_records_read_total{consumer="0928ab-4a5a87",instance_group_name="0928abe60ab577f09352172478cfb1eb",type="series"} 467
# HELP prometheus_wal_watcher_samples_sent_pre_tailing_total Number of sample records read by the WAL watcher and sent to remote write during replay of existing WAL.
# TYPE prometheus_wal_watcher_samples_sent_pre_tailing_total counter
prometheus_wal_watcher_samples_sent_pre_tailing_total{consumer="0928ab-4a5a87",instance_group_name="0928abe60ab577f09352172478cfb1eb"} 0
# HELP promtail_batch_retries_total Number of times batches has had to be retried.
# TYPE promtail_batch_retries_total counter
promtail_batch_retries_total{host="monitoring_grafana-loki:3100",logs_config="default",tenant=""} 0
# HELP promtail_config_reload_fail_total Number of reload fail times.
# TYPE promtail_config_reload_fail_total counter
promtail_config_reload_fail_total{logs_config="default"} 0
# HELP promtail_config_reload_success_total Number of reload success times.
# TYPE promtail_config_reload_success_total counter
promtail_config_reload_success_total{logs_config="default"} 0
# HELP promtail_dropped_bytes_total Number of bytes dropped because failed to be sent to the ingester after all retries.
# TYPE promtail_dropped_bytes_total counter
promtail_dropped_bytes_total{host="monitoring_grafana-loki:3100",logs_config="default",tenant=""} 0
# HELP promtail_dropped_entries_total Number of log entries dropped because failed to be sent to the ingester after all retries.
# TYPE promtail_dropped_entries_total counter
promtail_dropped_entries_total{host="monitoring_grafana-loki:3100",logs_config="default",tenant=""} 0
# HELP promtail_encoded_bytes_total Number of bytes encoded and ready to send.
# TYPE promtail_encoded_bytes_total counter
promtail_encoded_bytes_total{host="monitoring_grafana-loki:3100",logs_config="default"} 8.7826793e+07
# HELP promtail_file_bytes_total Number of bytes total.
# TYPE promtail_file_bytes_total gauge
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/A-service/A-service-1"} 1.445682e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/B-service/B-service-1"} 935881
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/C-service/C-service-1"} 258720
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/config-server/config-server-1"} 18802
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/D-service/D-service-1"} 1.245576e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/E-service/E-service-1"} 1.509905e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/F-service/F-service-1"} 158772
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/G-service/G-service-1"} 1.733845e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/H-service/H-service-1"} 159454
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/I-service/I-service-1"} 1243
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/J-service/J-service-1"} 1.668629e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/K-service/K-service-1"} 618515
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/L-service/L-service-1"} 23558
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/M-service/M-service-1"} 1.43016e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/N-service/N-service-1"} 33570
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/O-service/O-service-1"} 443180
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/P-service/P-service-1"} 2.01473e+06
promtail_file_bytes_total{logs_config="default",path="/tmp/logs/Q-service/Q-service-1"} 9368
# HELP promtail_files_active_total Number of active files.
# TYPE promtail_files_active_total gauge
promtail_files_active_total{logs_config="default"} 18
# HELP promtail_read_bytes_total Number of bytes read.
# TYPE promtail_read_bytes_total gauge
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/A-service/A-service-1"} 1.445682e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/B-service/B-service-1"} 935881
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/C-service/C-service-1"} 258720
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/config-server/config-server-1"} 18802
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/D-service/D-service-1"} 1.245576e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/E-service/E-service-1"} 1.509905e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/F-service/F-service-1"} 158772
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/G-service/G-service-1"} 1.733845e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/H-service/H-service-1"} 159454
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/I-service/I-service-1"} 1243
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/J-service/J-service-1"} 1.668629e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/K-service/K-service-1"} 618515
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/L-service/L-service-1"} 23558
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/M-service/M-service-1"} 1.43016e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/N-service/N-service-1"} 33570
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/O-service/O-service-1"} 443180
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/P-service/P-service-1"} 2.01473e+06
promtail_read_bytes_total{logs_config="default",path="/tmp/logs/Q-service/Q-service-1"} 9368
# HELP promtail_read_lines_total Number of lines read.
# TYPE promtail_read_lines_total counter
promtail_read_lines_total{logs_config="default",path="/tmp/logs/A-service/A-service-1"} 142
promtail_read_lines_total{logs_config="default",path="/tmp/logs/B-service/B-service-1"} 1931
promtail_read_lines_total{logs_config="default",path="/tmp/logs/C-service/C-service-1"} 218
promtail_read_lines_total{logs_config="default",path="/tmp/logs/config-server/config-server-1"} 66
promtail_read_lines_total{logs_config="default",path="/tmp/logs/D-service/D-service-1"} 675
promtail_read_lines_total{logs_config="default",path="/tmp/logs/E-service/E-service-1"} 719
promtail_read_lines_total{logs_config="default",path="/tmp/logs/F-service/F-service-1"} 193
promtail_read_lines_total{logs_config="default",path="/tmp/logs/G-service/G-service-1"} 588
promtail_read_lines_total{logs_config="default",path="/tmp/logs/H-service/H-service-1"} 73
promtail_read_lines_total{logs_config="default",path="/tmp/logs/J-service/J-service-1"} 2623
promtail_read_lines_total{logs_config="default",path="/tmp/logs/K-service/K-service-1"} 299
promtail_read_lines_total{logs_config="default",path="/tmp/logs/L-service/L-service-1"} 51
promtail_read_lines_total{logs_config="default",path="/tmp/logs/M-service/M-service-1"} 3582
promtail_read_lines_total{logs_config="default",path="/tmp/logs/N-service/N-service-1"} 49
promtail_read_lines_total{logs_config="default",path="/tmp/logs/O-service/O-service-1"} 87
promtail_read_lines_total{logs_config="default",path="/tmp/logs/P-service/P-service-1"} 952
promtail_read_lines_total{logs_config="default",path="/tmp/logs/Q-service/Q-service-1"} 45
# HELP promtail_request_duration_seconds Duration of send requests.
# TYPE promtail_request_duration_seconds histogram
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.005"} 5460
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.01"} 6173
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.025"} 6439
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.05"} 6479
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.1"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.25"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="0.5"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="1"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="2.5"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="5"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="10"} 6486
promtail_request_duration_seconds_bucket{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204",le="+Inf"} 6486
promtail_request_duration_seconds_sum{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204"} 21.424610954999892
promtail_request_duration_seconds_count{host="monitoring_grafana-loki:3100",logs_config="default",status_code="204"} 6486
# HELP promtail_sent_bytes_total Number of bytes sent.
# TYPE promtail_sent_bytes_total counter
promtail_sent_bytes_total{host="monitoring_grafana-loki:3100",logs_config="default"} 8.7826793e+07
# HELP promtail_sent_entries_total Number of log entries sent to the ingester.
# TYPE promtail_sent_entries_total counter
promtail_sent_entries_total{host="monitoring_grafana-loki:3100",logs_config="default"} 172663
# HELP promtail_stream_lag_seconds Difference between current time and last batch timestamp for successful sends
# TYPE promtail_stream_lag_seconds gauge
promtail_stream_lag_seconds{client="aaa468",host="monitoring_grafana-loki:3100",logs_config="default"} 1.029602911
# HELP promtail_targets_active_total Number of active total.
# TYPE promtail_targets_active_total gauge
promtail_targets_active_total{logs_config="default"} 1
# HELP traces_exporter_enqueue_failed_log_records Number of log records failed to be added to the sending queue.
# TYPE traces_exporter_enqueue_failed_log_records counter
traces_exporter_enqueue_failed_log_records{exporter="otlp/0",traces_config="default"} 0
# HELP traces_exporter_enqueue_failed_metric_points Number of metric points failed to be added to the sending queue.
# TYPE traces_exporter_enqueue_failed_metric_points counter
traces_exporter_enqueue_failed_metric_points{exporter="otlp/0",traces_config="default"} 0
# HELP traces_exporter_enqueue_failed_spans Number of spans failed to be added to the sending queue.
# TYPE traces_exporter_enqueue_failed_spans counter
traces_exporter_enqueue_failed_spans{exporter="otlp/0",traces_config="default"} 0
# HELP traces_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE traces_exporter_queue_capacity gauge
traces_exporter_queue_capacity{exporter="otlp/0",traces_config="default"} 5000
# HELP traces_exporter_queue_size Current size of the retry queue (in batches)
# TYPE traces_exporter_queue_size gauge
traces_exporter_queue_size{exporter="otlp/0",traces_config="default"} 0
# HELP traces_exporter_send_failed_requests number of times exporters failed to send requests to the destination
# TYPE traces_exporter_send_failed_requests counter
traces_exporter_send_failed_requests{traces_config="default"} 126
# HELP traces_exporter_send_failed_spans Number of spans in failed attempts to send to destination.
# TYPE traces_exporter_send_failed_spans counter
traces_exporter_send_failed_spans{exporter="otlp/0",traces_config="default"} 624650
# HELP traces_exporter_sent_spans Number of spans successfully sent to destination.
# TYPE traces_exporter_sent_spans counter
traces_exporter_sent_spans{exporter="otlp/0",traces_config="default"} 1.072098e+06
# HELP traces_receiver_accepted_spans Number of spans successfully pushed into the pipeline.
# TYPE traces_receiver_accepted_spans counter
traces_receiver_accepted_spans{receiver="otlp",traces_config="default",transport="grpc"} 1.696854e+06
# HELP traces_receiver_refused_spans Number of spans that could not be pushed into the pipeline.
# TYPE traces_receiver_refused_spans counter
traces_receiver_refused_spans{receiver="otlp",traces_config="default",transport="grpc"} 0
```

I masked the service names (sorry about that, it's from a live environment) but left the actual metric values.

The problem I described mostly affects the service labelled "A-service" there, which may matter: it generates far more logs than the others and rotates its log files every 10-15 minutes. However, other services sometimes hit the same problem.

Log rotation is done by the Logback library's RollingFileAppender with a SizeAndTimeBasedRollingPolicy (it rolls on either file size or time, whichever comes first). The rotated logs are moved to a separate folder in tar.gz format.
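For reference, a minimal sketch of this kind of Logback setup (the appender and policy class names are real Logback classes, but the paths, size limits, and retention values here are illustrative assumptions, not my actual config):

```xml
<!-- Hypothetical logback.xml fragment; paths and limits are illustrative only. -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <file>/tmp/logs/my-service/my-service-1</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <!-- Rolls when the active file reaches maxFileSize or at each %d period boundary;
         the .gz suffix makes Logback compress the rolled file. -->
    <fileNamePattern>/tmp/logs/my-service/archived/my-service.%d{yyyy-MM-dd}.%i.log.gz</fileNamePattern>
    <maxFileSize>100MB</maxFileSize>
    <maxHistory>7</maxHistory>
    <totalSizeCap>2GB</totalSizeCap>
  </rollingPolicy>
  <encoder>
    <pattern>%d{ISO8601} %-5level [%thread] %logger{36} - %msg%n</pattern>
  </encoder>
</appender>
```

Each roll replaces the active file at the tailed path, which is exactly the filesystem event the agent's tailer has to survive.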

Please tell me if I need to provide something else.

thampiotr commented 1 year ago

Thanks @amseager, this was very helpful. A few follow-up questions:

  1. When you get the "skipping update of position..." logs every 10 seconds, are they all for the same file path? I can see the example you provided in your original description, but since you had to redact the service names, I was wondering whether they were perhaps all different files originally.
  2. Can you provide more Agent logs (redacted is absolutely fine) that contain component=logs? Specifically for the period just before and right after the "skipping update of position..." message. This should give me some information about what filesystem operations the Agent sees (if any).
  3. Just to re-confirm: a freshly started Agent doesn't log this initially (even in the presence of log rotations), but after it has been running for a longer time and has experienced more log rotations, this starts happening every 10s or so?
thampiotr commented 1 year ago

@amseager one more request if possible: can you include a goroutine dump of the agent? It can be obtained via http://localhost:12345/debug/pprof/goroutine?debug=1.
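Something like the following should work, assuming the agent's HTTP server is on the default port 12345 (the grep pattern is only a rough heuristic based on the tailer log lines above, not a stable interface):

```sh
# Save the goroutine dump, then count stack-trace lines that mention the
# tailer package -- a rough proxy for how many tailer goroutines are alive.
curl -s 'http://localhost:12345/debug/pprof/goroutine?debug=1' > goroutines.txt
grep -c 'tailer' goroutines.txt
```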

amseager commented 1 year ago

@thampiotr

Please check these agent logs covering about 10 minutes: https://gist.github.com/amseager/621ec009f4d239d205ae5e19e148ad7d

Goroutine dump: https://gist.github.com/amseager/e37f31984facd9cbc51ce0f2153d6864

1 - They can be for the same path and for different ones at the same time. For example, a single "10-second iteration" may produce 10 lines for A-service, 1 for B-service, and so on.

2 - Apart from unrelated entries, I sometimes see errors like the following, which may be related:

ts=2023-05-19T14:49:42.756314281Z caller=tailer.go:159 level=info component=logs logs_config=default component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/tmp/logs/A-service/A-service-1 reason="Error reading /tmp/logs/A-service/A-service-1: read /tmp/logs/A-service/A-service-1: stale NFS file handle"

3 - Correct: if I restart the agent, it works well for some time, then a small number of these errors starts to appear (maybe after the first rotations, it's hard to tell), and then there are more and more of them. The logs above are taken from an agent that had been running for ~3 days. Although there are tons of these messages within a short period, I still don't see any issues such as lost logs in Loki or resource exhaustion.

thampiotr commented 1 year ago

Thanks @amseager for sharing all the info. I found a goroutine leak, and it is likely related to the stale NFS file handle error you saw. I still need to figure out exactly where the fix is needed, but we're getting closer.

While at it, I'll make sure we port https://github.com/grafana/tail/pull/16 too.

thampiotr commented 1 year ago

Reopening, since there are still affected code paths in the agent, and PRs to update the dependencies are needed.

rfratto commented 1 year ago

This is currently blocked by #3660.

thampiotr commented 1 year ago

Some more context: in order to fix this in agent static mode, we need to update some Promtail dependencies from Loki. Doing so creates dependency conflicts with the OTel Collector, so the OTel Collector update has to happen first.

amseager commented 1 year ago

Thanks for the update. Do you know whether it's safe to ignore the problem, or whether any actions/workarounds are needed before the fix is released? For now I've reduced the log volume for some apps, and I also restart the agent every few days.

thampiotr commented 1 year ago

This is caused by a goroutine leak, which can result in excessive, unnecessary logging (as you have noticed), growing memory usage of the agent over time (though it shouldn't grow too fast), and some additional context switching. I'd say it is safe to ignore until the symptoms get too bad, at which point a restart is a good remedy.
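If you want to keep an eye on it, one rough check is to watch the agent's own Go runtime metrics; this sketch assumes the default HTTP endpoint on port 12345 and that the standard Go collector metrics are exposed there:

```sh
# Poll the agent's /metrics once a minute and print the goroutine count and
# resident memory; steady growth over days suggests the leak is accumulating.
watch -n 60 "curl -s http://localhost:12345/metrics | grep -E '^(go_goroutines|process_resident_memory_bytes) '"
```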

Apologies for the hassle caused by this and thanks for reporting the problem!

thampiotr commented 1 year ago

This is now fixed in main by https://github.com/grafana/agent/pull/3942, with https://github.com/grafana/agent/pull/4565 adding the release note. It should be released in v0.36.0; it was blocked by upgrading our Prometheus & OTel dependencies, hence the delay.

Upanshu11 commented 10 months ago

Hello, I am facing the same issue in Grafana Agent v0.36.2 as well. The Grafana Agent Operator version is v0.34.1, and I am running the agent in Kubernetes (AKS).

I'm getting the logs below:

ts=2024-01-11T02:39:05.911958356Z caller=filetargetmanager.go:181 level=info component=logs logs_config=random-shared/kubernetes-pod-logs msg="received file watcher event" name=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log op=CREATE
ts=2024-01-11T02:39:05.912455361Z caller=tailer.go:145 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: started" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log
ts=2024-01-11T02:39:05.912477961Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Seeked /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/destination/6736.log - &{Offset:0 Whence:0}"
ts=2024-01-11T02:39:06.93727296Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Re-opening moved/deleted file /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log ..."
ts=2024-01-11T02:39:06.937438161Z caller=log.go:168 component=logs logs_config=random-shared/kubernetes-pod-logs level=info msg="Waiting for /var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log to appear..."
ts=2024-01-11T02:39:08.979770282Z caller=tailer.go:206 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="skipping update of position for a file which does not currently exist" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135688578Z caller=tailer.go:206 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="skipping update of position for a file which does not currently exist" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135755578Z caller=tailer.go:163 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: tail channel closed, stopping tailer" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log reason=null
ts=2024-01-11T02:39:10.135789979Z caller=tailer.go:154 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="tail routine: exited" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135809179Z caller=tailer.go:118 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="position timer: exited" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log
ts=2024-01-11T02:39:10.135822179Z caller=tailer.go:242 level=info component=logs logs_config=random-shared/kubernetes-pod-logs component=tailer msg="stopped tailing file" path=/var/log/pods/linkerd_linkerd-destination-5d948754fc-tc966_88b95f24-dc55-4fce-8284-54152c112051/sp-validator/6730.log

Should I update the operator as well?

Upanshu11 commented 9 months ago

I'm facing the same issue in grafana/agent v0.39.0 as well, after upgrading. @thampiotr any help?

supergillis commented 9 months ago

Same issue here on grafana/agent:v0.39.1.

Upanshu11 commented 9 months ago

Hey @tpaschalis any help?

rfratto commented 9 months ago

Hey folks, we don't monitor closed issues or PRs. I happened to notice this in my notifications list, but it's not a reliable way of getting attention.

If you need help for something which is closed and want to make sure it gets seen, please open a new issue or PR.