nc-gcz opened 2 weeks ago
Same here, but I'm using the official Grafana tempo-distributed Helm chart:
cache:
  caches:
  - memcached:
      consistent_hash: true
      host: 'test-tempo-memcached'
      service: memcached-client
      timeout: 500ms
    roles:
    - parquet-footer
    - bloom
    - frontend-search
compactor:
  compaction:
    block_retention: 72h
    compacted_block_retention: 1h
    compaction_cycle: 30s
    compaction_window: 1h
    max_block_bytes: 107374182400
    max_compaction_objects: 6000000
    max_time_per_tenant: 5m
    retention_concurrency: 10
    v2_in_buffer_bytes: 5242880
    v2_out_buffer_bytes: 20971520
    v2_prefetch_traces_count: 1000
  ring:
    kvstore:
      store: memberlist
distributor:
  log_received_spans:
    enabled: true
    filter_by_status_error: true
    include_all_attributes: true
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
          max_recv_msg_size_mib: 4194304
        http:
          endpoint: 0.0.0.0:4318
  ring:
    kvstore:
      store: memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    tokens_file_path: /var/tempo/tokens.json
memberlist:
  abort_if_cluster_join_fails: false
  bind_addr: []
  bind_port: 7946
  gossip_interval: 1s
  gossip_nodes: 2
  gossip_to_dead_nodes_time: 30s
  join_members:
  - dns+test-tempo-gossip-ring:7946
  leave_timeout: 5s
  left_ingesters_timeout: 5m
  max_join_backoff: 1m
  max_join_retries: 10
  min_join_backoff: 1s
  node_name: ""
  packet_dial_timeout: 5s
  packet_write_timeout: 5s
  pull_push_interval: 30s
  randomize_node_name: true
  rejoin_interval: 0s
  retransmit_factor: 2
  stream_timeout: 10s
metrics_generator:
  metrics_ingestion_time_range_slack: 30s
  processor:
    service_graphs:
      dimensions: []
      enable_client_server_prefix: true
      enable_messaging_system_latency_histogram: true
      enable_virtual_node_label: true
      histogram_buckets: []
      max_items: 10000
      wait: 10s
      workers: 10
    span_metrics:
      dimensions: []
      histogram_buckets: []
  registry:
    collection_interval: 15s
    external_labels:
      source: tempo
    stale_duration: 15m
  ring:
    kvstore:
      store: memberlist
  storage:
    path: /var/tempo/generator/wal
    remote_write:
    - headers:
        X-Scope-OrgID: test
      send_exemplars: true
      url: http://observability-mimir-distributor:8080/api/v1/push
    remote_write_add_org_id_header: false
    remote_write_flush_deadline: 1m
  traces_storage:
    path: /var/tempo/generator/traces
multitenancy_enabled: true
overrides:
  per_tenant_override_config: /runtime-config/overrides.yaml
querier:
  frontend_worker:
    frontend_address: test-tempo-query-frontend-discovery:9095
    grpc_client_config:
      max_send_msg_size: 4194304
  max_concurrent_queries: 20
  search:
    external_backend: null
    external_endpoints: []
    external_hedge_requests_at: 8s
    external_hedge_requests_up_to: 2
    prefer_self: 10
    query_timeout: 30s
  trace_by_id:
    query_timeout: 10s
query_frontend:
  max_outstanding_per_tenant: 2000
  max_retries: 2
  search:
    concurrent_jobs: 1000
    target_bytes_per_job: 104857600
  trace_by_id:
    query_shards: 50
server:
  grpc_server_max_recv_msg_size: 4194304
  grpc_server_max_send_msg_size: 4194304
  http_listen_port: 3100
  http_server_read_timeout: 30s
  http_server_write_timeout: 30s
  log_format: logfmt
  log_level: info
storage:
  trace:
    backend: s3
    blocklist_poll: 5m
    local:
      path: /var/tempo/traces
    pool:
      max_workers: 400
      queue_depth: 20000
    s3:
      access_key: 123
      bucket: tempo
      endpoint: observability-minio:9000
      forcepathstyle: true
      insecure: false
      secret_key: 123
      tls_insecure_skip_verify: true
    wal:
      path: /var/tempo/wal
usage_report:
  reporting_enabled: false
I'm not sure if this is a limitation of the distributed mode. When I tried it with Docker Compose, it worked because the metrics generator and ingester read and write data to the same local path. However, in a Kubernetes environment in distributed mode, all components run as separate containers with separate volumes, and I'm not sure whether the metrics generator and ingester share the same path.
Hopefully someone from the Grafana team can enlighten us on this.
In our set-up, the ingester indeed has a persistent volume claim:
spec:
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data-grafana-tempo-ingester-0
But it doesn't seem to be mounted into metrics-generator.
Another thing that we've spotted when running the tests was the following behaviour:
level=info ts=2024-07-10T14:50:55.619034542Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=44556
level=info ts=2024-07-10T14:50:55.62145804Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=45901
level=info ts=2024-07-10T14:51:10.515918323Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-10T14:51:25.5159173Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
At some point during the tests (which in our case generate traffic across 30+ microservices) it basically stopped generating metrics. After I noticed that the next morning and restarted the metrics-generator, it stopped after printing the following log lines:
level=info ts=2024-07-11T12:42:06.162281707Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2024-07-11T12:42:06.162851639Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=2
level=info ts=2024-07-11T12:42:06.173897462Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=11.048845ms
level=info ts=2024-07-11T12:42:06.173946198Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=test-grafana-tempo-gossip-ring
level=info ts=2024-07-11T12:42:06.17426984Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-6cc769bdd7-zh89p ring=metrics-generator
level=info ts=2024-07-11T12:42:06.174649825Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-11T12:42:06.185328903Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=11.382705ms
Nothing is printed about even attempting to connect to the Prometheus instance, which in our case is completely idle, not restarting, working as expected, and always on the same URL.
Hey all, can you review these steps:
https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/
This will help us identify where things are being dropped.
However, in a Kubernetes environment under distributed mode, all components run as separate containers with separate volumes. I'm not sure if the metrics generator and ingester share the same path.
Generators and ingesters do not need shared access to block storage. They are fine to run independently.
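For context, the generator only needs a local directory of its own for the remote-write WAL; nothing there has to be shared with the ingesters. A trimmed sketch of the relevant part of the config posted above (the comment about the WAL's role is my own understanding, not from the original post):
metrics_generator:
  storage:
    # local WAL used only to buffer samples before remote write;
    # it does not need to be shared with any other component
    path: /var/tempo/generator/wal
    remote_write:
    - url: http://observability-mimir-distributor:8080/api/v1/push
      send_exemplars: true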
Hello @joe-elliott, thank you for reaching out.
As part of the troubleshooting, our Operations Team has reverted from Grafana Tempo 2.5.0 to 2.1.1. We had issues with the same behaviour on 2.5.0 that we thought would be fixed by downgrading to 2.1.1 (an older version we've run before), but it doesn't seem to help - we see the same behaviour on both versions.
I will follow the matching troubleshooting guide from here: https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/
- sum(increase(traces_service_graph_request_total[1m])) - empty result from the last 12 hours, no graph visible
- sum(increase(tempo_distributor_ingester_appends_total[1m])) - value of around 300-350 in the last 12 hours
- sum(increase(tempo_distributor_metrics_generator_pushes_total[1m])) - empty result from the last 12 hours, no graph visible
- sum(increase(tempo_distributor_metrics_generator_pushes_failures_total[1m])) - empty result from the last 12 hours, no graph visible
- rate(tempo_metrics_generator_registry_collections_failed_total[1m]) - empty result from the last 12 hours, no graph visible
Metrics like prometheus_remote_storage_samples_total, prometheus_remote_storage_bytes_total or e.g. prometheus_remote_storage_samples_failed_total are reporting empty results in Prometheus for the last 24 hours.
- sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])) - empty result from the last 12 hours, no graph visible
- sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])) - as above
- sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason) - as above
- sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])) - as above
- prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])) - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])) - as above
- sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])) - as above
It seems that metrics-generator is not sending any metrics at all, yet it's reporting itself as healthy.
It used to work: after a reinstall of the Helm chart, and after some time and tests, it stopped working, reporting active_series=0 for a prolonged time, only to be restarted and again not pick up or send any metrics to Prometheus. So it's not like it never works - it works for some time and then it stops :(
I've attached an example body of the metrics endpoint for metrics-generator that is missing all the metrics related to service graphs, span metrics and the internal metrics specified above (which are empty):
metrics-generator-example-metrics-endpoint-response-body.txt
Prometheus reports the target as being scraped correctly.
Also, we've restarted Grafana Tempo again - a clean deployment of Helm chart version 2.1.1.
What we can see now is that the active_series= log is present and, most importantly, so is the following log line:
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
When metrics-generator is idle, the component's log ends before this line, so the last log lines are:
level=info ts=2024-07-12T10:05:50.298578151Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=test-grafana-tempo-metrics-generator-589965b6b5-qdrvt ring=metrics-generator
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
Whereas for a working metrics-generator it looks like this:
level=info ts=2024-07-12T10:05:50.298862997Z caller=app.go:188 msg="Tempo started"
level=info ts=2024-07-12T10:05:50.30847801Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=4 elapsed_time=9.939381ms
level=info ts=2024-07-12T10:05:50.480474209Z caller=instance.go:46 tenant=single-tenant msg="creating WAL" dir=/bitnami/grafana-tempo/data/wal/single-tenant
ts=2024-07-12T10:05:50.481543325Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-0
ts=2024-07-12T10:05:50.481582922Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.481763202Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting WAL watcher" queue=prometheus-1
ts=2024-07-12T10:05:50.481772492Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Starting scraped metadata watcher"
ts=2024-07-12T10:05:50.483207769Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-0
ts=2024-07-12T10:05:50.483562004Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Replaying WAL" queue=prometheus-1
level=info ts=2024-07-12T10:05:50.48629064Z caller=db.go:336 tenant=single-tenant component=wal msg="replaying WAL, this may take a while" dir=/bitnami/grafana-tempo/data/wal/single-tenant/wal
level=info ts=2024-07-12T10:05:50.486461227Z caller=db.go:387 tenant=single-tenant component=wal msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2024-07-12T10:06:05.514274932Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6074
ts=2024-07-12T10:06:05.551793586Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-1 url=http://prometheus-monitoring-kube-prometheus-prometheus-1.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.068244585s
ts=2024-07-12T10:06:05.552795859Z caller=dedupe.go:112 tenant=single-tenant component=remote level=info remote_name=prometheus-0 url=http://prometheus-monitoring-kube-prometheus-prometheus-0.prometheus-operated.monitoring.svc.cluster.local:9090/api/v1/write msg="Done replaying WAL" duration=15.069655298s
On an idle metrics-generator, the /bitnami/grafana-tempo/data/wal path is empty. But when it starts correctly, it has data.
/bitnami/grafana-tempo/data is an emptyDir in the Kubernetes deployment, as installed from the Bitnami Helm chart.
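For reference, an emptyDir is declared per pod and is wiped whenever the pod is recreated, so the generator's WAL starts from scratch on every restart. A minimal sketch of what the rendered pod spec looks like (the volume name is an assumption, mirroring the ingester spec above):
spec:
  volumes:
    - name: data
      # ephemeral: contents are lost when the metrics-generator pod restarts
      emptyDir: {}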
It stopped again:
level=info ts=2024-07-12T10:43:33.390479119Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:43:48.393775049Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:03.390660555Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=6294
level=info ts=2024-07-12T10:44:18.379634162Z caller=registry.go:243 tenant=single-tenant msg="deleted stale series" active_series=0
level=info ts=2024-07-12T10:44:18.379708084Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
level=info ts=2024-07-12T10:44:33.379406636Z caller=registry.go:222 tenant=single-tenant msg="collecting metrics" active_series=0
As we used to have some metrics for a short period of time, I'd like to go through the metrics from the troubleshooting guide again:
- sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m])): empty
- sum(rate(tempo_distributor_metrics_generator_pushes_failures_total{}[1m])):
- sum(rate(tempo_metrics_generator_spans_discarded_total{}[1m])) by (reason):
- sum(rate(tempo_metrics_generator_registry_series_limited_total{}[1m])): empty
- prometheus_remote_storage_(samples|exemplars)_(failed|dropped)_total:
- sum(rate(tempo_metrics_generator_processor_service_graphs_expired_edges{}[1m])):
- sum(rate(tempo_metrics_generator_processor_service_graphs_edges{}[1m])):
- sum(rate(tempo_metrics_generator_processor_service_graphs_dropped_spans{}[1m])):
By checking out the code locally (2.1.1) and looking at modules/generator/storage/instance.go, we can conclude that at some point:
// New creates a metrics WAL that remote writes its data.
func New(cfg *Config, tenant string, reg prometheus.Registerer, logger log.Logger) (Storage, error) {
    logger = log.With(logger, "tenant", tenant)
    reg = prometheus.WrapRegistererWith(prometheus.Labels{"tenant": tenant}, reg)

    walDir := filepath.Join(cfg.Path, tenant)

    level.Info(logger).Log("msg", "creating WAL", "dir", walDir) // This is never called
The creating WAL log line is never printed - we cannot see it in the logs.
The code has a chance to return before the New method is called in two places in modules/generator/generator.go:
func (g *Generator) getOrCreateInstance(instanceID string) (*instance, error) {
    inst, ok := g.getInstanceByID(instanceID)
    if ok {
        return inst, nil
    } // <<< HERE

    g.instancesMtx.Lock()
    defer g.instancesMtx.Unlock()

    inst, ok = g.instances[instanceID] // <<< OR HERE
    if ok {
        return inst, nil
    }

    // This is never reached apparently?
    inst, err := g.createInstance(instanceID)
    if err != nil {
        return nil, err
    }
    g.instances[instanceID] = inst
    return inst, nil
}
Why it's not being called, I am not able to tell :( Any help appreciated here. Thank you.
I would definitely not recommend upgrading from 2.1.x to 2.5.x. We aggressively rotate through block formats and it's quite possible blocks created by 2.1 are not supported by 2.5. Also, if you're doing fresh installs, can you just install 2.5?
But I'm not sure if that is causing your issue or not. You're saying that on a fresh install of 2.1.1 it worked for a while and then quit? Those graphs suggest the distributors just stopped sending data to the generators. Can you check /metrics-generator/ring on the distributors?
Why it's not being called, I am not able to tell :( Any help appreciated here. Thank you.
I believe it's not being called b/c if you don't send data to the generator it will never create an instance. It creates a per-tenant instance on demand. The missing log line suggests it's just not receiving any data, which is consistent with your other graphs.
In my Tempo distributor metrics, I found these metrics:
# HELP tempo_distributor_ingester_clients The current number of ingester clients.
# TYPE tempo_distributor_ingester_clients gauge
tempo_distributor_ingester_clients 1
# HELP tempo_distributor_metrics_generator_clients The current number of metrics-generator clients.
# TYPE tempo_distributor_metrics_generator_clients gauge
tempo_distributor_metrics_generator_clients 0
It seems the metrics-generator client is not configured.
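As far as I understand, the distributor only creates metrics-generator clients for tenants that have at least one generator processor enabled via overrides, so with no processors enabled this gauge stays at 0. A minimal sketch of what the per-tenant overrides file referenced by per_tenant_override_config could contain (the tenant name and the legacy key layout are assumptions to verify against your Tempo version):
overrides:
  "single-tenant":
    metrics_generator_processors:
      - service-graphs
      - span-metrics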
@nc-gcz IIRC, have you tried enabling the metrics generator by setting values in global_overrides?
global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics
https://github.com/grafana/helm-charts/tree/main/charts/tempo-distributed#activate-metrics-generator
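For completeness, a minimal sketch of the corresponding values for the official tempo-distributed chart (key names differ between chart versions, so treat this as an assumption to check against the chart's values.yaml rather than a drop-in snippet):
metricsGenerator:
  # the generator deployment must be enabled in addition to the overrides below
  enabled: true
global_overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics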
Describe the bug
Good morning. We are running Grafana Tempo 2.5.0 using the Bitnami Grafana Tempo distributed Helm chart in version 3.6.3. The following components are deployed: Compactor, Distributor, Memcached, Metrics Generator, Querier, Query Frontend, Vulture and OpenTelemetry Collector in version 0.83.0.
We have an issue with the Metrics Generator after updating Grafana Tempo 2.1.1 to 2.5.0. The issues with the incompatible vParquet version have been mitigated according to the documentation: https://grafana.com/docs/tempo/latest/setup/upgrade/#upgrade-to-tempo-25. The issue we are experiencing is that metrics-generator seems to be idle after start-up. This is the log from a metrics-generator that was started around 60 minutes earlier. The relevant configuration is as follows:
Yet, no service maps or span metrics are visible in Grafana. These metrics were correctly persisted in Prometheus via remote write on Tempo 2.1.1 but broke after upgrading to 2.5.0. Also, the internal metrics defined on the documentation page https://grafana.com/docs/tempo/latest/troubleshooting/metrics-generator/ (the first 3-4) are reporting empty data in Prometheus - it seems like they are not reported at all.
I think I've followed the documentation to enable the processor in metrics_generator, yet it's idling. Before, I was expecting more output, but I cannot see anything after joining the cluster - nothing about connecting to Prometheus, nothing about not being able to connect to the Prometheus remote-write, nothing.
Interestingly enough, running Grafana Tempo from ./examples/docker-compose/distributed, I was able to see the metrics out of the box.
To Reproduce: Steps to reproduce the behavior:
Expected behavior: Metrics Generator is not idle and metrics are visible in Prometheus.
Environment:
v11.1.0 (5b85c4c2fc)