prometheus-community / stackdriver_exporter

Google Stackdriver Prometheus exporter

GCP load balancer https/backend_latencies buckets issue #241

Open namm2 opened 1 year ago

namm2 commented 1 year ago

Hi, I'm using stackdriver-exporter v0.13.0, and only recently someone noticed that https/backend_latencies has values in all buckets up to 4.410119471141699e+09, for example:

```
stackdriver_https_lb_rule_loadbalancing_googleapis_com_https_backend_latencies_bucket{backend_name="[redacted]",backend_scope="europe-west1-d",backend_scope_type="ZONE",backend_target_name="[redacted]",backend_target_type="BACKEND_SERVICE",backend_type="NETWORK_ENDPOINT_GROUP",cache_result="DISABLED",client_country="Belgium",forwarding_rule_name="[redacted]",matched_url_path_rule="/",project_id="[redacted]",protocol="HTTP/2.0",proxy_continent="Europe",region="global",response_code="200",response_code_class="200",target_proxy_name="[redacted]",unit="ms",url_map_name="[redacted]",le="4.410119471141699e+09"} 2 1688377920000
```

When I check the GCP Metrics Explorer, the backend latency doesn't show these buckets, so I guess something is broken between GCP Cloud Monitoring metrics and stackdriver-exporter.

I also pulled the Cloud Monitoring metric directly (a sampled `ListTimeSeriesPager` response).
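For reference, the equivalent pull in Go looks roughly like this. It is a minimal sketch, not the exact query I ran: the project ID, time window, and printed fields are placeholders.

```go
// Sketch: list loadbalancing.googleapis.com/https/backend_latencies points
// and dump the raw distribution (count, exponential bucket options,
// bucket_counts). Project ID and time range are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	monitoring "cloud.google.com/go/monitoring/apiv3/v2"
	"cloud.google.com/go/monitoring/apiv3/v2/monitoringpb"
	"google.golang.org/api/iterator"
	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	ctx := context.Background()
	client, err := monitoring.NewMetricClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	now := time.Now()
	req := &monitoringpb.ListTimeSeriesRequest{
		Name:   "projects/my-project", // placeholder
		Filter: `metric.type = "loadbalancing.googleapis.com/https/backend_latencies"`,
		Interval: &monitoringpb.TimeInterval{
			StartTime: timestamppb.New(now.Add(-10 * time.Minute)),
			EndTime:   timestamppb.New(now),
		},
		View: monitoringpb.ListTimeSeriesRequest_FULL,
	}

	it := client.ListTimeSeries(ctx, req)
	for {
		ts, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		for _, p := range ts.GetPoints() {
			d := p.GetValue().GetDistributionValue()
			exp := d.GetBucketOptions().GetExponentialBuckets()
			fmt.Printf("count=%d num_finite_buckets=%d growth=%v scale=%v bucket_counts=%v\n",
				d.GetCount(), exp.GetNumFiniteBuckets(), exp.GetGrowthFactor(),
				exp.GetScale(), d.GetBucketCounts())
		}
	}
}
```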

I cross-checked the documentation at https://cloud.google.com/monitoring/api/ref_v3/rest/v3/TypedValue#distribution, and these buckets look fine; the observed values sit in the low range of the num_finite_buckets buckets.
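To double-check the boundaries themselves, the upper bounds can be recomputed from that spec: bucket i (counting from the underflow bucket at i = 0) has an upper bound of scale * growth_factor^i. A small sketch follows; scale=1, growth_factor=1.4, num_finite_buckets=66 are what the raw distribution appears to use, and they seem to reproduce the le values in the exporter output:

```go
// Sketch: recompute exponential bucket upper bounds as documented for
// Distribution.BucketOptions.Exponential. The parameter values below are
// taken from the distribution I pulled; treat them as an example.
package main

import (
	"fmt"
	"math"
)

// Per the spec there are numFiniteBuckets + 2 buckets in total; bucket i
// (0 <= i <= numFiniteBuckets) has upper bound scale * growthFactor^i, and
// the last bucket is unbounded (+Inf once converted to Prometheus).
func exponentialUpperBounds(numFiniteBuckets int, growthFactor, scale float64) []float64 {
	bounds := make([]float64, 0, numFiniteBuckets+1)
	for i := 0; i <= numFiniteBuckets; i++ {
		bounds = append(bounds, scale*math.Pow(growthFactor, float64(i)))
	}
	return bounds
}

func main() {
	// scale=1, growth_factor=1.4 gives 1, 1.4, 1.96, 2.744, ... and
	// bound 66 is roughly 4.41e+09, matching the largest le label.
	bounds := exponentialUpperBounds(66, 1.4, 1.0)
	fmt.Println(bounds[:4], bounds[len(bounds)-1])
}
```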

That leads me to these lines in stackdriver-exporter: https://github.com/prometheus-community/stackdriver_exporter/blob/be2625d7598866ef443ee7b61ecd7ed37e462eae/collectors/monitoring_collector.go#L546-L553

I suspect the last value from the previous calculation is assigned to the rest of the buckets instead of 0, so I printed the buckets map to confirm the theory:

```
map[1:0 1.4:0 1.9599999999999997:0 2.7439999999999993:5 3.841599999999999:5 5.378239999999998:5 7.529535999999997:5 10.541350399999994:5 14.757890559999991:5 20.661046783999986:5 28.92546549759998:5 40.49565169663997:5 56.693912375295945:5 79.37147732541432:5 111.12006825558004:5 155.56809555781203:5 217.79533378093686:5 304.91346729331156:5 426.8788542106362:5 597.6303958948906:5 836.6825542528468:5 1171.3555759539854:5 1639.8978063355794:5 2295.856928869811:5 3214.199700417735:5 4499.879580584829:5 6299.83141281876:5 8819.763977946264:5 12347.669569124768:5 17286.73739677467:5 24201.43235548454:5 33882.00529767835:5 47434.807416749696:5 66408.73038344957:5 92972.22253682939:5 130161.11155156113:5 182225.55617218558:5 255115.7786410598:5 357162.09009748365:5 500026.9261364771:5 700037.696591068:5 980052.775227495:5 1.372073885318493e+06:5 1.92090343944589e+06:5 2.6892648152242457e+06:5 3.7649707413139436e+06:5 5.270959037839521e+06:5 7.379342652975327e+06:5 1.0331079714165458e+07:5 1.446351159983164e+07:5 2.0248916239764296e+07:5 2.8348482735670015e+07:5 3.968787582993802e+07:5 5.5563026161913216e+07:5 7.77882366266785e+07:5 1.0890353127734989e+08:5 1.5246494378828984e+08:5 2.1345092130360574e+08:5 2.98831289825048e+08:5 4.183638057550673e+08:5 5.857093280570941e+08:5 8.199930592799315e+08:5 1.147990282991904e+09:5 1.6071863961886654e+09:5 2.250060954664132e+09:5 3.1500853365297847e+09:5 4.410119471141699e+09:5 +Inf:5]
```
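To make that suspicion concrete, this is the pattern I have in mind, heavily simplified (it is not the exporter's exact code from the lines linked above): Stackdriver returns per-bucket counts, the exporter emits cumulative buckets, and once bucket_counts runs out the running value is written into every remaining bucket:

```go
// Simplified sketch of converting a Stackdriver distribution into
// cumulative Prometheus-style buckets. bucketBounds and bucketCounts stand
// in for the values derived from the API response.
package main

import "fmt"

func toCumulativeBuckets(bucketBounds []float64, bucketCounts []int64) map[float64]uint64 {
	buckets := map[float64]uint64{}
	var running uint64
	for i, bound := range bucketBounds {
		// Stackdriver may send fewer bucket_counts entries than there are
		// generated bounds; past its end, `running` keeps its last value and
		// is written into every remaining (higher) bucket.
		if i < len(bucketCounts) {
			running += uint64(bucketCounts[i])
		}
		buckets[bound] = running
	}
	return buckets
}

func main() {
	bounds := []float64{1, 1.4, 1.96, 2.744, 3.8416} // truncated example
	counts := []int64{0, 0, 0, 5}                    // all observations in the 4th bucket
	fmt.Println(toCumulativeBuckets(bounds, counts))
	// Every bound >= 2.744 ends up with 5, the same shape as the map above.
}
```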

In our long-term TSDB (Thanos), the first data point with this large bucket dates back to 2023-02-13 22:30:00 UTC, so I'm not sure what caused this issue with the bucket distribution.

Does anybody have the same issue? And is my suspicion correct?