ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.13k stars 3.01k forks source link

An error has occurred while serving routing_http_client metrics #9891

Open lidel opened 1 year ago

lidel commented 1 year ago

Version

0.20.0

Config

"Routing": {
        "Methods": null,
        "Routers": null,
        "Type": "auto"
    },

Description

http://127.0.0.1:5001/debug/metrics/prometheus ends up in broken state after a multiple days of uptime:

An error has occurred while serving metrics:

14 error(s) occurred:
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values

Seems we have a bug in boxo/routing/http/client, needs analysis.

lidel commented 1 year ago

I was unable to reproduce this with vanilla Docker image for 0.20.0, but I can reproduce it every time with my own repo by asking for a CID without any providers or setting /etc/hosts to use 127.0.0.1 for cid.contact so it always fails:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values

Will try to poke more, want to confirm if this is or is not something that impacts ipfs.io bifrost-infra.

lidel commented 1 year ago

bifrost-infra box running 0.20 did not experience this, which was a good sign.

I've focused on the differences in config, and the problem only occurs if RPC API is exposed on more than one port. Each time we expose a listener, it re-register metrics, which then creates the surface for this bug.

TLDR:

Mayeu commented 9 months ago

Hello,

We are getting this error on our two nodes as well (running Kubo 0.24.0), but we don't have multiple listener setup:

"Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
}

When I restart kubo, there is a few seconds during which the metrics are printed, and then this error shows up.

Anything I can provide to help debug this?

Mayeu commented 9 months ago

Same situation with Kubo 0.25 (I did not yet tried 0.26 since I'm waiting for it to be released in the unstable branch of NixOS).

gammazero commented 2 months ago

This can be reproduced in v0.29.0 by:

ipfs init
edit ~/.ipfs/config
  "Addresses": {
    "API": ["/ip4/127.0.0.1/tcp/5001", "/ip6/::1/tcp/5001"],
ipfs daemon
echo hello | ipfs add
open http://127.0.0.1:5001/debug/metrics/prometheus