An error has occurred while serving routing_http_client metrics

lidel commented 1 year ago

Version

0.20.0

Config

"Routing": {
        "Methods": null,
        "Routers": null,
        "Type": "auto"
    },

Description

http://127.0.0.1:5001/debug/metrics/prometheus ends up in broken state after a multiple days of uptime:

An error has occurred while serving metrics:

14 error(s) occurred:
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values

Seems we have a bug in boxo/routing/http/client, needs analysis.

lidel commented 1 year ago

I was unable to reproduce this with vanilla Docker image for 0.20.0, but I can reproduce it every time with my own repo by asking for a CID without any providers or setting /etc/hosts to use 127.0.0.1 for cid.contact so it always fails:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values

Will try to poke more, want to confirm if this is or is not something that impacts ipfs.io bifrost-infra.

lidel commented 1 year ago

bifrost-infra box running 0.20 did not experience this, which was a good sign.

I've focused on the differences in config, and the problem only occurs if RPC API is exposed on more than one port. Each time we expose a listener, it re-register metrics, which then creates the surface for this bug.

TLDR:

The bug can be reproduced every time ONLY with custom config that has more than one listener defined in Addresses.API
There is no bug with the default config, so moving this to P2.

Mayeu commented 9 months ago

Hello,

We are getting this error on our two nodes as well (running Kubo 0.24.0), but we don't have multiple listener setup:

"Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
}

When I restart kubo, there is a few seconds during which the metrics are printed, and then this error shows up.

Anything I can provide to help debug this?

Mayeu commented 9 months ago

Same situation with Kubo 0.25 (I did not yet tried 0.26 since I'm waiting for it to be released in the unstable branch of NixOS).

gammazero commented 2 months ago

This can be reproduced in v0.29.0 by:

ipfs init
edit ~/.ipfs/config

  "Addresses": {
    "API": ["/ip4/127.0.0.1/tcp/5001", "/ip6/::1/tcp/5001"],

ipfs daemon
echo hello | ipfs add
open http://127.0.0.1:5001/debug/metrics/prometheus

ipfs / kubo