knative / eventing

Event-driven application platform for Kubernetes
https://knative.dev/docs/eventing
Apache License 2.0

Broker ingresses and filters silently stop exposing metrics #4645

Closed antoineco closed 3 years ago

antoineco commented 3 years ago

Describe the bug

At some point, broker ingresses and filters start serving the following error messages at :9092/metrics instead of metrics.

Ingress

An error has occurred while serving metrics:
24 error(s) occurred:
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > counter:<value:1 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_dispatch_latencies" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > histogram:<sample_count:1 sample_sum:28 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user2" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > counter:<value:3 > } was collected before with the same name and label values
* collected metric "mt_broker_ingress_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user2" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-67ff9c869b-n88c8e5a848d2c97c277c43d81512e6cdf" > histogram:<sample_count:3 sample_sum:16231.000000000002 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:2 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > > } was collected before with the same name and label values

...

(24 times with different namespace_name)

I noticed this always happens with the following tags, although I'm currently sending a lot more event types than that:

"event_type" value:"dev.knative.sources.ping"

Filter

An error has occurred while serving metrics:

104 error(s) occurred:
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:17 > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"200" > label:<name:"response_code_class" value:"2xx" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:17 sample_sum:134 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:9 upper_bound:5 > bucket:<cumulative_count:13 upper_bound:10 > bucket:<cumulative_count:15 upper_bound:20 > bucket:<cumulative_count:17 upper_bound:50 > bucket:<cumulative_count:17 upper_bound:100 > bucket:<cumulative_count:17 upper_bound:200 > bucket:<cumulative_count:17 upper_bound:500 > bucket:<cumulative_count:17 upper_bound:1000 > bucket:<cumulative_count:17 upper_bound:2000 > bucket:<cumulative_count:17 upper_bound:5000 > bucket:<cumulative_count:17 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user1" > label:<name:"trigger_name" value:"sockeye" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:17 sample_sum:75 bucket:<cumulative_count:10 upper_bound:1 > bucket:<cumulative_count:10 upper_bound:2 > bucket:<cumulative_count:10 upper_bound:5 > bucket:<cumulative_count:12 upper_bound:10 > bucket:<cumulative_count:16 upper_bound:20 > bucket:<cumulative_count:17 upper_bound:50 > bucket:<cumulative_count:17 upper_bound:100 > bucket:<cumulative_count:17 upper_bound:200 > bucket:<cumulative_count:17 upper_bound:500 > bucket:<cumulative_count:17 upper_bound:1000 > bucket:<cumulative_count:17 upper_bound:2000 > bucket:<cumulative_count:17 upper_bound:5000 > bucket:<cumulative_count:17 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user2" > label:<name:"trigger_name" value:"go-lambda-trigger" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:2 sample_sum:13 bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > bucket:<cumulative_count:2 upper_bound:20 > bucket:<cumulative_count:2 upper_bound:50 > bucket:<cumulative_count:2 upper_bound:100 > bucket:<cumulative_count:2 upper_bound:200 > bucket:<cumulative_count:2 upper_bound:500 > bucket:<cumulative_count:2 upper_bound:1000 > bucket:<cumulative_count:2 upper_bound:2000 > bucket:<cumulative_count:2 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:4 > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_dispatch_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:4 sample_sum:130 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:4 upper_bound:50 > bucket:<cumulative_count:4 upper_bound:100 > bucket:<cumulative_count:4 upper_bound:200 > bucket:<cumulative_count:4 upper_bound:500 > bucket:<cumulative_count:4 upper_bound:1000 > bucket:<cumulative_count:4 upper_bound:2000 > bucket:<cumulative_count:4 upper_bound:5000 > bucket:<cumulative_count:4 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_processing_latencies" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > histogram:<sample_count:4 sample_sum:47.99999999999999 bucket:<cumulative_count:2 upper_bound:1 > bucket:<cumulative_count:2 upper_bound:2 > bucket:<cumulative_count:3 upper_bound:5 > bucket:<cumulative_count:3 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:4 upper_bound:50 > bucket:<cumulative_count:4 upper_bound:100 > bucket:<cumulative_count:4 upper_bound:200 > bucket:<cumulative_count:4 upper_bound:500 > bucket:<cumulative_count:4 upper_bound:1000 > bucket:<cumulative_count:4 upper_bound:2000 > bucket:<cumulative_count:4 upper_bound:5000 > bucket:<cumulative_count:4 upper_bound:10000 > > } was collected before with the same name and label values
* collected metric "mt_broker_filter_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"filter" > label:<name:"filter_type" value:"any" > label:<name:"namespace_name" value:"user3" > label:<name:"response_code" value:"502" > label:<name:"response_code_class" value:"5xx" > label:<name:"trigger_name" value:"tekton" > label:<name:"unique_name" value:"mt-broker-filter-56964c97dd-5d5ce54f1b5276f60c0a4ee8462d7f216b7" > counter:<value:1 > } was collected before with the same name and label values

...
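A quick way to check an individual replica directly, assuming a local port-forward to its metrics port (the port-forward target and the localhost URL below are assumptions based on the :9092 endpoint above; this is just a minimal sketch):

package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Assumes something like
	//   kubectl -n knative-eventing port-forward <mt-broker-ingress pod> 9092:9092
	// is running locally.
	resp, err := http.Get("http://localhost:9092/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	// A healthy replica returns Prometheus text; an affected one returns the
	// "An error has occurred while serving metrics:" banner quoted above.
	if strings.HasPrefix(string(body), "An error has occurred") {
		fmt.Println("replica is affected:")
	}
	fmt.Print(string(body))
}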

This becomes obvious when looking at a Grafana dashboard that uses those metrics. Below, I'm supposed to see 10 replicas and a rate of exactly 10K events/sec, but only 4, then 3, then 2, then eventually 0 replicas report metrics:

(screenshot: Grafana dashboard with replicas progressively dropping out of the reported metrics)

Expected behavior

No failure.

To Reproduce

Create a few ping sources I suppose (we currently have 19 in this cluster, spread across multiple namespaces), then generate some load on the ingress, e.g. using vegeta.
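For the load-generation part, a minimal vegeta sketch using its Go library (the broker URL path, event attributes, rate and duration below are placeholders for illustration, not a confirmed reproducer):

package main

import (
	"fmt"
	"net/http"
	"time"

	vegeta "github.com/tsenart/vegeta/v12/lib"
)

func main() {
	// POST binary-mode CloudEvents to the broker ingress. The URL assumes the
	// user1 namespace / default broker from above and in-cluster resolution
	// (or a port-forward rewritten to localhost).
	target := vegeta.Target{
		Method: "POST",
		URL:    "http://broker-ingress.knative-eventing.svc.cluster.local/user1/default",
		Header: http.Header{
			"Content-Type":   {"application/json"},
			"Ce-Specversion": {"1.0"},
			"Ce-Type":        {"example.loadtest"},
			"Ce-Source":      {"vegeta"},
			"Ce-Id":          {"load-test"}, // a static ID is fine for load generation
		},
		Body: []byte(`{"msg":"hello"}`),
	}

	rate := vegeta.Rate{Freq: 1000, Per: time.Second} // 1k events/sec
	attacker := vegeta.NewAttacker()

	var metrics vegeta.Metrics
	for res := range attacker.Attack(vegeta.NewStaticTargeter(target), rate, 5*time.Minute, "broker-load") {
		metrics.Add(res)
	}
	metrics.Close()

	fmt.Printf("success ratio: %.2f%%, p99 latency: %s\n", metrics.Success*100, metrics.Latencies.P99)
}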

Knative release version

Eventing v0.19.2

Additional context

antoineco commented 3 years ago

/assign

antoineco commented 3 years ago

I tried updating github.com/prometheus/client_golang to v1.8.0 today, just to check whether the problem still happens with the latest version, and unfortunately it does.

I'll keep investigating.


antoineco commented 3 years ago

I can see the issue occur by simply adding a few print statements to the file below:

diff --git a/vendor/github.com/prometheus/client_golang/prometheus/registry.go b/vendor/github.com/prometheus/client_golang/prometheus/registry.go
index b5e70b93f..6ff13f2ca 100644
--- a/vendor/github.com/prometheus/client_golang/prometheus/registry.go
+++ b/vendor/github.com/prometheus/client_golang/prometheus/registry.go
@@ -477,6 +477,7 @@ func (r *Registry) Gather() ([]*dto.MetricFamily, error) {
                                cmc = nil
                                break
                        }
+                       fmt.Println("+ Processing checked metric")
                        errs.Append(processMetric(
                                metric, metricFamiliesByName,
                                metricHashes,
@@ -487,6 +488,7 @@ func (r *Registry) Gather() ([]*dto.MetricFamily, error) {
                                umc = nil
                                break
                        }
+                       fmt.Println("+ Processing unchecked metric")
                        errs.Append(processMetric(
                                metric, metricFamiliesByName,
                                metricHashes,
@@ -504,6 +506,7 @@ func (r *Registry) Gather() ([]*dto.MetricFamily, error) {
                                                cmc = nil
                                                break
                                        }
+                                       fmt.Println("+ Processing checked metric")
                                        errs.Append(processMetric(
                                                metric, metricFamiliesByName,
                                                metricHashes,
@@ -514,6 +517,7 @@ func (r *Registry) Gather() ([]*dto.MetricFamily, error) {
                                                umc = nil
                                                break
                                        }
+                                       fmt.Println("+ Processing unchecked metric")
                                        errs.Append(processMetric(
                                                metric, metricFamiliesByName,
                                                metricHashes,
@@ -586,6 +590,7 @@ func processMetric(
        if err := metric.Write(dtoMetric); err != nil {
                return fmt.Errorf("error collecting metric %v: %s", desc, err)
        }
+       fmt.Printf(": %+v\n", *dtoMetric)
        metricFamily, ok := metricFamiliesByName[desc.fqName]
        if ok { // Existing name.
                if metricFamily.GetHelp() != desc.help {
@@ -886,6 +891,7 @@ func checkMetricConsistency(
                h = hashAddByte(h, separatorByte)
        }
        if _, exists := metricHashes[h]; exists {
+               fmt.Println("BOOM", *dtoMetric)
                return fmt.Errorf(
                        "collected metric %q { %s} was collected before with the same name and label values",
                        name, dtoMetric,

After about 3 min the first duplicate metric appeared. I got lucky because it took another 30 min for the second ingress replica to fail. Same metric.

The 2 cases are shown below:

+ Processing unchecked metric

(seen once)
: {Label:[name:"broker_name" value:"events"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user1"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-pwa3654bd9790cc8f3574913c8174e70cc" ] Gauge:<nil> Counter:<nil> Summary:<nil> Untyped:<nil> Histogram:sample_count:1 sample_sum:18 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:1 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:10000 >  TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen once)
: {Label:[name:"broker_name" value:"events"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user1"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-pwa3654bd9790cc8f3574913c8174e70cc" ] Gauge:<nil> Counter:value:1  Summary:<nil> Untyped:<nil> Histogram:<nil> TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen twice -> BOOM)
: {Label:[name:"broker_name" value:"events"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user1"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-pwa3654bd9790cc8f3574913c8174e70cc" ] Gauge:<nil> Counter:value:1  Summary:<nil> Untyped:<nil> Histogram:<nil> TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen twice -> BOOM)
: {Label:[name:"broker_name" value:"events"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user1"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-pwa3654bd9790cc8f3574913c8174e70cc" ] Gauge:<nil> Counter:<nil> Summary:<nil> Untyped:<nil> Histogram:sample_count:1 sample_sum:16 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:1 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:10000 >  TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen once)
: {Label:[name:"broker_name" value:"default"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user2"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-bcee668ec656bb16aea3ab1bce5d0db3cf" ] Gauge:<nil> Counter:value:8  Summary:<nil> Untyped:<nil> Histogram:<nil> TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen once)
: {Label:[name:"broker_name" value:"default"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user2"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-bcee668ec656bb16aea3ab1bce5d0db3cf" ] Gauge:<nil> Counter:<nil> Summary:<nil> Untyped:<nil> Histogram:sample_count:8 sample_sum:55.00000000000001 bucket:<cumulative_count:4 upper_bound:1 > bucket:<cumulative_count:4 upper_bound:2 > bucket:<cumulative_count:5 upper_bound:5 > bucket:<cumulative_count:5 upper_bound:10 > bucket:<cumulative_count:8 upper_bound:20 > bucket:<cumulative_count:8 upper_bound:50 > bucket:<cumulative_count:8 upper_bound:100 > bucket:<cumulative_count:8 upper_bound:200 > bucket:<cumulative_count:8 upper_bound:500 > bucket:<cumulative_count:8 upper_bound:1000 > bucket:<cumulative_count:8 upper_bound:2000 > bucket:<cumulative_count:8 upper_bound:5000 > bucket:<cumulative_count:8 upper_bound:10000 >  TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen twice -> BOOM)
: {Label:[name:"broker_name" value:"default"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user2"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-bcee668ec656bb16aea3ab1bce5d0db3cf" ] Gauge:<nil> Counter:value:1  Summary:<nil> Untyped:<nil> Histogram:<nil> TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

+ Processing unchecked metric

(seen twice -> BOOM)
: {Label:[name:"broker_name" value:"default"  name:"container_name" value:"ingress"  name:"event_type" value:"dev.knative.sources.ping"  name:"namespace_name" value:"user2"  name:"response_code" value:"202"  name:"response_code_class" value:"2xx"  name:"unique_name" value:"mt-broker-ingress-56d9cf8689-bcee668ec656bb16aea3ab1bce5d0db3cf" ] Gauge:<nil> Counter:<nil> Summary:<nil> Untyped:<nil> Histogram:sample_count:1 sample_sum:18 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:1 upper_bound:20 > bucket:<cumulative_count:1 upper_bound:50 > bucket:<cumulative_count:1 upper_bound:100 > bucket:<cumulative_count:1 upper_bound:200 > bucket:<cumulative_count:1 upper_bound:500 > bucket:<cumulative_count:1 upper_bound:1000 > bucket:<cumulative_count:1 upper_bound:2000 > bucket:<cumulative_count:1 upper_bound:5000 > bucket:<cumulative_count:1 upper_bound:10000 >  TimestampMs:<nil> XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

Once a conflict has occurred, it occurs again every time this metric goes through processMetric(), until the process gets terminated. E.g. for one of the metrics above, I always see 2 occurrences closely following each other, with a "BOOM" on the second occurrence.

I find the second log interesting because, contrary to the first log, the metric value changes when the failure occurs. Take the counter as an example: I see its value increase normally in the logs (value: 1, 2, 3, ..., 8) over 30 min, then suddenly a value of 1 appears.

I'm not yet able to tell whether this is relevant for this issue though.
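For reference, the registry error itself is easy to trigger in isolation. Below is a minimal standalone sketch with plain client_golang (nothing Knative-specific; the metric name, labels and the 8/1 counter values mirror the logs above, the collector itself is made up): an unchecked collector that emits the same name+labels twice makes Gather() fail exactly like the scrapes above.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// dupCollector emits the same series twice per scrape, which is what the
// exporter appears to end up doing once its state becomes inconsistent.
type dupCollector struct{}

// An empty Describe makes this an "unchecked" collector, matching the
// "Processing unchecked metric" lines in the logs above.
func (dupCollector) Describe(chan<- *prometheus.Desc) {}

func (dupCollector) Collect(ch chan<- prometheus.Metric) {
	desc := prometheus.NewDesc(
		"mt_broker_ingress_event_count",
		"demo of a duplicate series",
		nil,
		prometheus.Labels{"broker_name": "default", "namespace_name": "user1"},
	)
	ch <- prometheus.MustNewConstMetric(desc, prometheus.CounterValue, 8)
	ch <- prometheus.MustNewConstMetric(desc, prometheus.CounterValue, 1) // same name+labels, different value
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(dupCollector{})

	// Fails with: collected metric "mt_broker_ingress_event_count" {...}
	// was collected before with the same name and label values
	if _, err := reg.Gather(); err != nil {
		fmt.Println(err)
	}
}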

antoineco commented 3 years ago

This issue never occurs with the channel dispatcher (for me). The only difference I see in the Broker's code in terms of stats reporting is this:

https://github.com/knative/eventing/blob/1e06e5ce4c64d0f31e7f687fff942493e35fe060/pkg/mtbroker/ingress/stats_reporter.go#L144-L152

versus just

https://github.com/knative/eventing/blob/1e06e5ce4c64d0f31e7f687fff942493e35fe060/pkg/channel/stats_reporter.go#L143-L144

This causes the ros variable in the code below to contain an extra stats.Options carrying the Broker's info, but it's not obvious how/if this causes the breakage:

https://github.com/knative/eventing/blob/1e06e5ce4c64d0f31e7f687fff942493e35fe060/vendor/knative.dev/pkg/metrics/config.go#L170-L178
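To illustrate the pattern in isolation, here is a minimal OpenCensus sketch of appending per-call stats.Options before stats.RecordWithOptions, invoked from several goroutines since events are processed concurrently (the measure, view and tag names are made up for illustration; this is not the Knative reporter code):

package main

import (
	"context"
	"log"
	"sync"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	eventCountM = stats.Int64("event_count", "Number of events received", stats.UnitDimensionless)
	brokerKey   = tag.MustNewKey("broker_name")
)

func main() {
	if err := view.Register(&view.View{
		Name:        "event_count",
		Measure:     eventCountM,
		Aggregation: view.Count(),
		TagKeys:     []tag.Key{brokerKey},
	}); err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Per-call options appended before recording, analogous to the
			// extra stats.Options carrying the Broker info.
			ros := []stats.Options{
				stats.WithTags(tag.Upsert(brokerKey, "default")),
				stats.WithMeasurements(eventCountM.M(1)),
			}
			if err := stats.RecordWithOptions(context.Background(), ros...); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}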

By adding some more print statements, I can see that the function above is called concurrently, which I expected since events are processed concurrently (this is also happening in the dispatcher, without causing any failure):

diff --git a/vendor/knative.dev/pkg/metrics/config.go b/vendor/knative.dev/pkg/metrics/config.go
index d4ae24ad5..cd1cdcbdb 100644
--- a/vendor/knative.dev/pkg/metrics/config.go
+++ b/vendor/knative.dev/pkg/metrics/config.go
@@ -174,6 +174,12 @@ func (mc *metricsConfig) record(ctx context.Context, mss []stats.Measurement, ro
                }
                ros = append(ros, opt)

+               fmt.Println("+ ros")
+               fmt.Println(":", ros)
+
+               fmt.Println("+ mss")
+               fmt.Println(":", mss[0])
+
                return stats.RecordWithOptions(ctx, append(ros, stats.WithMeasurements(mss...))...)
        }
        return mc.recorder(ctx, mss, ros...)

Output from the added prints:

+ ros
: [0x13990a0]
+ mss
: {0 0xc0000fc430 0xc00056b280}
+ ros
: [0x13990a0]
+ mss
: {1 0xc0000fc428 0xc00056b240}

[...] interleaved "print"

+ ros
: [0x13990a0]
+ mss
+ ros
: [0x13990a0]
: {0 0xc0000fc430 0xc00056b280}

[...] 2 consecutive "print mss"

+ ros
: [0x13990a0]
+ mss
: {1 0xc0000fc428 0xc00056b240}
+ mss
: {1 0xc0000fc428 0xc00056b240}

+ ros
: [0x13990a0]
+ mss
: {1 0xc0000fc430 0xc00056b280}
...

antoineco commented 3 years ago

After a sufficient amount of time, I sometimes get more than 2 occurrences of the same metric with different values. For example, this namespace has 2 ping sources (same schedule, different brokers), and for each ping source I see 3 or even 4 occurrences of the error, with different counter values:

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:350 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:350 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:1 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:19 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"events" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:6 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:1 > } was collected before with the same name and label values

* collected metric "mt_broker_ingress_event_count" { label:<name:"broker_name" value:"default" > label:<name:"container_name" value:"ingress" > label:<name:"event_type" value:"dev.knative.sources.ping" > label:<name:"namespace_name" value:"user1" > label:<name:"response_code" value:"202" > label:<name:"response_code_class" value:"2xx" > label:<name:"unique_name" value:"mt-broker-ingress-579fd5948b-920d6c5398701cf378bb2a0ed912fb0002" > counter:<value:20 > } was collected before with the same name and label values

antoineco commented 3 years ago

I noticed something interesting regarding ping sources, since breakages always happen with ping events. About every 10 minutes I see this happening:

$ kubectl -n user1 get pingsources.sources.knative.dev hello -w
NAME    SINK                                                                          SCHEDULE      AGE   READY   REASON
hello   http://broker-ingress.knative-eventing.svc.cluster.local/user1/default   */1 * * * *   33d   True

(few minutes...)

hello   http://broker-ingress.knative-eventing.svc.cluster.local/user1/default   */1 * * * *   33d   False   MinimumReplicasUnavailable
hello   http://broker-ingress.knative-eventing.svc.cluster.local/user1/default   */1 * * * *   33d   True

I noticed the pingsource adapter Pod was replaced regularly:

$ kubectl -n knative-eventing get pod -l eventing.knative.dev/source=ping-source-controller -w
NAME                                     READY   STATUS      RESTARTS   AGE
pingsource-mt-adapter-85d59c8666-sq2dn   1/1     Terminating   0          8m46s
pingsource-mt-adapter-85d59c8666-h887f   0/1     Pending       0          0s
pingsource-mt-adapter-85d59c8666-h887f   1/1     Running       0          6s

I'm not sure how this influences metrics, but that's what I'm going to explore next.

lionelvillard commented 3 years ago

Can you try to capture the adapter logs? Maybe the lease does not get renewed, or a crash?

How come you have two pods with the exact same name?

antoineco commented 3 years ago

@lionelvillard I don't, I used get -w.

I could, but let's do that in another issue (I think it's the operator scaling the adapter to 0).

Back to the topic, given the information above, do you have any idea why the ping source (and only this source apparently) could cause Prometheus metrics to become inconsistent? I've spent 3 entire days digging and writing tests to reproduce without success, but after a few minutes or hours it always occurs in "prod" (in a cluster that currently runs ping sources exclusively).

antoineco commented 3 years ago

Since I haven't been successful in my attempts to forcibly cause the issue in a test, I'll just dump a manifest with the entire list of ping sources running in the affected cluster (together with namespaces + brokers for convenience), so that anyone can try to observe what happens to the metrics endpoint of the Broker's ingress after some time.

cluster-dump-issue-4645.yaml.gz

antoineco commented 3 years ago

This was supposedly fixed in 0.20 by knative/pkg#2005. I will test and report.

lberk commented 3 years ago

@antoineco any results from your tests?

antoineco commented 3 years ago

@lberk still occurring on v0.21.0

lberk commented 3 years ago

@antoineco are you still working on this? I moved this to the v0.23 milestone for now. If you don't have any spare cycles (just looking at the dates on this issue), would you mind unassigning yourself and perhaps moving the issue to the backlog? Thanks!

antoineco commented 3 years ago

I never managed to find a way to intentionally reproduce this issue, despite spending 3 entire days on it, so eventually I gave up. All I know is that metrics consistently break within 4 hours of a restart (and usually much earlier).

I think we should close this, otherwise it's gonna rot here.

matzew commented 3 years ago

@skonto do you have some cycles to look at this?

skonto commented 3 years ago

@matzew afaik this is the same as these:
https://github.com/knative/serving/issues/11019
https://github.com/knative/serving/issues/10421
https://github.com/knative/serving/issues/10906

I have also seen this before downstream.

Afaik it is fixed with: https://github.com/knative/pkg/commit/6c2f94a4d6a0328d63f5fa16f7890d8e4516a2d0. The fix is available in 0.22+

matzew commented 3 years ago

@skonto thanks a lot! Do you think it's worth backporting the fix to 0.21?

skonto commented 3 years ago

@matzew I guess it is, otherwise brokers may fail (although a restart fixes it, as I remember from the failure in my local env). I can create a PR.