open-telemetry / prometheus-interoperability-spec

Workgroup for building Prometheus-OTLP interoperability for the OTEL Collector and Prometheus related discussions.
Apache License 2.0
41 stars 6 forks source link

Remote write compliance: TestRemoteWrite/otelcollector/RepeatedLabels #44

Closed rakyll closed 3 years ago

rakyll commented 3 years ago

The OpenTelemetry collector is not passing https://github.com/prometheus/compliance/tree/main/remote_write TestRemoteWrite/otelcollector/RepeatedLabels.

=== CONT  TestRemoteWrite/otelcollector/RepeatedLabels
    labels.go:58:
            Error Trace:    labels.go:58
                                        main_test.go:101
                                        main_test.go:65
            Error:          Should be true
            Test:           TestRemoteWrite/otelcollector/RepeatedLabels
            Messages:       found zero samples for up{job="test"} = 0
rakyll commented 3 years ago

Related to https://github.com/open-telemetry/wg-prometheus/issues/41.

Aneurysm9 commented 3 years ago

This continues to fail even with the up metric present:

=== CONT  TestRemoteWrite/otelcollector/RepeatedLabels
    labels.go:58:
                Error Trace:    labels.go:58
                                                        main_test.go:105
                                                        main_test.go:69
                Error:          Should be true
                Test:           TestRemoteWrite/otelcollector/RepeatedLabels
                Messages:       found samples for test{a="1"}, none expected
odeke-em commented 3 years ago

Please assign this issue to me

alolita commented 3 years ago

All yours @odeke-em

odeke-em commented 3 years ago

I think the Prometheus Remote Write compliance tests are wrong. I have written a sample test and used Prometheus with it and Prometheus is able to successfully get the datapoints.

main.go

package main

import (
    "context"
    "fmt"
    "net/http"
    "os"
    "os/signal"
    "sync/atomic"
    "time"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    i := uint64(1)

    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case <-time.After(27 * time.Second):
                atomic.AddUint64(&i, 1)
            }
        }
    }()

    handler := func(rw http.ResponseWriter, req *http.Request) {
        println("Scraped", time.Now().Unix())
        fmt.Fprintf(rw, `
# HELP test A gauge
# TYPE test gauge
test{a="1",a="1"} %.1f`, float64(atomic.LoadUint64(&i)))
    }

    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt)

    defer cancel()

    go func() {
        if err := http.ListenAndServe(":8855", http.HandlerFunc(handler)); err != nil {
            panic(err)
        }
    }()
    <-sigCh
}

Run it by go run main.go

prom.yaml

global:
  scrape_interval: 10s

  external_labels:
    monitor: 'otlc'

remote_write:

scrape_configs:
  - job_name: 'otlc'

    scrape_interval: 10s

    honor_labels: true

    static_configs:
        - targets: ['localhost:8855']

Run it by prometheus --config.file=prom.yaml --log.level=debug

Viewed on Prometheus

Screen Shot 2021-06-09 at 1 45 02 PM Screen Shot 2021-06-09 at 1 45 16 PM

Synopsis

Prometheus itself doesn't reject these repeated labels, so why are the compliance tests rejecting it? Kindly cc-ing @tomwilkie @brian-brazil @rakyll @Aneurysm9 @alolita.

brian-brazil commented 3 years ago

Prometheus itself doesn't reject these repeated labels

If that's the case, then that's a bug in Prometheus. Are you sure you're using the latest version? We had to fix this at one point.

roidelapluie commented 3 years ago

We reject them, in TSDB:

https://github.com/prometheus/prometheus/blob/7bc11dcb06640ce22b4e15eb52b2c065f97cf79a/tsdb/head.go#L1374-L1376

roidelapluie commented 3 years ago

Your Prometheus screenshot seems from a release before Prometheus 2.14, because it is the "old" UI without a link to the new UI. That link was added in 2.14. I guess you are using a Prometheus release that was released before 2019-11-11.

Duplicate labels are rejected since 2.16.0, released on 2020-02-13.

odeke-em commented 3 years ago

Oh I see I see! Thank you Julien and Brian for the correction. I had installed Prometheus by a go get a while ago and hadn’t updated. Let me update and try again.

On Wed, Jun 9, 2021 at 2:39 PM Julien Pivotto @.***> wrote:

Your Prometheus screenshot seems from a release before Prometheus 2.14, because it is the "old" UI without a link to the new UI. That link was added in 2.14. I guess you are using a Prometheus release that was released before 2019-11-11.

Duplicate labels are rejected since 2.16.0, released on 2020-02-13.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-telemetry/wg-prometheus/issues/44#issuecomment-857621823, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABFL3V7XTZOWKR7LE6B4RRLTR5HFJANCNFSM43HFKZAA .

odeke-em commented 3 years ago

Yup, with the latest, I now see the issue

level=debug ts=2021-06-09T12:45:39.268Z caller=scrape.go:1459 component="scrape manager" scrape_pool=otlc target=http://localhost:8855/metrics msg="Unexpected error" series="test{a=\"1\",a=\"1\"}" err="label name \"a\" is not unique: invalid sample"
level=debug ts=2021-06-09T12:45:39.268Z caller=scrape.go:1247 component="scrape manager" scrape_pool=otlc target=http://localhost:8855/metrics msg="Append failed" err="label name \"a\" is not unique: invalid sample"

Thank you Brian and Julien!

odeke-em commented 3 years ago

This issue was a red herring, the problem is actually in the Prometheus receiver and I've filed https://github.com/open-telemetry/opentelemetry-collector/issues/3407 and am mailing a PR soon after the Prometheus Working Group meeting that's currently ongoing.