grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

promtail: config reload fails #7734

Closed pschulten closed 1 year ago

pschulten commented 1 year ago

Describe the bug
With #7247 added in 2.7.0 (https://github.com/grafana/loki/blob/d3111bcaa7749ca53902c45f4856e28e2980c79d/CHANGELOG.md?plain=1#LL99C3-L99C52), config reload should be possible, but it fails.

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 2.7.0
  2. Started Promtail 2.7.0 with config:

    server:
      http_listen_port: 9080
      grpc_listen_port: 0
      reload: true

    positions:
      filename: /tmp/positions.yaml

    clients:

    scrape_configs:

Expected behavior
The config above is parsed correctly.

Environment:

Screenshots, Promtail config, or terminal output

Unable to parse config: /etc/promtail/promtail.yaml: yaml: unmarshal errors:
  line 4: field reload not found in type server.Config

Maybe related: https://github.com/grafana/loki/issues/6388
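The unmarshal error above comes from strict config parsing: Promtail rejects any key its config structs do not declare. A minimal sketch of that failure mode, using Go's stdlib encoding/json with DisallowUnknownFields as a stand-in for Promtail's YAML decoder (the struct and field names here are illustrative, not Promtail's actual server.Config):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serverConfig loosely mirrors the shape of Promtail's server block.
// It is a hypothetical stand-in, not the real server.Config.
type serverConfig struct {
	HTTPListenPort int `json:"http_listen_port"`
	GRPCListenPort int `json:"grpc_listen_port"`
}

// parseStrict decodes input and rejects any field the struct does not
// declare, the same failure mode as Promtail's strict config parsing.
func parseStrict(input string) error {
	dec := json.NewDecoder(strings.NewReader(input))
	dec.DisallowUnknownFields()
	var cfg serverConfig
	return dec.Decode(&cfg)
}

func main() {
	// "reload" is not a declared field, so strict decoding rejects it,
	// analogous to "field reload not found in type server.Config".
	err := parseStrict(`{"http_listen_port": 9080, "grpc_listen_port": 0, "reload": true}`)
	fmt.Println("unable to parse config:", err)
}
```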

liguozhong commented 1 year ago

hi, the reload field is enable_runtime_reload:

server:
  http_listen_port: 9080
  grpc_listen_port: 0
  enable_runtime_reload: true

liguozhong commented 1 year ago

https://grafana.com/docs/loki/latest/clients/promtail/configuration/

enable_runtime_reload

pschulten commented 1 year ago

thanks, that works. sorry :(

liuxuzxx commented 1 year ago

Describe the bug

  1. Started promtail version=2.9.0, branch=HEAD, revision=2feb64f69 in a VM

  2. The config.yaml is:

    server:
      http_listen_port: 9080
      grpc_listen_port: 0
      enable_runtime_reload: true

    clients:
      - url: http://loki.cpaas.wxchina.com:30669/loki/api/v1/push

    positions:
      filename: ./positions.yaml

    target_config:
      sync_period: 10s

    scrape_configs:
      - job_name: alert-log
        static_configs:
          - targets:
              - localhost
            labels:
              node: liuxu-node
              job: alert-log
              app: alert-app
              type: alert
              __path__: /media/liuxu/data/component/promtail/alert.log
  3. Modify config.yaml and save:

    server:
      http_listen_port: 9080
      grpc_listen_port: 0
      enable_runtime_reload: true

    clients:
      - url: http://loki.cpaas.wxchina.com:30669/loki/api/v1/push

    positions:
      filename: ./positions.yaml

    target_config:
      sync_period: 10s

    scrape_configs:
      - job_name: alert-log
        static_configs:
          - targets:
              - localhost
            labels:
              node: liuxu-node
              job: alert-log
              app: alert-app-liuxu  # modified the app label from alert-app to alert-app-liuxu
              type: alert
              __path__: /media/liuxu/data/component/promtail/alert.log
  4. Execute a reload via curl:

    curl http://localhost:9080/reload
  5. The error output (I modified config.yaml with vim in the VM and saved it):

    
    panic: duplicate metrics collector registration attempted

    goroutine 209 [running]:
    github.com/prometheus/client_golang/prometheus.(*Registry).MustRegister(0x37fd4eb?, {0xc000560e40?, 0x1, 0xb?})
        /drone/src/vendor/github.com/prometheus/client_golang/prometheus/registry.go:405 +0x85
    github.com/grafana/loki/clients/pkg/promtail/wal.NewWatcherMetrics({0x40e50d0, 0xc0000c6a00})
        /drone/src/clients/pkg/promtail/wal/watcher_metrics.go:73 +0xaac
    github.com/grafana/loki/clients/pkg/promtail/client.NewManager(0x0?, {0x40c8600, 0xc000136e10}, {0x40c3880000000000, 0x2710, 0x0, 0x1, 0x0, 0x0, 0x0}, ...)
        /drone/src/clients/pkg/promtail/client/manager.go:61 +0x8f
    github.com/grafana/loki/clients/pkg/promtail.(*Promtail).reloadConfig(0xc000c1f2c0, 0xc0005f6000)
        /drone/src/clients/pkg/promtail/promtail.go:170 +0x8cb
    github.com/grafana/loki/clients/pkg/promtail.(*Promtail).reload(0xc000c1f2c0)
        /drone/src/clients/pkg/promtail/promtail.go:286 +0xaf
    github.com/grafana/loki/clients/pkg/promtail.(*Promtail).watchConfig(0xc000c1f2c0)
        /drone/src/clients/pkg/promtail/promtail.go:271 +0x3e9
    created by github.com/grafana/loki/clients/pkg/promtail.(*Promtail).Run
        /drone/src/clients/pkg/promtail/promtail.go:214 +0xcc
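The panic itself is standard prometheus/client_golang behavior: MustRegister panics when a collector with an already-registered name is registered a second time, which is what happens when a reload rebuilds the WAL watcher metrics against the long-lived registry. A stdlib-only sketch of that behavior (a hypothetical tiny registry, not client_golang itself):

```go
package main

import "fmt"

// registry is a minimal stand-in for prometheus.Registry: it tracks
// collector names and panics on duplicates, as MustRegister does.
type registry struct {
	seen map[string]bool
}

func newRegistry() *registry { return &registry{seen: map[string]bool{}} }

// MustRegister mimics prometheus's behavior: registering the same
// fully-qualified name twice panics.
func (r *registry) MustRegister(name string) {
	if r.seen[name] {
		panic("duplicate metrics collector registration attempted")
	}
	r.seen[name] = true
}

func main() {
	reg := newRegistry()
	reg.MustRegister("loki_wal_watcher_records_read_total")

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r)
		}
	}()
	// A config reload rebuilds the metrics and registers them again
	// against the same registry, which panics:
	reg.MustRegister("loki_wal_watcher_records_read_total")
}
```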

liuxuzxx commented 1 year ago

Is there a problem with my steps?

liuxuzxx commented 1 year ago

When I use promtail-2.7.0, the reload is OK!

The version detail information: version=HEAD-1b627d8, branch=HEAD, revision=1b627d880

liuxuzxx commented 1 year ago

The version : version=2.8.2, branch=HEAD, revision=9f809eda7 is OK!

liuxuzxx commented 1 year ago

The version: version=2.9.2, branch=HEAD, revision=a17308db6 is not OK!

liuxuzxx commented 1 year ago

The version: version=2.8.6, branch=HEAD, revision=990ac685e is OK!

liuxuzxx commented 1 year ago

I see that the 2.9.x version's watcher_metrics.go is:

func NewWatcherMetrics(reg prometheus.Registerer) *WatcherMetrics {
    m := &WatcherMetrics{
        recordsRead: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "records_read_total",
                Help:      "Number of records read by the WAL watcher from the WAL.",
            },
            []string{"id"},
        ),
        recordDecodeFails: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "record_decode_failures_total",
                Help:      "Number of records read by the WAL watcher that resulted in an error when decoding.",
            },
            []string{"id"},
        ),
        droppedWriteNotifications: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "dropped_write_notifications_total",
                Help:      "Number of dropped write notifications due to having one already buffered.",
            },
            []string{"id"},
        ),
        segmentRead: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "segment_read_total",
                Help:      "Number of segment reads triggered by the backup timer firing.",
            },
            []string{"id", "reason"},
        ),
        currentSegment: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "current_segment",
                Help:      "Current segment the WAL watcher is reading records from.",
            },
            []string{"id"},
        ),
        watchersRunning: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "running",
                Help:      "Number of WAL watchers running.",
            },
            nil,
        ),
    }

    if reg != nil {
        reg.MustRegister(m.recordsRead)
        reg.MustRegister(m.recordDecodeFails)
        reg.MustRegister(m.droppedWriteNotifications)
        reg.MustRegister(m.segmentRead)
        reg.MustRegister(m.currentSegment)
        reg.MustRegister(m.watchersRunning)
    }

    return m
}

but the main branch code is:

func NewWatcherMetrics(reg prometheus.Registerer) *WatcherMetrics {
    m := &WatcherMetrics{
        recordsRead: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "records_read_total",
                Help:      "Number of records read by the WAL watcher from the WAL.",
            },
            []string{"id"},
        ),
        recordDecodeFails: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "record_decode_failures_total",
                Help:      "Number of records read by the WAL watcher that resulted in an error when decoding.",
            },
            []string{"id"},
        ),
        droppedWriteNotifications: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "dropped_write_notifications_total",
                Help:      "Number of dropped write notifications due to having one already buffered.",
            },
            []string{"id"},
        ),
        segmentRead: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "segment_read_total",
                Help:      "Number of segment reads triggered by the backup timer firing.",
            },
            []string{"id", "reason"},
        ),
        currentSegment: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "current_segment",
                Help:      "Current segment the WAL watcher is reading records from.",
            },
            []string{"id"},
        ),
        watchersRunning: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Namespace: "loki",
                Subsystem: "wal_watcher",
                Name:      "running",
                Help:      "Number of WAL watchers running.",
            },
            nil,
        ),
    }

    // Collectors will be re-registered to registry if it's got reloaded
    // Reuse the old collectors instead of panicking out.
    if reg != nil {
        if err := reg.Register(m.recordsRead); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.recordsRead = are.ExistingCollector.(*prometheus.CounterVec)
            }
        }
        if err := reg.Register(m.recordDecodeFails); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.recordDecodeFails = are.ExistingCollector.(*prometheus.CounterVec)
            }
        }
        if err := reg.Register(m.droppedWriteNotifications); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.droppedWriteNotifications = are.ExistingCollector.(*prometheus.CounterVec)
            }
        }
        if err := reg.Register(m.segmentRead); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.segmentRead = are.ExistingCollector.(*prometheus.CounterVec)
            }
        }
        if err := reg.Register(m.currentSegment); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.currentSegment = are.ExistingCollector.(*prometheus.GaugeVec)
            }
        }
        if err := reg.Register(m.watchersRunning); err != nil {
            are := &prometheus.AlreadyRegisteredError{}
            if errors.As(err, are) {
                m.watchersRunning = are.ExistingCollector.(*prometheus.GaugeVec)
            }
        }
    }

    return m
}

liuxuzxx commented 1 year ago

How do we resolve the v2.9.x reload error? By waiting for a new version to be released? Waiting for the main branch fix to be merged into a v2.9.3 release?

liuxuzxx commented 1 year ago

@liguozhong Help me!

liuxuzxx commented 1 year ago

@hainenber Hi, you resolved the reload metrics-registration panic. When will the fix be merged and released in a new version?