open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.13k stars 2.41k forks source link

otelcol-contrib file_storage does not recover gracefully upon potentially corrupted database #35899

Open briandavis-viz opened 1 month ago

briandavis-viz commented 1 month ago

Component(s)

extension/storage/filestorage

What happened?

Description

When there are issues with underlying a linux guest storage, mostly in scenarios where the hypervisor loses connectivity or experiences extremely high latency with the NAS/SAN, the database can be in a corrupted state, which requires completely deleting the filesstorage queue, to allow otel to function (or even start running) again.

Steps to Reproduce

While a VM is running, try to remove it's storage beneath it, potentially in the middle of otel writing to the bolt filestorage.

This may also be caused by machines being improperly shut down (like power loss.)

Expected Result

Some data will be lost, but the filestorage extension will recover gracefully and be able to resume running.

Actual Result

The otelcol service attempts to start, but fails to do so when loading the filestorage extension. This requires manual intervention to delete the specific filestorage file that was corrupted, and then to start the service.

Collector version

0.106.1

Environment information

Environment

OS: Ubuntu 22.04 Compiler(if manually compiled): N/A, using https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.106.1/otelcol-contrib_0.106.1_linux_amd64.deb

OpenTelemetry Collector configuration

# This is a significantly reduced config (removed additional log receivers, filters, etc.) but all the required components are here, and all the items related to the file_storage extension, queues, etc. is untouched.
# ---------------------------------------------------- #
# ---------------  EXTENSIONS ------------------------ #
# ---------------------------------------------------- #
extensions:
  file_storage/fingerprint:
    directory: /var/lib/otelcol/fingerprint
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /data/otelcol/compaction
      check_interval: 5m
      rebound_trigger_threshold_mib: 50
      rebound_needed_threshold_mib: 500

  file_storage/persistent_queue_storage:
    directory: /data/otelcol/persistent_queue_storage
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /data/otelcol/compaction
      check_interval: 5m
      rebound_trigger_threshold_mib: 50
      rebound_needed_threshold_mib: 500

# ---------------------------------------------------- #
# -------------  RECEIVERS --------------------------- #
# ---------------------------------------------------- #
receivers:
  filelog/syslog:
    include: "/var/log/syslog"
    start_at: beginning
    include_file_path: true
    max_log_size: 100MiB
    storage: file_storage/fingerprint

  prometheus:
    trim_metric_suffixes: true
    config:
      scrape_configs:
      - job_name: 'prometheus-node-exporter'
        scrape_interval: 1m
        static_configs:
        - targets: ['127.0.0.1:9100']

# ---------------------------------------------------- #
# ------------  PROCESSORS --------------------------- #
# ---------------------------------------------------- #
processors:
  batch:
    send_batch_size: 10000
    timeout: 5s

# ---------------------------------------------------- #
# -------------  EXPORTERS --------------------------- #
# ---------------------------------------------------- #
exporters:
  logging: null

# from all normal logs
  otlphttp/general:
    endpoint: https://foo.bar.com:443
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue_storage
      queue_size: 1000000000
      num_consumers: 2
    compression: gzip
    tls:
      insecure: false

# ---------------------------------------------------- #
# ---------------  SERVICE --------------------------- #
# ---------------------------------------------------- #
service:
  extensions:
    - "file_storage/fingerprint"
    - "file_storage/persistent_queue_storage"

  telemetry:
    logs:
      encoding: "json"
      disable_caller: true

  pipelines:
    logs/syslog:
      receivers:
        - "filelog/syslog"
      processors:
        - "batch"
      exporters:
        - "otlphttp/general"

    metrics:
      receivers:
        - "prometheus"
      processors:
        - "batch"
      exporters:
        - "otlphttp/general"

Log output

Oct 21 17:46:01 gw-0210027-02 systemd[1]: Started OpenTelemetry Collector Contrib.
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: time="2024-10-21T17:46:01+01:00" level=warning msg="DBUS_SESSION_BUS_ADDRESS envvar looks to be not set, this can lead to runaway dbus-daemon processes. T>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5634177,"msg":"Setting up own telemetry..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.563445,"msg":"OpenCensus bridge is disabled for Collector telemetry and will be removed in a future version, use --feature>
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5635238,"msg":"Serving metrics","address":":9103","metrics level":"Normal"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654778,"msg":"Starting otelcol-contrib...","Version":"0.106.1","NumCPU":6}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5654962,"msg":"Starting extensions..."}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655017,"msg":"Extension is starting...","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655134,"msg":"Extension started.","kind":"extension","name":"file_storage/persistent_queue_storage"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.565517,"msg":"Extension is starting...","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: {"level":"info","ts":1729529161.5655217,"msg":"Extension started.","kind":"extension","name":"file_storage/fingerprint"}
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: panic: assertion failed: Page expected to be: 242, but self identifies as 8316213599780548146
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: goroutine 1 [running]:
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt._assert(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:1387
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*page).fastCheck(0x7f795bd4c000, 0xf2)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/page.go:57 +0x1d9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).page(0xfd3f188?, 0x7412180?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:534 +0x7b
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x2, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:546 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPageInternal(0xc002450540, {0xc00225b3b0, 0x1, 0xa}, 0xc00353e020)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:555 +0xc8
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).forEachPage(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx.go:542
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc00187db80, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:83 +0x114
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket.func2({0x7f795bca6020?, 0xc00225b360?, 0xc00347fa78?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:110 +0x90
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Bucket).ForEachBucket(0x0?, 0xc00353e220)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/bucket.go:411 +0x96
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*Tx).checkBucket(0xc002450540, 0xc002450558, 0xc00353e428, 0xc00353e368, {0x9f50fa8, 0x10095f80}, 0xc002259b00)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/tx_check.go:108 +0x255
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).freepages(0xc002253448)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:1205 +0x225
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist.func1()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:417 +0xc5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).doSlow(0x3fc8bc0?, 0xc002253610?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         sync/once.go:74 +0xc2
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: sync.(*Once).Do(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         sync/once.go:65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.(*DB).loadFreelist(0xc002253448?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:413 +0x45
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.etcd.io/bbolt.Open({0xc00225b2c0, 0x45}, 0x180, 0xc00353e650)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.etcd.io/bbolt@v1.3.10/db.go:295 +0x430
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.newClient(0xc001fbab00, {0xc00225b2c0?, 0xc00353e7c0?}, 0x2540be400, 0xc003315600, 0xc0?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage@v0.106.1/client.go:53 +0xaf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage.(*localFileStorage).GetClient(0xc001bfd980, {0x0?, 0x0?}, 0x0?, {{{0xc003349010?, 0x0?}}, {0xc003349019?, 0x0?}}, {0x8d9233f, 0x4})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-contrib/extension/storage/filestorage@v0.106.1/extension.go:61 +0x3c5
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.toStorageClient({0x9f97628, 0x10095f80}, {{{0xc003243da0, 0xc}}, {0xc003243dad, 0x18}}, {0x9f40c68?, 0xc0034ae070?}, {{{0xc003349010, 0x8}}, ...}, ...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/persistent_queue.go:499 +0x106
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*persistentQueue[...]).Start(0xa02c220, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/persistent_queue.go:107 +0xa9
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start(0x9ee6610, {0x9f97628, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/consumers.go:31 +0x4c
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).Start(0xc0033680b0, {0x9f97628?, 0x10095f80?}, {0x9f40c68?, 0xc0034ae070?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/exporterhelper/queue_sender.go:103 +0x65
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).Start(0xc00337a160, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/exporter@v0.106.1/exporterhelper/common.go:322 +0xbf
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service/internal/graph.(*Graph).StartAll(0xc00336a180, {0x9f97628, 0x10095f80}, {0x9f40c68, 0xc0034ae070}, {0x9f71d50, 0xc001ecd110})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/service@v0.106.1/internal/graph/graph.go:425 +0x1cd
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/service.(*Service).Start(0xc003442000, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/service@v0.106.1/service.go:217 +0x309
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).setupConfigurationComponents(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/collector.go:218 +0x849
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.(*Collector).Run(0xc002dd97a0, {0x9f97628, 0x10095f80})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/collector.go:272 +0x55
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: go.opentelemetry.io/collector/otelcol.NewCommand.func1(0xc00244fb08, {0x8d9a719?, 0x7?, 0x8d9255f?})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         go.opentelemetry.io/collector/otelcol@v0.106.1/command.go:37 +0xa7
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).execute(0xc00244fb08, {0xc0001d4010, 0x1, 0x1})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:985 +0xaca
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).ExecuteC(0xc00244fb08)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:1117 +0x3ff
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: github.com/spf13/cobra.(*Command).Execute(0x916ac28?)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/spf13/cobra@v1.8.1/command.go:1041 +0x13
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.runInteractive({0x916ac28, {{0x8dc9f6c, 0xf}, {0x8e71b88, 0x1f}, {0x8d9a5a6, 0x7}}, 0x0, {{{0x0, 0x0, ...}, ...}}, ...})
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:53 +0x5d
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.run(...)
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main_others.go:10
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]: main.main()
Oct 21 17:46:01 gw-0210027-02 otelcol-contrib[1558919]:         github.com/open-telemetry/opentelemetry-collector-releases/contrib/main.go:46 +0x205
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Failed with result 'exit-code'.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: otelcol-contrib.service: Scheduled restart job, restart counter is at 5.
Oct 21 17:46:01 gw-0210027-02 systemd[1]: Stopped OpenTelemetry Collector Contrib.

Additional context

No response

github-actions[bot] commented 1 month ago

Pinging code owners: