open-telemetry / opentelemetry-collector

Persistent queue fails to delete items from disk in some circumstances #8115

Open swiatekm opened 1 year ago

swiatekm commented 1 year ago

Describe the bug

The persistent queue pulls items from the on-disk database, but fails to delete them after they've been successfully sent. The result is that the DB ends up with orphaned data that never gets deleted.

Steps to reproduce

I haven't been able to reproduce this, but a user sent me their DB file, and I confirmed it contained items with indices less than the saved read index.

What version did you use?

0.77.0

What config did you use?

The relevant part is:

extensions:
  file_storage:
    directory: /var/lib/storage/otc
    timeout: 10s
    compaction:
      on_rebound: true
      directory: /tmp

Environment

The report I have is from AWS EKS 1.23 with the default storage driver.

Additional context

#7396 should technically help address similar problems, but this has happened quite consistently for this specific user under low-disk-space conditions, and the issue that change addresses should be fairly difficult to trigger.

I'm filing this in the hope of getting more reports of this behaviour, to help verify that it still occurs and get to the bottom of the issue.

swiatekm commented 1 year ago

@jpkrohling @djaglowski @frzifus you asked to be included during yesterday's SIG meeting.

johngmyers commented 2 months ago

We're seeing something that looks like this with splunk-otel-collector 0.102.0, which claims to include the changes from opentelemetry-collector 0.102.0. The persistent queue file is upwards of 100 GB, yet the Prometheus metrics show no backlog.

swiatekm commented 2 months ago

@johngmyers do you have access to the queue file? It's possible to inspect it with bbolt's CLI and check if there are keys outside of the queue's normal range.
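
For anyone who'd rather script that check, here's a rough Go sketch. It isn't an official tool, and it assumes the layout visible in the output further down: item keys stored as decimal strings, plus ri, wi, and di entries holding the read index, write index, and currently-dispatched item list as little-endian integers.

// orphancheck.go: list item keys that sit below the queue's read index.
// A sketch only; assumes the bbolt layout described above.
package main

import (
    "encoding/binary"
    "fmt"
    "log"
    "os"
    "strconv"

    bolt "go.etcd.io/bbolt"
)

func main() {
    // Usage: orphancheck <queue file> <bucket>, e.g. the per-exporter file
    // and the "default" bucket shown in the output below.
    db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        b := tx.Bucket([]byte(os.Args[2]))
        if b == nil {
            return fmt.Errorf("bucket %q not found", os.Args[2])
        }
        // Assumes the metadata keys exist; "ri" and "wi" hold little-endian
        // uint64 read/write indices.
        ri := binary.LittleEndian.Uint64(b.Get([]byte("ri")))
        wi := binary.LittleEndian.Uint64(b.Get([]byte("wi")))
        fmt.Printf("read index %d, write index %d\n", ri, wi)

        // Item keys below ri should only exist while they are being dispatched
        // (and then be listed under "di"); anything else there is orphaned.
        return b.ForEach(func(k, _ []byte) error {
            idx, perr := strconv.ParseUint(string(k), 10, 64)
            if perr != nil {
                return nil // skip the "ri"/"wi"/"di" metadata keys
            }
            if idx < ri {
                fmt.Printf("item %d is below the read index\n", idx)
            }
            return nil
        })
    })
    if err != nil {
        log.Fatal(err)
    }
}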

johngmyers commented 2 months ago

$ sudo ~/bbolt keys exporter_splunk_hec_platform_logs_logs default
254468
254469
254470
254471
254472
254473
[...]
263190
263191
263192
di
ri
wi
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default di
0300000004e203000000000005e203000000000006e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default ri
07e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default wi
1904040000000000

johngmyers commented 2 months ago

ri decodes to 254471, so there are three keys before that.

The numbers in di are 3, 254468, 0, 254469, 0, 254470, 0
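
Read as little-endian 64-bit integers instead, those values come out to ri = 254471, wi = 263193 (one past the highest key listed, 263192), and di = [254468, 254469, 254470]; the interleaved zeros above are just the upper halves of the 64-bit indices. That suggests the three keys below the read index are still tracked as in-flight rather than orphaned. A quick sketch of the decoding, assuming di is a 32-bit count followed by 64-bit item indices:

// decode.go: decode the hex that `bbolt get -format hex` printed above.
// A sketch; assumes ri/wi are little-endian uint64s and di is a uint32
// count followed by that many uint64 item indices.
package main

import (
    "encoding/binary"
    "encoding/hex"
    "fmt"
)

func main() {
    ri, _ := hex.DecodeString("07e2030000000000")
    wi, _ := hex.DecodeString("1904040000000000")
    di, _ := hex.DecodeString("0300000004e203000000000005e203000000000006e2030000000000")

    fmt.Println("ri:", binary.LittleEndian.Uint64(ri)) // 254471
    fmt.Println("wi:", binary.LittleEndian.Uint64(wi)) // 263193

    n := binary.LittleEndian.Uint32(di[:4])
    fmt.Print("di:")
    for i := uint32(0); i < n; i++ {
        fmt.Print(" ", binary.LittleEndian.Uint64(di[4+8*i:])) // 254468 254469 254470
    }
    fmt.Println()
}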

johngmyers commented 2 months ago

This particular pod is in CrashLoopBackOff because the container exits with code 2. The log lines preceding the exit report that it is over the hard memory limit.

johngmyers commented 2 months ago

I think the livenessProbe was configured too aggressively.

johngmyers commented 2 months ago

After giving the liveness probe an initial delay, both the lowest and highest key numbers have advanced. It's probably fair to say I'm not seeing any problem with the queue itself.