swiatekm opened this issue 1 year ago
@jpkrohling @djaglowski @frzifus you asked to be included during yesterday's SIG meeting.
We're seeing something that looks like this with splunk-otel-collector 0.102.0, which claims to include the changes from opentelemetry-collector 0.102.0. The persistent queue file is upwards of 100 GB, yet the Prometheus metrics show no backlog.
@johngmyers do you have access to the queue file? It's possible to inspect it with bbolt's CLI and check if there are keys outside of the queue's normal range.
$ sudo ~/bbolt keys exporter_splunk_hec_platform_logs_logs default
254468
254469
254470
254471
254472
254473
[...]
263190
263191
263192
di
ri
wi
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default di
0300000004e203000000000005e203000000000006e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default ri
07e2030000000000
$ sudo ~/bbolt get -format hex exporter_splunk_hec_platform_logs_logs default wi
1904040000000000
`ri` decodes to 254471, so there are three keys before that. The numbers in `di` are 3, 254468, 0, 254469, 0, 254470, 0.
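For anyone else reading these dumps: the values appear to be little-endian integers, which matches `ri` decoding to 254471. Below is a minimal Go sketch of the decoding; the layout of `di` as a uint32 count followed by uint64 indices (presumably the in-flight/dispatched items) is my reading of this particular dump, not something verified against the collector source.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

func main() {
	// Hex values copied from the bbolt output above.
	ri, _ := hex.DecodeString("07e2030000000000")
	wi, _ := hex.DecodeString("1904040000000000")
	di, _ := hex.DecodeString("0300000004e203000000000005e203000000000006e2030000000000")

	fmt.Println("read index: ", binary.LittleEndian.Uint64(ri)) // 254471
	fmt.Println("write index:", binary.LittleEndian.Uint64(wi)) // 263193

	// Assumed layout of di: a uint32 count, then that many little-endian uint64 indices.
	count := binary.LittleEndian.Uint32(di[:4])
	fmt.Println("dispatched count:", count) // 3
	for i := 0; i < int(count); i++ {
		fmt.Println("dispatched index:", binary.LittleEndian.Uint64(di[4+i*8:12+i*8])) // 254468, 254469, 254470
	}
}
```

Read that way, `di` lists 254468-254470 as still dispatched, which lines up with the three keys sitting below the read index.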
This particular pod is in CrashLoopBackOff because the container keeps exiting with code 2. The log lines preceding the exit say it is over the hard memory limit. I think the `livenessProbe` was configured too aggressively.
After giving the liveness probe an initial delay, both the lowest and highest key numbers have advanced. It's probably fair to say I'm not seeing any problem with the queue itself.
Describe the bug
The persistent queue pulls items from the on-disk database, but fails to delete them after they've been successfully sent. The result is that the db has orphaned data in it that never gets deleted.
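For context, here's a simplified model of the bookkeeping involved (an illustration only, not the collector's actual implementation; the field comments mirror the `ri`/`wi`/`di` entries visible in the bbolt dump above). Items are appended at the write index, handed out from the read index into a dispatched list, and deleted from storage once processing finishes. If that last step is skipped, the read index has already moved on and the stored item becomes orphaned.

```go
// Simplified, illustrative model of the persistent queue bookkeeping.
// Not the collector's real code; it only shows where orphaned data can come from.
package queuemodel

type queue struct {
	readIndex  uint64            // next index to hand to a consumer ("ri")
	writeIndex uint64            // next free index ("wi")
	dispatched []uint64          // handed out but not yet acknowledged ("di")
	items      map[uint64][]byte // persisted items, keyed by index
}

func (q *queue) put(data []byte) {
	q.items[q.writeIndex] = data
	q.writeIndex++
}

func (q *queue) get() (uint64, []byte) {
	idx := q.readIndex
	q.readIndex++ // the read index advances as soon as the item is handed out
	q.dispatched = append(q.dispatched, idx)
	return idx, q.items[idx]
}

// onProcessingFinished is where the stored item must be deleted. If it never
// runs, or the delete fails silently, the item stays on disk forever, because
// get() will never revisit indices below readIndex.
func (q *queue) onProcessingFinished(idx uint64) {
	delete(q.items, idx)
	for i, d := range q.dispatched {
		if d == idx {
			q.dispatched = append(q.dispatched[:i], q.dispatched[i+1:]...)
			break
		}
	}
}
```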
Steps to reproduce
I haven't been able to reproduce this, but I have had a user send me their DB file, and have confirmed it contained items with indices less than the saved read index.
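If anyone else wants to check their own queue file, something like the sketch below should list items sitting below the saved read index. It uses go.etcd.io/bbolt and assumes what the dump earlier in this thread suggests: the bucket is named `default`, item keys are decimal strings, and `ri` is a little-endian uint64.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"strconv"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Usage: orphans <queue-db-file>
	db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("default"))
		if b == nil {
			return fmt.Errorf("no 'default' bucket in this file")
		}
		riBytes := b.Get([]byte("ri"))
		if len(riBytes) != 8 {
			return fmt.Errorf("unexpected read index value: %x", riBytes)
		}
		ri := binary.LittleEndian.Uint64(riBytes)

		orphans := 0
		c := b.Cursor()
		for k, _ := c.First(); k != nil; k, _ = c.Next() {
			// Item keys are decimal strings; this skips the ri/wi/di bookkeeping keys.
			idx, err := strconv.ParseUint(string(k), 10, 64)
			if err != nil {
				continue
			}
			if idx < ri {
				orphans++
				fmt.Println("item below read index:", idx)
			}
		}
		fmt.Printf("read index %d, %d item(s) below it\n", ri, orphans)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Note that indices still listed in `di` are expected to sit below the read index (they're in flight), so only items below `ri` that are not in `di` are actually orphaned.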
What version did you use?
0.77.0
What config did you use?
The relevant part is:
Environment
The report I have is from AWS EKS 1.23 with the default storage driver.
Additional context
#7396 should technically help address similar problems, but this has happened quite consistently for this specific user in low disk space conditions, and the issue that change addresses should be fairly difficult to trigger.
I'm filing this in the hope that I can get more reports of this behaviour to help verify that it still occurs, and hopefully get to the bottom of the issue.