open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

queue retry: support tiered queue #6331

Open hanjm opened 2 years ago

hanjm commented 2 years ago

Is your feature request related to a problem? Please describe. I'm adopting queue retry with filestorage in the collector, but I found a significant performance downgrade when enabling filestorage: every piece of OTel data is written to filestorage and then read back from it.

Describe the solution you'd like The current implementation uses either a memory queue or a disk queue. https://github.com/open-telemetry/opentelemetry-collector/blob/ee02bc7b2bbc71781cb96bcca79c9b387096dcf7/exporter/exporterhelper/queued_retry.go#L123-L126 https://github.com/open-telemetry/opentelemetry-collector/blob/ee02bc7b2bbc71781cb96bcca79c9b387096dcf7/exporter/exporterhelper/queued_retry.go#L155-L166 Should we implement a tiered queue, with the memory queue as the primary and the disk queue as the secondary?

If the memory queue fills up while the export endpoint is down, the disk queue will be used.
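
A minimal sketch of the proposed tiering, using illustrative names rather than the actual exporterhelper types: items go to the in-memory primary queue first and spill over to the disk-backed secondary queue only when the primary is full.

```go
// Package tieredqueue is a sketch of the proposal; the Queue interface and
// error below are stand-ins, not the collector's real internal API.
package tieredqueue

import "errors"

// ErrQueueFull is returned by Offer when a queue cannot accept more items.
var ErrQueueFull = errors.New("queue is full")

// Queue is a minimal stand-in for the collector's queue abstraction.
type Queue interface {
	Offer(item []byte) error
	Poll() ([]byte, bool)
}

// TieredQueue writes to the in-memory primary first and spills to the
// persistent secondary only on overflow, so disk is touched only when the
// exporter cannot keep up (e.g. the endpoint is down).
type TieredQueue struct {
	Primary   Queue // memory queue
	Secondary Queue // disk/filestorage-backed queue
}

func (q *TieredQueue) Offer(item []byte) error {
	err := q.Primary.Offer(item)
	if errors.Is(err, ErrQueueFull) {
		return q.Secondary.Offer(item)
	}
	return err
}

// Poll drains the primary queue before reading back anything spilled to disk.
func (q *TieredQueue) Poll() ([]byte, bool) {
	if item, ok := q.Primary.Poll(); ok {
		return item, true
	}
	return q.Secondary.Poll()
}
```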

Describe alternatives you've considered

Additional context I know NSQ uses a tiered queue.

jefchien commented 1 year ago

Please assign this to me. I was going to open a similar issue and have already written something which defaults to the current behavior (one queue if enabled), but is configurable and can have a separate "backlog" queue enabled. It also handles sending overflow requests between the two.
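
For illustration, the configuration for something like this might look roughly as follows; the field names and `mapstructure` tags are hypothetical, not a shipped API. Leaving `backlog` unset keeps today's single-queue behavior.

```go
package exporterqueue

// QueueConfig keeps the current single-queue behavior when Backlog is nil.
type QueueConfig struct {
	Enabled      bool `mapstructure:"enabled"`
	NumConsumers int  `mapstructure:"num_consumers"`
	QueueSize    int  `mapstructure:"queue_size"`
	// Backlog optionally enables a secondary persistent queue that receives
	// requests when the primary overflows or drains on shutdown.
	Backlog *BacklogConfig `mapstructure:"backlog"`
}

// BacklogConfig configures the secondary, disk-backed queue.
type BacklogConfig struct {
	QueueSize int    `mapstructure:"queue_size"`
	StorageID string `mapstructure:"storage"` // storage extension to persist to
}
```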

swiatekm commented 1 year ago

This setup compromises on durability for performance, right? Items in the main queue will be lost if the collector is suddenly killed. I think this is fine, but should be documented.

jefchien commented 1 year ago

If the collector is shut down gracefully, then the items in the main queue (assuming the main queue is the in-memory one) will drain and get sent to the secondary backlog queue, which will write them to disk. You're correct, though, that it is a compromise.
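
As a sketch of that drain step, reusing the illustrative `TieredQueue` type from the earlier sketch (assumed names, not actual collector code):

```go
// Shutdown drains the in-memory primary into the persistent secondary so a
// graceful shutdown does not drop queued data.
func (q *TieredQueue) Shutdown() error {
	for {
		item, ok := q.Primary.Poll()
		if !ok {
			return nil // primary is empty; anything left is already on disk
		}
		if err := q.Secondary.Offer(item); err != nil {
			return err // e.g. the secondary queue itself is full
		}
	}
}
```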

dmitryax commented 11 months ago

Instead of losing durability, maybe we can introduce a configurable memory buffer in the persistent queue for now? We will still be writing into the storage before accepting the data, but at least we won't be reading/unmarshalling from the persistent storage while the memory buffer has capacity. It should improve the performance. We'll see if that is enough...

cc @jefchien @swiatekm-sumo

jefchien commented 10 months ago

Is the idea that the memory buffer would serve as a copy of the head chunk of the persistent queue, and we'd need to synchronize the deletes as the requests are read off the buffer? That certainly helps with durability, but I still see the benefit in having the option to configure the two queue types separately if we want to prevent writes to disk altogether during normal operation. Alternatively, we could make that configurable in the persistent queue as well, but it doesn't seem like a natural fit for the persistent queue.
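
Putting the two comments above together, a rough sketch of a persistent queue with a write-through memory buffer; names and interfaces are assumptions, not the actual implementation. Every request is still persisted before it is accepted, reads are served from the memory copy while the buffer mirrors the head of the on-disk queue, and the on-disk head is deleted in sync.

```go
package persistentqueue

import (
	"context"
	"sync"
)

// Storage is a stand-in for the storage extension client.
type Storage interface {
	Append(ctx context.Context, value []byte) error
	ReadHead(ctx context.Context) ([]byte, error)
	DeleteHead(ctx context.Context) error
}

// BufferedPersistentQueue persists every request but keeps memory copies of
// the oldest pending items so the hot path can skip the storage read.
type BufferedPersistentQueue struct {
	mu      sync.Mutex
	buffer  [][]byte // memory copies mirroring the head of the on-disk queue
	skipped int      // persisted items that have no memory copy
	bufCap  int      // configurable buffer capacity
	storage Storage
}

// Offer writes to storage first (durability is unchanged), then keeps a
// memory copy while the buffer still mirrors the head of the on-disk queue.
func (q *BufferedPersistentQueue) Offer(ctx context.Context, req []byte) error {
	if err := q.storage.Append(ctx, req); err != nil {
		return err
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.skipped == 0 && len(q.buffer) < q.bufCap {
		q.buffer = append(q.buffer, req)
	} else {
		// Buffering this item would break ordering with items already on disk.
		q.skipped++
	}
	return nil
}

// Poll serves the memory copy when one exists and deletes the matching
// on-disk head to keep the two views in sync; otherwise it falls back to
// reading the raw bytes from storage (where the real implementation would
// also pay the unmarshal cost).
func (q *BufferedPersistentQueue) Poll(ctx context.Context) ([]byte, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.buffer) > 0 {
		req := q.buffer[0]
		q.buffer = q.buffer[1:]
		return req, q.storage.DeleteHead(ctx)
	}
	req, err := q.storage.ReadHead(ctx)
	if err != nil {
		return nil, err
	}
	if q.skipped > 0 {
		q.skipped--
	}
	return req, q.storage.DeleteHead(ctx)
}
```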

JustinMason commented 9 months ago

I'm interested in this issue. I am seeing poor performance under large spikes that slowly increase the queue size until the PVC fills up. Without the file storage queue, the collector handles the volume and keeps up.

swiatekm commented 9 months ago

I think last time we talked about this, we wanted to try optimizing the persistent queue first so we can avoid at least the deserialization cost for each request. Then we can see if we still need the additional complexity of multiple queues.