hanjm opened this issue 2 years ago
Please assign this to me. I was going to open a similar issue and have already written something which defaults to the current behavior (one queue if enabled), but is configurable and can have a separate "backlog" queue enabled. It also handles sending overflow requests between the two.
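For illustration, the configuration surface for this could look roughly like the sketch below — the field names are made up for this example and are not the actual exporterhelper API:

```go
// Hypothetical configuration shape for the two-queue setup described above;
// all names are illustrative, not the real API.
type QueueSettings struct {
	Enabled   bool `mapstructure:"enabled"`
	QueueSize int  `mapstructure:"queue_size"` // capacity of the primary in-memory queue

	// Backlog, if set, enables a secondary persistent queue that receives
	// requests overflowing from the primary queue (and drains the primary
	// queue on graceful shutdown).
	Backlog *BacklogSettings `mapstructure:"backlog"`
}

type BacklogSettings struct {
	StorageID string `mapstructure:"storage"`    // storage extension backing the queue
	QueueSize int    `mapstructure:"queue_size"` // capacity of the persistent queue
}
```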
This setup compromises on durability for performance, right? Items in the main queue will be lost if the collector is suddenly killed. I think this is fine, but should be documented.
If the collector is shut down gracefully, then the items in the main queue (assuming the main queue is the in-memory one) will drain and get sent to the secondary backlog queue, which will write them to disk. You're correct, though, that it is a compromise.
Instead of losing durability, maybe we can introduce a configurable memory buffer in the persistent queue for now? We would still be writing into the storage before accepting the data, but at least we wouldn't be reading/unmarshalling from the persistent storage while the memory buffer has capacity. That should improve the performance. We'll see if that is enough...
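Roughly something like the sketch below — Storage is a made-up stand-in for the real storage client, and blocking/capacity handling is omitted:

```go
import (
	"errors"
	"sync"
)

// Storage is a simplified FIFO view of the persistent store (hypothetical).
type Storage interface {
	Append(item []byte) error           // persist at the tail
	ReadAndDeleteHead() ([]byte, error) // read, unmarshal, and remove the head
	DeleteHead() error                  // remove the head without reading it
}

// bufferedPersistentQueue persists every item before accepting it, but keeps
// in-memory copies of the head of the queue so consumers can usually skip
// the read/unmarshal path.
type bufferedPersistentQueue struct {
	mu      sync.Mutex
	storage Storage
	buffer  [][]byte // in-memory copies of the head items, in order
	max     int      // buffer capacity
	gap     int      // items on disk that are not mirrored in the buffer
}

func (q *bufferedPersistentQueue) Produce(item []byte) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	// Durability first: reject the item if it cannot be persisted.
	if err := q.storage.Append(item); err != nil {
		return err
	}
	// Mirror the item in memory only while it stays contiguous with the
	// buffered head; otherwise it must be read back from disk later.
	if q.gap == 0 && len(q.buffer) < q.max {
		q.buffer = append(q.buffer, item)
	} else {
		q.gap++
	}
	return nil
}

func (q *bufferedPersistentQueue) Consume() ([]byte, error) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.buffer) > 0 {
		item := q.buffer[0]
		q.buffer = q.buffer[1:]
		// The on-disk copy still has to be removed, so buffer reads and
		// storage deletes must stay in sync.
		return item, q.storage.DeleteHead()
	}
	if q.gap == 0 {
		return nil, errors.New("queue is empty") // blocking omitted for brevity
	}
	// Slow path: the buffer was full when this item was produced.
	q.gap--
	return q.storage.ReadAndDeleteHead()
}
```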
cc @jefchien @swiatekm-sumo
Is the idea that the memory buffer would serve as a copy of the head chunk of the persistent queue, and we'd need to synchronize the deletes as the requests are read off the buffer? That certainly helps with durability, but I still see the benefit in having the option to configure the two queue types separately if we want to prevent writes to disk altogether during normal operation. Alternatively, we could make skipping the writes configurable in the persistent queue as well, but that behavior doesn't really fit what a persistent queue is for.
I'm interested in this issue. I am seeing poor performance under large spikes that slowly increase the queue size until the PVC fills up. Without the file storage queue, the collector handles the volume and keeps up.
I think last time we talked about this, we wanted to try optimizing the persistent queue first so we can avoid at least the deserialization cost for each request. Then we can see if we still need the additional complexity of multiple queues.
Is your feature request related to a problem? Please describe.
I'm adopting queue retry with the filestorage extension in the collector, but I found a significant performance degradation when enabling filestorage: every piece of telemetry data is written to and then read back from file storage.
Describe the solution you'd like
Currently the implementation uses either a memory queue or a disk queue. https://github.com/open-telemetry/opentelemetry-collector/blob/ee02bc7b2bbc71781cb96bcca79c9b387096dcf7/exporter/exporterhelper/queued_retry.go#L123-L126 https://github.com/open-telemetry/opentelemetry-collector/blob/ee02bc7b2bbc71781cb96bcca79c9b387096dcf7/exporter/exporterhelper/queued_retry.go#L155-L166 Should we implement a tiered queue, with the memory queue as the primary and the disk queue as the secondary?
If the memory queue is full while the export endpoint is down, the disk queue would be used.
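For example, the produce path could look roughly like this (just a sketch; the interface is a simplified version of the queue used by queued_retry, and the names are illustrative):

```go
// Request stands in for the exporter request type.
type Request interface{}

// ProducerConsumerQueue is a simplified queue interface; Produce reports
// whether the item was accepted.
type ProducerConsumerQueue interface {
	Produce(item Request) bool
}

// tieredQueue keeps the memory queue as the primary and only spills over
// to the persistent queue when the memory queue is full.
type tieredQueue struct {
	primary   ProducerConsumerQueue // bounded in-memory queue
	secondary ProducerConsumerQueue // persistent (disk) queue
}

func (q *tieredQueue) Produce(item Request) bool {
	// Fast path: no disk I/O while the memory queue has capacity.
	if q.primary.Produce(item) {
		return true
	}
	// Overflow: fall back to the durable disk queue instead of dropping.
	return q.secondary.Produce(item)
}
```

The consumer side would also need to drain the disk queue back into dispatch once the endpoint recovers, which is where most of the complexity would be.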
Describe alternatives you've considered
Additional context
I know NSQ uses a tiered queue (an in-memory channel that overflows to a disk-backed queue).