ZachTB123 opened this issue 2 years ago
I'm not sure this is good default behaviour. Persistent queue is often used to buffer data beyond memory capacity, so it might be larger and take longer to drain. It's not a great user experience to send a SIGTERM to the collector process and then wait for 10 minutes for it to exit, and most orchestrators will typically follow up with a SIGKILL a short time afterwards. Maybe it's not so bad if we log a warning about it?
I'm not sure this is good default behaviour.
Yeah, I wouldn't necessarily expect this to be the default behavior but rather a configurable option. Those who do want this functionality would have to enable it in their exporter settings and make the necessary changes on their deployment to ensure their pods stay up long enough to get through the queue, e.g. by extending terminationGracePeriodSeconds. And to your point, probably log a message that this is going to happen.
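For illustration, one hypothetical shape such an opt-in setting could take, sketched in Go. None of these names exist in the collector today; the point is only that draining would be gated by an explicit flag and bounded by a timeout that stays below the pod's terminationGracePeriodSeconds:

```go
package config

import "time"

// DrainSettings is a sketch of what an opt-in "drain on shutdown" knob could
// look like; neither the type nor the field names are an existing collector API.
type DrainSettings struct {
	// DrainOnShutdown keeps consuming queued items during Shutdown instead of
	// returning as soon as the stop signal is received.
	DrainOnShutdown bool `mapstructure:"drain_on_shutdown"`
	// DrainTimeout bounds how long Shutdown may block while draining, so the
	// process still exits before the orchestrator escalates SIGTERM to SIGKILL.
	// It should stay below the pod's terminationGracePeriodSeconds.
	DrainTimeout time.Duration `mapstructure:"drain_timeout"`
}
```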
If that is the case, why not use an in-memory queue?
If that is the case, why not use an in-memory queue?
We still want the benefits of persistent queueing over in-memory queueing. We are in the process of evaluating replacing an existing telemetry pipeline with OpenTelemetry, and I am open to feedback around what our approach should be. The two requirements we need to meet are: minimal to no data loss, and data delivery to the backends in a timely fashion, i.e. < 10 seconds. The pipeline we want to replace handles 30+ TB of just logs per day and is able to handle backend (Splunk, etc.) outages for 7 days (data can be retried for up to 7 days with no data loss). We want to replicate that with OpenTelemetry.

We could use an in-memory queue, but with the volume of data we are expecting, I'm worried about scenarios where pods are killed because they are OOM (or some other crash) and we lose all that data. This is why we are looking at persistent queueing. We'll make sure the volumes we mount are sized appropriately to handle extended backend outages, and we should be able to handle unexpected crashes because the data will be in the persistent queue.

With that approach, I think we can meet the requirement of minimal to no data loss, but now I'm thinking about the second requirement of timely delivery. Obviously there won't be timely delivery when backends have extended outages. My main focus is the scenario where there is no backend outage (data is flowing normally) but the collector is a little behind, so it won't get through all the data in the queue before it shuts down. One way this could happen is with autoscaling enabled and the number of collector replicas being scaled down. Ideally I want to get through all the data in the queue so I don't have to wait for it to be scaled up again to finish processing. The other issue I opened mentioned making sure your scaling rules don't scale down too early. We'll definitely make that change, but it would be nice to have a guarantee, like the in-memory queue provides, in case we get the scaling wrong.
There are lots of edge cases here when it comes to this: what do you do if the backend is down and you drain? What do you do if you have a Shutdown (for scaling up) and have lots of data in the queue?
So most likely you need some guards and controls on when to drain and when not to drain the queue (unless you are willing to wait indefinitely, which may cause issues with your orchestration system, etc.)
Have you thought about using a proper queue instead (Kafka) for this scenario? I feel that will simplify a lot of your cases.
Happy to review a proposal of the option/control you need, but my personal opinion is that a proper queue system fits this scenario better.
There are lots of edge cases here when it comes to this
Yep. I'm at the point where I need to consider these now, so apologies if I seem to dwell on these.
Have you thought about using a proper queue instead (Kafka) for this scenario? I feel that will simplify a lot of your cases.
Yeah, I've been thinking about this. Can you expand your thoughts on this? Our current collector has an otlp receiver and then exports to the specific backends. Are you suggesting we add a queue in between? So otlp receiver, export to queue, receive from queue, export to backends? Or are you suggesting creating a storage extension for a queue that can be used with persistent queuing? I like the idea of a queuing system. I just need to ensure minimal data loss if the queue itself goes down, since that is just another hop (hold onto data long enough for it to come back up).
Yeah, I've been thinking about this. Can you expand your thoughts on this? Our current collector has an otlp receiver and then exports to the specific backends. Are you suggesting we add a queue in between? So otlp receiver, export to queue, receive from queue, export to backends? Or are you suggesting creating a storage extension for a queue that can be used with persistent queuing? I like the idea of a queuing system. I just need to ensure minimal data loss if the queue itself goes down, since that is just another hop (hold onto data long enough for it to come back up).
I imagine it'd be something like the following if you're using Kafka:
otlpreceiver -> processors -> kafkaexporter -> Kafka -> kafkareceiver -> exporters without queues.
As I said earlier, I don't think the current API for storage clients is expressive enough to write a performant queue, even if the underlying store had all the necessary semantics.
As I said earlier, I don't think the current API for storage clients is expressive enough to write a performant queue, even if the underlying store had all the necessary semantics.
@swiatekm-sumo please open an issue, discuss with @djaglowski, and change the Storage API to allow that.
As I said earlier, I don't think the current API for storage clients is expressive enough to write a performant queue, even if the underlying store had all the necessary semantics.
@swiatekm-sumo please open an issue, discuss with @djaglowski, and change the Storage API to allow that.
Are we sure we want that? My comment was in the context of queues used by multiple processes. The Storage API exposes the storage as a transactional KV store, which I think would require some kind of busy wait for a multi-process queue even if the underlying store supported concurrent access from multiple processes. In order to make this efficient, I think we'd need either a locking API, or an explicit data type with atomic push/pop semantics like Redis has with its LIST. Do we really want to require this of all storage extensions? WDYT @djaglowski?
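To make the trade-off concrete, a rough Go sketch of the two options mentioned above. The existing Client interface is abridged, and both extension interfaces are hypothetical, not part of the current storage extension API:

```go
package storage

import "context"

// Client is the existing transactional KV surface, abridged.
type Client interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}

// Option 1 (hypothetical): an advisory lock so multiple collector processes
// can coordinate access to a shared queue without busy-waiting.
type LockingClient interface {
	Client
	Lock(ctx context.Context, name string) (release func(), err error)
}

// Option 2 (hypothetical): an explicit list type with atomic push/pop,
// similar to Redis LPUSH/BRPOP, so a multi-process queue needs no polling.
type ListClient interface {
	Client
	Push(ctx context.Context, list string, value []byte) error
	// Pop blocks until an item is available or ctx is cancelled.
	Pop(ctx context.Context, list string) ([]byte, error)
}
```

Either variant would have to be implemented by every storage extension, which is the cost being questioned here.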
I think we should discuss this in a separate issue, since this issue is not about that :)
Yeah, I've been thinking about this. Can you expand your thoughts on this? Our current collector has an otlp receiver and then exports to the specific backends. Are you suggesting we add a queue in between? So otlp receiver, export to queue, receive from queue, export to backends? Or are you suggesting creating a storage extension for a queue that can be used with persistent queuing? I like the idea of a queuing system. I just need to ensure minimal data loss if the queue itself goes down, since that is just another hop (hold onto data long enough for it to come back up).
@ZachTB123 what I would do is this: receive OTLP/HTTP (gRPC, whatever), then export in OTLP format to Kafka as soon as possible (meaning without any queuing or anything). After that, on the consumer side, you have a collector with a Kafka receiver that gets an entry from the queue, processes it, and exports it to the backend (without any queue); if it gets a success back, it acks the message in Kafka.
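A rough Go sketch of that consumer-side model, using placeholder interfaces (none of these are real collector or Kafka client APIs): the offset is committed only after the downstream export succeeds, so an unacknowledged message is redelivered rather than lost.

```go
package kafkapipeline

import "context"

// Placeholder interfaces standing in for a real Kafka client and the exporter.
type Message interface {
	Payload() []byte
	Ack() error // commit the offset for this message
}

type Consumer interface {
	Next(ctx context.Context) (Message, error)
}

type Exporter interface {
	Export(ctx context.Context, payload []byte) error
}

// run acks a message only after the backend accepted it, so data stays in
// Kafka (and gets redelivered) if the export fails or the pod dies mid-flight.
func run(ctx context.Context, c Consumer, e Exporter) error {
	for {
		msg, err := c.Next(ctx)
		if err != nil {
			return err // includes ctx cancellation on shutdown
		}
		if err := e.Export(ctx, msg.Payload()); err != nil {
			// No ack: the message will be redelivered after a restart or
			// picked up by another replica in the consumer group.
			continue
		}
		if err := msg.Ack(); err != nil {
			return err
		}
	}
}
```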
I have removed this from Collector 1.0. I don't think this needs to be part of it: we are not looking to change the default behavior, and it seems like something we can add under a configuration flag.
I would like to ensure that if I am using a persistent queue, that queue is drained on shutdown. #5110 found issues with the bounded memory queue because it immediately returned once it read from the stop chan rather than finishing consuming the remaining items in the queue. It appears that the persistent queue has the same issue, since it returns here when the stop chan is read? Assuming that my assumption is true, could the persistent queue be updated to drain its items on shutdown?
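For reference, the behavioural difference being asked for, sketched in Go. The queue interface below is a simplification for illustration, not the collector's actual internal API; the idea is that once the stop channel fires, the consumer keeps popping until the queue is empty or a grace period expires, instead of returning immediately.

```go
package persistentqueue

import "time"

// queue is a simplified stand-in for the collector's persistent queue.
type queue interface {
	Size() int
	Poll() (item []byte, ok bool)
}

// drainOnShutdown would run after the stop channel is read: instead of
// returning right away, keep consuming until the queue is empty or the grace
// period runs out, so the process still exits before SIGTERM becomes SIGKILL.
func drainOnShutdown(q queue, consume func([]byte), grace time.Duration) {
	deadline := time.Now().Add(grace)
	for q.Size() > 0 && time.Now().Before(deadline) {
		if item, ok := q.Poll(); ok {
			consume(item)
		}
	}
}
```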