Open ZachTB123 opened 2 years ago
/cc @swiatekm-sumo @pmm-sumo ?
The file storage client uses bbolt underneath. One of its limitations is that the same database cannot be used by multiple processes. (I believe this is worth adding as a note to the README.)
The solution would be to configure the replicas so that each of them uses a separate path, or to use another type of storage extension, which should be possible after the recent updates made by @swiatekm-sumo. I haven't tested it with dbstorage but it could work (though the performance impact might be prohibitive). @swiatekm-sumo what do you think?
Another solution might be building in a mechanism which could avoid conflict in naming different replica storage files
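The first workaround above could look something like the sketch below. This is illustrative only: the directory path is made up, and the `sending_queue::storage` field name may differ depending on the collector version in use, so check the exporterhelper docs for your release.

```yaml
# Minimal sketch (untested): each replica gets its own file_storage
# directory so the bbolt databases never collide. The path shown here
# is an arbitrary example, not a required location.
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage/replica-0  # must be unique per replica

exporters:
  otlp:
    endpoint: example-backend:4317  # placeholder endpoint
    sending_queue:
      enabled: true
      storage: file_storage  # field name may vary by collector version

service:
  extensions: [file_storage]
```

The key point is only that `directory` differs between replicas; everything else stays identical across them.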
I don't think this is really bbolt's fault - we get a nice error because it doesn't allow concurrent access, but if it did, the collectors would clobber each other's data. Extensions give components storage clients based on those components' identity, which is what allows the data to persist over restarts, but the identity of the collector itself doesn't really enter into it, and if it did, we'd need to be very careful about determining it.
I think it's reasonable to expect that multiple storage extensions (be it from the same collector instance or from different ones) should have separate storage directories. We should document this better and try to have better error handling for it, though.
This makes @ZachTB123's use case awkward, and would require parametrising the storage locations with something like the Pod name, injected via the Downwards API. To be completely honest, doing it this way in Kubernetes feels like an antipattern - if you want persistent storage for a Deployment, you should just use a StatefulSet, and have a separate Volume for each Pod.
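To make the Downward API approach concrete, here is a rough sketch of what the StatefulSet variant could look like. All names (the StatefulSet, volume, and mount path) are hypothetical, and the `${POD_NAME}` expansion in the collector config relies on the collector's environment-variable substitution support:

```yaml
# Sketch (untested): a StatefulSet gives each Pod a stable name and its
# own PersistentVolumeClaim, so replicas never share a bbolt file.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: otel-collector
spec:
  serviceName: otel-collector
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otc-container
          image: otel/opentelemetry-collector-contrib:0.57.2
          env:
            - name: POD_NAME  # injected via the Downward API
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: queue
              mountPath: /var/lib/otelcol
  volumeClaimTemplates:
    - metadata:
        name: queue
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```

With a per-Pod volume from `volumeClaimTemplates`, parametrising the path is not strictly necessary, but if a shared volume were used instead, the collector config could set something like `directory: /var/lib/otelcol/${POD_NAME}` to keep the databases separate.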
My concern with tying replicas to specific volumes (whether through paths or a StatefulSet) is ensuring there is a way to completely drain the persistent queue before scaling down. For example, if there were an extended outage on the destination we are exporting to, each replica's volume would start to fill up. Once the export destination is available again, the data in the persistent queue should start to drain. If the number of replicas were to decrease during this time (through autoscaling, for example), I would like some guarantee that the data in that replica's volume is completely drained before the replica shuts down. I don't want to encounter a situation where data remains in that volume and we have to wait until the next time that replica comes up to drain the queue. In my situation I need to guarantee timely delivery of the data, and I can make no guarantees about the number of replicas for our collector (besides a minimum of one replica). This is why I was hoping we could have one volume shared by all replicas, to ensure the queue is drained no matter which replicas come and go from autoscaling.
Currently, the queue assumes it's the sole consumer of the storage it's using. Even if we were to implement a storage extension allowing concurrent access from different processes (for example a Redis client), I don't think the current storage API is expressive enough to implement a queue like that. In principle, I think it'd be possible with a completely separate queue implementation, without using the storage at all.
I'm aware of the scaling problem, but it hasn't been an issue for us in nearly half a year of using OT to monitor our prod infrastructure. In practice, if you're burning down the queue, your instances should have high resource consumption (as long as you're not I/O bound on the exporter) and an autoscaler shouldn't scale them down.
If you actually want to pursue the idea of a queue shareable by multiple collector processes, we should open a new issue for it, I think.
I created #5902 for my previously mentioned concerns around draining the persistent queue.
Hello. I see that the persistent sending queue has been in the alpha state for 2 years already. Is there a reason not to move it to beta?
I'm in the process of evaluating the persistent queueing functionality. I'm using the v0.57.2 release. I am unable to get persistent queuing to work when using multiple replicas of the collector. I currently have two different environments I'm testing this out in and I cannot get either to work. Both of these environments are expected to handle a large volume of telemetry so we have autoscaling set up.
Our first environment is using K8s. We are using the OpenTelemetry Helm chart and have autoscaling enabled. I created a PVC with the `ReadWriteMany` access mode and mounted it using `extraVolumeMounts`. `persistent_storage_enabled` is set to `true` and the `directory` used by the `file_storage` extension matches the mounted volume path. When the number of replicas is increased, the new replicas that were spun up fail to start due to a timeout, which I assume is because they cannot get a lock on the file. Here is a snippet from the logs:

Our other environment uses Azure Container Apps. I mounted a file share from Azure Files using this approach. As soon as I scale up to multiple replicas, all replicas stop working and constantly restart. I see messages like the following:
or
My question is: is persistent queuing supposed to work with a collector with multiple replicas?