Open VihasMakwana opened 2 months ago
Pinging code owners:
receiver/filelog: @djaglowski
See Adding Labels via Comments if you do not have permissions to add labels yourself.
any thoughts over this @djaglowski @atoulme ?
I think this is an issue which could be considered by the core maintainers. I will note that at one point the filelog receiver would automatically use a storage extension if exactly one were present, but this caused problems, e.g. adding a second storage extension would invalidate the config for an unrelated component.
I think this is an issue which could be considered by the core maintainers.
I see. @open-telemetry/collector-maintainers can you take a look here and share your thoughts?
I will note that at one point the filelog receiver would automatically use a storage extension is exactly one were present, but this caused problems, e.g. adding a second storage extension would invalidate the config for an unrelated component.
Would you happen to remember when this happened and around which collector version it was? I'll look it up.
@open-telemetry/collector-maintainers @open-telemetry/collector-contrib-maintainers can you take a look here and share your thoughts?
Would having something like the template provider described here https://github.com/open-telemetry/opentelemetry-collector/issues/8372 help? I would prefer to go with a more generic solution like this rather than a one-off thing if possible
@mx-psi Looking at the issue, it would certainly help. Is that being actively worked upon?
I don't think so (Dan can correct me if I am wrong :))
I'm not actively working on the template provider/converter because there was no clear path forward (that I could understand).
IMO, the fundamental problem here is that users are required to learn the technical implementation details of particular receivers to be able to use them in the optimal way. This makes successfully using the collector harder than it needs to be.
In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts. Users discover this by observing duplicated log lines, and search for a way to stop it. This leads them to discovering that log collection requires check pointing, discovering the storage extension, deciding the best place to store the check points for every deployment, and then configuring it for every filelog receiver they have.
It would be vastly simpler if filelog were able to persist state by default as it did before and people were able to opt out of it when necessary and they were willing to accept the consequences. This is the "works optimally by default" configuration choice.
I also suspect that filelog is not or will not be the only receiver that suffers from this problem, log collection is not the only use case that benefits from being stateful by default.
I'm not sure templates help this, you might end up with many templates that include a filelog receiver wanting to depend on a not yet declared storage extension or duplicating storage extensions. It also wouldn't help users who don't use templates (if they one day exist as proposed).
The official Helm chart does a reasonable job of trying to simplify this via the storeCheckpoints option, but if you don't start there or aren't on Kubernetes this doesn't help you.
IMO, the fundamental problem here is that users are required to learn the technical implementation details of particular receivers to be able to use them in the optimal way. This makes successfully using the collector harder than it needs to be.
In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts. Users discover this by observing duplicated log lines, and search for a way to stop it. This leads them to discovering that log collection requires check pointing, discovering the storage extension, deciding the best place to store the check points for every deployment, and then configuring it for every filelog receiver they have.
It would be vastly simpler if filelog were able to persist state by default as it did before and people were able to opt out of it when necessary and they were willing to accept the consequences. This is the "works optimally by default" configuration choice.
I also suspect that filelog is not or will not be the only receiver that suffers from this problem, log collection is not the only use case that benefits from being stateful by default.
I'm not sure templates help this, you might end up with many templates that include a filelog receiver wanting to depend on a not yet declared storage extension or duplicating storage extensions. It also wouldn't help users who don't use templates (if they one day exist as proposed).
The official Helm chart does a reasonable job of trying to simplify this via the storeCheckpoints option, but if you don't start there or aren't on Kubernetes this doesn't help you.
I agree it would be nice if state were persisted by default and that templates do not solve the problem. Ultimately though, in order to add such a capability, someone needs to articulate a design that convincingly accounts for assumptions that must be made. In my opinion, implicit behaviors can easily make things worse for users if the rely on poor assumptions and/or introduce tricky edge cases. This is not a comprehensive list but I think we'd need to pin down the following:
file_storage
extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?Regarding the filelog receiver in particular, I must correct a couple things:
In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts.
This is not the default behavior. It depends on start_at
, which by default is set to end
, meaning that a collector restart without a storage extension will jump to the end of files. (Also not ideal, but in many cases less bad than reingesting all logs.)
It would be vastly simpler if filelog were able to persist state by default as it did before
To be clear, persistence was never enabled by default. It would automatically use a storage extension if exactly one was configured, but this caused other problems. e.g. Adding a storage extension for the sake of another component would break the filelog reciever without touching it.
This is not the default behavior. It depends on start_at, which by default is set to end, meaning that a collector restart without a storage extension will jump to the end of files. (Also not idea, but in many cases less bad than reingesting all logs.)
Thanks for correcting this, I am still building up context on the entire story here.
Just to be explicit on what the tradeoff with start_at
is: always jumping to the end of the file avoids unnecessary duplication at the cost of potentially missing log lines if filelog has not reached end of file before restarting or if the location of end of file changes during the restart.
For this issue I am starting from these two principles:
You could solve 1 by having an implicit default storage extension that exists at a default directory for each operating system. This does nothing for 2 though, and the questions below clearly highlight the usability problems with this. Particularly the case where a user removes their manually linked storage extension and implicitly falls back to the default one is not very nice.
- What assumptions does a default storage mechanism make? Is it an implicitly defined storage extension, or some other mechanism? How does a user override the default behavior?
- If a user has manually defined and linked a storage extension, then removes it, will they understand that storage is still enabled but via a different mechanism?
- How would this be rolled out without affecting users who don't currently use storage?
- Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?
I think we could get away from these problems by having a higher level configuration concept for whether the collector should have persistent storage by default. Let's use the fault tolerant log collection example configuration as a starting point:
receivers:
filelog:
include: [/var/log/busybox/simple.log]
storage: file_storage/filelogreceiver
extensions:
file_storage/filelogreceiver:
directory: /var/lib/otelcol/file_storage/receiver
file_storage/otlpoutput:
directory: /var/lib/otelcol/file_storage/output
service:
extensions: [file_storage/filelogreceiver, file_storage/otlpoutput]
pipelines:
logs:
receivers: [filelog]
exporters: [otlp/custom]
processors: []
exporters:
otlp/custom:
endpoint: http://0.0.0.0:4242
sending_queue:
storage: file_storage/otlpoutput
If the concept of storage was lifted out of the extension as a higher level property of the collector, this would simplify to:
receivers:
filelog:
include: [/var/log/busybox/simple.log]
# --- NEW ---
storage:
enabled: true
path: /var/lib/otelcol/file_storage
create_directory: true
# -----------
service:
pipelines:
logs:
receivers: [filelog]
exporters: [otlp/custom]
processors: []
exporters:
otlp/custom:
endpoint: http://0.0.0.0:4242
Doing this removes the need for users to know which receivers, processors, and exporters benefit from storage. They only need to decide whether to support persistent storage or not. Coming back to @djaglowski's questions:
- What assumptions does a default storage mechanism make? Is it an implicitly defined storage extension, or some other mechanism? How does a user override the default behavior?
This proposal is an explicit declaration that the collector should have storage available for every component that benefits from it. There is an explicit definition of the location, and it can easily be overridden or turned off.
- If a user has manually defined and linked a storage extension, then removes it, will they understand that storage is still enabled but via a different mechanism?
Yes, especially if storage.enabled
is false
by default then users have to opt in to this.
- How would this be rolled out without affecting users who don't currently use storage?
Either storage.enabled
defaults to false
or users are required to turn it off to avoid being opted in to storage by default.
- Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?
There would be a standard default directory per OS (TBD), and the simplest behavior is to attempt to create the directory if it does not exist, with the ability to turn this behavior off.
The underlying building block could still be the file_storage
extension, where it has the effect of automatically creating one and wiring it up. It could alternatively just define and
optionally create the path for storage, and actually creating the required files is done elsewhere. I'd prefer to get agreement that this should exist as a concept before digging too deeply into the exact implementation details. This definitely isn't a small change as proposed and impacts the core collector.
If there is a better place for this concept to live in the configuration, feel free to suggest it. I am not attached to the particular configuration location or name, more the simplification of how to make stateful components work correctly out of the box.
At an even higher level, I would like it if the collector only duplicated or lost logs if a user explicitly opted into that instead of being surprised by it.
@cmacknz I find your approach very compelling. I agree that internally it should use file_storage
extension and and users should have the flexibility to override the storage at the receiver level for customization.
I’d love to get feedback from the @open-telemetry/collector-maintainers and @open-telemetry/collector-contrib-maintainers on this.
Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?
I think it should function without requiring an explicit declaration for file_storage
, while still using it internally.
For now, we can assume that it uses default directory mentioned here.
Thinking about this a bit more, a less drastic change to the configuration would be model the same functional behavior as a "default storage" extension or perhaps a "directory storage" extension. If present it would give components that naturally want to store state on disk a place to put it automatically (subject to reasonable ways to avoid having them collide). That would make it opt in and give users a way to get out of having to manually wire up storage extensions everywhere when it is used.
Component(s)
extension/storage, extension/storage/filestorage, receiver/filelog
Is your feature request related to a problem? Please describe.
Currently, the collector operates in a stateless mode by default, with stateful components storing offsets in memory. However, stateful components should persist their state during shutdown if a storage extension is available.
At present, enabling stateful behavior involves a somewhat lengthy process:
filestorage
entry to theextensions
stanza in the configuration.filestorage
underservice::extensions
.storage: file_storage/xyz
to individual components.It would be beneficial to simplify this process by introducing a feature gate or a single configuration option. With this approach, users could enable stateful mode with a single setting, and the necessary steps would be handled automatically, achieving the same effect as the manual steps described.
Describe the solution you'd like
Here’s one way to tackle the enhancement:
filestorage
extension.healtcheck
,pprof
,memorylimitter
, etc.This is a high-level overview, and I expect it will require multiple pull requests to implement.
Describe alternatives you've considered
Please share if you have any thoughts!
Additional context
My solution would require https://github.com/open-telemetry/opentelemetry-collector/pull/11046 to be merged or a similar workaround to be implemented to combine lists in config.