open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.13k stars 2.39k forks source link

Simplifying stateful behavior #35013

Open VihasMakwana opened 2 months ago

VihasMakwana commented 2 months ago

Component(s)

extension/storage, extension/storage/filestorage, receiver/filelog

Is your feature request related to a problem? Please describe.

Currently, the collector operates in a stateless mode by default, with stateful components storing offsets in memory. However, stateful components should persist their state during shutdown if a storage extension is available.

At present, enabling stateful behavior involves a somewhat lengthy process:

  1. Adding a filestorage entry to the extensions stanza in the configuration.
  2. Including filestorage under service::extensions.
  3. Ensuring that the storage directory exists.
  4. Adding storage: file_storage/xyz to individual components.

It would be beneficial to simplify this process by introducing a feature gate or a single configuration option. With this approach, users could enable stateful mode with a single setting, and the necessary steps would be handled automatically, achieving the same effect as the manual steps described.

Describe the solution you'd like

Here’s one way to tackle the enhancement:

This is a high-level overview, and I expect it will require multiple pull requests to implement.

Describe alternatives you've considered

Please share if you have any thoughts!

Additional context

My solution would require https://github.com/open-telemetry/opentelemetry-collector/pull/11046 to be merged or a similar workaround to be implemented to combine lists in config.

github-actions[bot] commented 2 months ago

Pinging code owners:

VihasMakwana commented 2 months ago

any thoughts over this @djaglowski @atoulme ?

djaglowski commented 2 months ago

I think this is an issue which could be considered by the core maintainers. I will note that at one point the filelog receiver would automatically use a storage extension if exactly one were present, but this caused problems, e.g. adding a second storage extension would invalidate the config for an unrelated component.

VihasMakwana commented 2 months ago

I think this is an issue which could be considered by the core maintainers.

I see. @open-telemetry/collector-maintainers can you take a look here and share your thoughts?

I will note that at one point the filelog receiver would automatically use a storage extension is exactly one were present, but this caused problems, e.g. adding a second storage extension would invalidate the config for an unrelated component.

Would you happen to remember when this happened and around which collector version it was? I'll look it up.

VihasMakwana commented 2 months ago

@open-telemetry/collector-maintainers @open-telemetry/collector-contrib-maintainers can you take a look here and share your thoughts?

mx-psi commented 1 month ago

Would having something like the template provider described here https://github.com/open-telemetry/opentelemetry-collector/issues/8372 help? I would prefer to go with a more generic solution like this rather than a one-off thing if possible

VihasMakwana commented 1 month ago

@mx-psi Looking at the issue, it would certainly help. Is that being actively worked upon?

mx-psi commented 1 month ago

I don't think so (Dan can correct me if I am wrong :))

djaglowski commented 1 month ago

I'm not actively working on the template provider/converter because there was no clear path forward (that I could understand).

cmacknz commented 1 month ago

IMO, the fundamental problem here is that users are required to learn the technical implementation details of particular receivers to be able to use them in the optimal way. This makes successfully using the collector harder than it needs to be.

In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts. Users discover this by observing duplicated log lines, and search for a way to stop it. This leads them to discovering that log collection requires check pointing, discovering the storage extension, deciding the best place to store the check points for every deployment, and then configuring it for every filelog receiver they have.

It would be vastly simpler if filelog were able to persist state by default as it did before and people were able to opt out of it when necessary and they were willing to accept the consequences. This is the "works optimally by default" configuration choice.

I also suspect that filelog is not or will not be the only receiver that suffers from this problem, log collection is not the only use case that benefits from being stateful by default.

I'm not sure templates help this, you might end up with many templates that include a filelog receiver wanting to depend on a not yet declared storage extension or duplicating storage extensions. It also wouldn't help users who don't use templates (if they one day exist as proposed).

The official Helm chart does a reasonable job of trying to simplify this via the storeCheckpoints option, but if you don't start there or aren't on Kubernetes this doesn't help you.

djaglowski commented 1 month ago

IMO, the fundamental problem here is that users are required to learn the technical implementation details of particular receivers to be able to use them in the optimal way. This makes successfully using the collector harder than it needs to be.

In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts. Users discover this by observing duplicated log lines, and search for a way to stop it. This leads them to discovering that log collection requires check pointing, discovering the storage extension, deciding the best place to store the check points for every deployment, and then configuring it for every filelog receiver they have.

It would be vastly simpler if filelog were able to persist state by default as it did before and people were able to opt out of it when necessary and they were willing to accept the consequences. This is the "works optimally by default" configuration choice.

I also suspect that filelog is not or will not be the only receiver that suffers from this problem, log collection is not the only use case that benefits from being stateful by default.

I'm not sure templates help this, you might end up with many templates that include a filelog receiver wanting to depend on a not yet declared storage extension or duplicating storage extensions. It also wouldn't help users who don't use templates (if they one day exist as proposed).

The official Helm chart does a reasonable job of trying to simplify this via the storeCheckpoints option, but if you don't start there or aren't on Kubernetes this doesn't help you.

I agree it would be nice if state were persisted by default and that templates do not solve the problem. Ultimately though, in order to add such a capability, someone needs to articulate a design that convincingly accounts for assumptions that must be made. In my opinion, implicit behaviors can easily make things worse for users if the rely on poor assumptions and/or introduce tricky edge cases. This is not a comprehensive list but I think we'd need to pin down the following:


Regarding the filelog receiver in particular, I must correct a couple things:

In the case of filelog, by default it is going to re-ingest the contents of every file it is able to read whenever the collector restarts.

This is not the default behavior. It depends on start_at, which by default is set to end, meaning that a collector restart without a storage extension will jump to the end of files. (Also not ideal, but in many cases less bad than reingesting all logs.)

It would be vastly simpler if filelog were able to persist state by default as it did before

To be clear, persistence was never enabled by default. It would automatically use a storage extension if exactly one was configured, but this caused other problems. e.g. Adding a storage extension for the sake of another component would break the filelog reciever without touching it.

cmacknz commented 1 month ago

This is not the default behavior. It depends on start_at, which by default is set to end, meaning that a collector restart without a storage extension will jump to the end of files. (Also not idea, but in many cases less bad than reingesting all logs.)

Thanks for correcting this, I am still building up context on the entire story here.

Just to be explicit on what the tradeoff with start_at is: always jumping to the end of the file avoids unnecessary duplication at the cost of potentially missing log lines if filelog has not reached end of file before restarting or if the location of end of file changes during the restart.

For this issue I am starting from these two principles:

  1. To get the optimal result when reading logs you need to remember how much of each file you have already read. Doing this by default requires the collector to be stateful with persistent storage by default.
  2. It must be easy for users to run a stateless collector with no persistent storage regardless of 1, as long as they accept the documented consequences of doing this for use cases like log collection. Arguably this is still fine if inverted, where the collector is stateless by default and it is simple to configure persistent storage where needed by default.

You could solve 1 by having an implicit default storage extension that exists at a default directory for each operating system. This does nothing for 2 though, and the questions below clearly highlight the usability problems with this. Particularly the case where a user removes their manually linked storage extension and implicitly falls back to the default one is not very nice.

  • What assumptions does a default storage mechanism make? Is it an implicitly defined storage extension, or some other mechanism? How does a user override the default behavior?
  • If a user has manually defined and linked a storage extension, then removes it, will they understand that storage is still enabled but via a different mechanism?
  • How would this be rolled out without affecting users who don't currently use storage?
  • Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?

I think we could get away from these problems by having a higher level configuration concept for whether the collector should have persistent storage by default. Let's use the fault tolerant log collection example configuration as a starting point:

receivers:
  filelog:
    include: [/var/log/busybox/simple.log]
    storage: file_storage/filelogreceiver

extensions:
  file_storage/filelogreceiver:
    directory: /var/lib/otelcol/file_storage/receiver
  file_storage/otlpoutput:
    directory: /var/lib/otelcol/file_storage/output

service:
  extensions: [file_storage/filelogreceiver, file_storage/otlpoutput]
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/custom]
      processors: []

exporters:
  otlp/custom:
    endpoint: http://0.0.0.0:4242
    sending_queue:
      storage: file_storage/otlpoutput

If the concept of storage was lifted out of the extension as a higher level property of the collector, this would simplify to:

receivers:
  filelog:
    include: [/var/log/busybox/simple.log]

# --- NEW ---
storage:
  enabled: true
  path: /var/lib/otelcol/file_storage
  create_directory: true
# -----------

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/custom]
      processors: []

exporters:
  otlp/custom:
    endpoint: http://0.0.0.0:4242

Doing this removes the need for users to know which receivers, processors, and exporters benefit from storage. They only need to decide whether to support persistent storage or not. Coming back to @djaglowski's questions:

  • What assumptions does a default storage mechanism make? Is it an implicitly defined storage extension, or some other mechanism? How does a user override the default behavior?

This proposal is an explicit declaration that the collector should have storage available for every component that benefits from it. There is an explicit definition of the location, and it can easily be overridden or turned off.

  • If a user has manually defined and linked a storage extension, then removes it, will they understand that storage is still enabled but via a different mechanism?

Yes, especially if storage.enabled is false by default then users have to opt in to this.

  • How would this be rolled out without affecting users who don't currently use storage?

Either storage.enabled defaults to false or users are required to turn it off to avoid being opted in to storage by default.

  • Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?

There would be a standard default directory per OS (TBD), and the simplest behavior is to attempt to create the directory if it does not exist, with the ability to turn this behavior off.

The underlying building block could still be the file_storage extension, where it has the effect of automatically creating one and wiring it up. It could alternatively just define and optionally create the path for storage, and actually creating the required files is done elsewhere. I'd prefer to get agreement that this should exist as a concept before digging too deeply into the exact implementation details. This definitely isn't a small change as proposed and impacts the core collector.

If there is a better place for this concept to live in the configuration, feel free to suggest it. I am not attached to the particular configuration location or name, more the simplification of how to make stateful components work correctly out of the box.

At an even higher level, I would like it if the collector only duplicated or lost logs if a user explicitly opted into that instead of being surprised by it.

VihasMakwana commented 1 month ago

@cmacknz I find your approach very compelling. I agree that internally it should use file_storage extension and and users should have the flexibility to override the storage at the receiver level for customization.

I’d love to get feedback from the @open-telemetry/collector-maintainers and @open-telemetry/collector-contrib-maintainers on this.


Does it require the file_storage extension? If so, which directory does it use on each OS? Do we assume that it exists, or require permissions to create it?

I think it should function without requiring an explicit declaration for file_storage, while still using it internally. For now, we can assume that it uses default directory mentioned here.

cmacknz commented 1 month ago

Thinking about this a bit more, a less drastic change to the configuration would be model the same functional behavior as a "default storage" extension or perhaps a "directory storage" extension. If present it would give components that naturally want to store state on disk a place to put it automatically (subject to reasonable ways to avoid having them collide). That would make it opt in and give users a way to get out of having to manually wire up storage extensions everywhere when it is used.