serilog / serilog-sinks-opentelemetry

A Serilog OpenTelemetry Protocol (OTLP) sink
Apache License 2.0
117 stars 19 forks source link

Add fallback/file buffering when requests can't be sent #155

Open alsi-lawr opened 3 weeks ago

alsi-lawr commented 3 weeks ago

Is your feature request related to a problem? Please describe.

When the specified OTLP endpoint is unreachable or otherwise incapable of receiving the request, it would be great to have support for a fallback to file (or a custom sink). This will ensure that no data loss would occur in the event of a failure to export, as this sink is being used in critical auditing infrastructure.

Proposal:

Change the IExporter interface for exports on log service requests to return some kind of information about the success/failure state of the exports. This can then feed into the sink to make a decision about whether to reroute the logs to a secondary sink (or keep it as a filesystem-only fallback), or to continue with ignoring the response. Expose an option to configure either a filesystem fallback location or a secondary sink fallback.

Edit: upon looking into it, we'd also need to catch exceptions in the case of unreachable grpc endpoints.

Describe alternatives you've considered

Additional context

I'd be happy to have a crack at implementing this for the filesystem-only approach if there's nothing already in the works.

nblumhardt commented 3 weeks ago

Hi @alex-lawrence-conf, thanks for the ticket and PR!

The mechanism you're describing is very similar to what's being discussed in https://github.com/serilog/serilog/issues/1791#issuecomment-1884134585 - just a sketch, but the central idea is to implement fallbacks using shared infrastructure that multiple sinks could benefit from.

The surface syntax for the most common usage would probably end up being something like:

    .WriteTo.FallbackChain(wt => wt.OpenTelemetry(...), wt => wt.File(...))

(Naming TBC.)

In this scenario, the sink configured by WriteTo.OpenTelemetry would implement a Serilog-defined interface like ILoggingFailurePublisher, which the FallbackChain() method would dynamically check for and call SetFallbackListener(...) on when present (or throw otherwise).

There's a bit involved, and some help rolling this out would would be welcome, but I probably need to whip up a spike to get the ball rolling since there are multiple use cases we had in mind for the mechanism. I'll try to loop back in the next few days 🤞

alsi-lawr commented 3 weeks ago

Hi @alex-lawrence-conf, thanks for the ticket and PR!

The mechanism you're describing is very similar to what's being discussed in serilog/serilog#1791 (comment) - just a sketch, but the central idea is to implement fallbacks using shared infrastructure that multiple sinks could benefit from.

The surface syntax for the most common usage would probably end up being something like:

    .WriteTo.FallbackChain(wt => wt.OpenTelemetry(...), wt => wt.File(...))

(Naming TBC.)

In this scenario, the sink configured by WriteTo.OpenTelemetry would implement a Serilog-defined interface like ILoggingFailurePublisher, which the FallbackChain() method would dynamically check for and call SetFallbackListener(...) on when present (or throw otherwise).

There's a bit involved, and some help rolling this out would would be welcome, but I probably need to whip up a spike to get the ball rolling since there are multiple use cases we had in mind for the mechanism. I'll try to loop back in the next few days 🤞

Thanks @nblumhardt for the response. This mechanism sounds like the perfect solution for what I am looking for. I think the majority of what I've done here is easily generalisable by design and could rather trivially be extracted to a fallback chain in the way you describe.

I think from what I've seen making the fallback for this module, an appropriate design choice would be to have sinks opt in to the fallback by exposing their own extensions on a new fallback version of the type returned by WriteTo that contains that publisher instantiated already, so the consuming package can just inject that into the sink and do whatever logic it needs to determine the log state.

Either that, or we could just expose a result type on the sink emissions in a new IFailureLogSink interface using that enum as you've described, and then the fallback chain infrastructure in serilog can determine it's own retry protocol and pacing profile for the different methods.

Just a few design questions:

alsi-lawr commented 3 weeks ago

@nblumhardt a quick point about how this use case wouldn't be generalisable is that what I've implemented captures the exact request that would have been sent as either Protobuf or NDJson. We'd lose all of the open telemetry information captured and transformed if it just sends the original logevent down a chain external to this sink

nblumhardt commented 1 week ago

Hi @alsi-lawr; https://github.com/serilog/serilog/pull/2108 implements a spike of this; although it might take a bit of shuffling of code, hopefully the need to carry through some OTel-specific things like resource attributes can be worked around. Keen for some feedback if you have a chance to look in on it. Thanks for the nudge on this!