Add fallback/file buffering when requests can't be sent

alsi-lawr commented 3 months ago

Is your feature request related to a problem? Please describe.

When the specified OTLP endpoint is unreachable or otherwise incapable of receiving the request, it would be great to have support for a fallback to file (or a custom sink). This will ensure that no data loss would occur in the event of a failure to export, as this sink is being used in critical auditing infrastructure.

Proposal:

Change the IExporter interface for exports on log service requests to return some kind of information about the success/failure state of the exports. This can then feed into the sink to make a decision about whether to reroute the logs to a secondary sink (or keep it as a filesystem-only fallback), or to continue with ignoring the response. Expose an option to configure either a filesystem fallback location or a secondary sink fallback.

Edit: upon looking into it, we'd also need to catch exceptions in the case of unreachable grpc endpoints.

Describe alternatives you've considered

Creating a resilient middleware that can receive the logs, however this still requires a network request.
Overriding the implementations myself to ensure a fallback is present.

Additional context

I'd be happy to have a crack at implementing this for the filesystem-only approach if there's nothing already in the works.

nblumhardt commented 3 months ago

Hi @alex-lawrence-conf, thanks for the ticket and PR!

The mechanism you're describing is very similar to what's being discussed in https://github.com/serilog/serilog/issues/1791#issuecomment-1884134585 - just a sketch, but the central idea is to implement fallbacks using shared infrastructure that multiple sinks could benefit from.

The surface syntax for the most common usage would probably end up being something like:

    .WriteTo.FallbackChain(wt => wt.OpenTelemetry(...), wt => wt.File(...))

(Naming TBC.)

In this scenario, the sink configured by WriteTo.OpenTelemetry would implement a Serilog-defined interface like ILoggingFailurePublisher, which the FallbackChain() method would dynamically check for and call SetFallbackListener(...) on when present (or throw otherwise).

There's a bit involved, and some help rolling this out would would be welcome, but I probably need to whip up a spike to get the ball rolling since there are multiple use cases we had in mind for the mechanism. I'll try to loop back in the next few days 🤞

alsi-lawr commented 3 months ago

Hi @alex-lawrence-conf, thanks for the ticket and PR!

The mechanism you're describing is very similar to what's being discussed in serilog/serilog#1791 (comment) - just a sketch, but the central idea is to implement fallbacks using shared infrastructure that multiple sinks could benefit from.

The surface syntax for the most common usage would probably end up being something like:
    .WriteTo.FallbackChain(wt => wt.OpenTelemetry(...), wt => wt.File(...))
(Naming TBC.)

In this scenario, the sink configured by WriteTo.OpenTelemetry would implement a Serilog-defined interface like ILoggingFailurePublisher, which the FallbackChain() method would dynamically check for and call SetFallbackListener(...) on when present (or throw otherwise).

There's a bit involved, and some help rolling this out would would be welcome, but I probably need to whip up a spike to get the ball rolling since there are multiple use cases we had in mind for the mechanism. I'll try to loop back in the next few days 🤞

Thanks @nblumhardt for the response. This mechanism sounds like the perfect solution for what I am looking for. I think the majority of what I've done here is easily generalisable by design and could rather trivially be extracted to a fallback chain in the way you describe.

I think from what I've seen making the fallback for this module, an appropriate design choice would be to have sinks opt in to the fallback by exposing their own extensions on a new fallback version of the type returned by WriteTo that contains that publisher instantiated already, so the consuming package can just inject that into the sink and do whatever logic it needs to determine the log state.

Either that, or we could just expose a result type on the sink emissions in a new IFailureLogSink interface using that enum as you've described, and then the fallback chain infrastructure in serilog can determine it's own retry protocol and pacing profile for the different methods.

Just a few design questions:

is serilog by design supposed to use deferred execution, as I've done extensively in my PR? The benefits I saw was it allowed for elegant deferred (and fluent) function chains that are lazily executed on task execution
is it the sink that should be responsible for retry pacing? Or would it make more sense that, since the sinks are going to expect to implement something to do with reporting failure states, that a shared retry mechanism could be implemented that does it for the sinks that opt in to the IFailureLogSink? If serilog would be responsible for retries, I think that it would make more sense for the sink to not just return its logging state, but the action it tried to do to maintain log state after any sink transformations. So the result type would look something like:
```
public struct LogEmissionState
{
private enum EmissionState
{
    Failed, 
    Success,
    Retry, 
} 

private Action? _retryAction;
private EmissionState _emissionState;
... Factory methods for the combinations
}
```

alsi-lawr commented 3 months ago

@nblumhardt a quick point about how this use case wouldn't be generalisable is that what I've implemented captures the exact request that would have been sent as either Protobuf or NDJson. We'd lose all of the open telemetry information captured and transformed if it just sends the original logevent down a chain external to this sink

nblumhardt commented 2 months ago

Hi @alsi-lawr; https://github.com/serilog/serilog/pull/2108 implements a spike of this; although it might take a bit of shuffling of code, hopefully the need to carry through some OTel-specific things like resource attributes can be worked around. Keen for some feedback if you have a chance to look in on it. Thanks for the nudge on this!

alsi-lawr commented 1 week ago

@nblumhardt is there scope to implement the fallback features in this package now?

I can work on integrating some of the otel-specific things from my fork into the new framework, and work on implementing the ISetLoggingFailureListener interface to forward those to a listener.

It should be trivial to then implement the ILoggingFailureListener in my own code.

Is there a contribution style guide, by the way?

nblumhardt commented 1 week ago

Thanks for the ping! Returning to this, I'm not 100% clear on how usage would look, could you possibly post a short snippet showing what the configuration (in C#) would look like, and how the data would flow between sinks? For resource attributes I'm wondering if the solution to this might instead just be using enrichers on the fallback sinks to explicitly add these as regular properties :thinking:

alsi-lawr commented 1 week ago

Thanks for the ping! Returning to this, I'm not 100% clear on how usage would look, could you possibly post a short snippet showing what the configuration (in C#) would look like, and how the data would flow between sinks? For resource attributes I'm wondering if the solution to this might instead just be using enrichers on the fallback sinks to explicitly add these as regular properties 🤔

The way I approached this was by re-emitting a logevent with the otlp message as a single property for an empty message.

I'm envisioning a simpler solution that just does exactly what you're suggesting.

And a second that does what I've implemented already as a class OtlpMessageFallbackSink : OpenTelemetryLogsSink that will attach the otlp raw message contents to the log event.

nblumhardt commented 1 week ago

Thanks for the follow-up. I think to do anything specific for fallbacks in this sink we'd need to dig a bit deeper into examples of what's possible with the default approach and how the modified version would differ.

If the fallback chain was targeting an OTLP sink, all of the OTLP info would/could just be re-added by specifying the same resource attributes on the fallback sink.

If the fallback chain was targeting a different sink then OTLP message bodies probably wouldn't be useful.

I might be missing something; let me know if so. Cheers!

alsi-lawr commented 1 week ago

Thanks for the follow-up. I think to do anything specific for fallbacks in this sink we'd need to dig a bit deeper into examples of what's possible with the default approach and how the modified version would differ.

If the fallback chain was targeting an OTLP sink, all of the OTLP info would/could just be re-added by specifying the same resource attributes on the fallback sink.

If the fallback chain was targeting a different sink then OTLP message bodies probably wouldn't be useful.

I might be missing something; let me know if so. Cheers!

So, my use case is that we need to have a fallback that retains all information at the moment of logging, as it's being used for financial auditing. Currently, this is logged to disk as a backup mechanism if connection to our otel-collector goes down. We then have a secondary process that ingests those binary logs (newline delimited grpc messages) and attempts retries.

This was important to have these logs persist in all situations without the reasonable possibility of loss.

Currently, this isn't possible as we have no way to hook into the gRPC call with all the context and pre-built message on failures at all without forking the project.

Re-generating the message could lose important information we require for financial audit logs. With the default implementation, we just the same context as the incoming log message before attaching resource information, so I'd need to maintain mirrored logic for that log enrichment done in this package in order to get close to the original message in another sink.

nblumhardt commented 1 week ago

Thanks for the follow-up! Perhaps for this use-case a local, buffering, OTLP forwarder service and AuditTo support would be a viable option? On the surface it's a fairly uncommon set of requirements so I'm hesitant to add specific API support for it in the sink at this stage.

serilog / serilog-sinks-opentelemetry