slashmo / gsoc-swift-tracing

WIP collection of Swift libraries enabling Tracing on Swift (Server) systems.
https://summerofcode.withgoogle.com/projects/#6092707967008768
Apache License 2.0

Add forceFlush to trace instruments #85

Closed pokryfka closed 4 years ago

pokryfka commented 4 years ago

This is a follow up of https://github.com/pokryfka/aws-xray-sdk-swift/pull/16#discussion_r457828748.

I don't have a strong opinion on this, but I think it's worth a discussion, and it exceeds the scope of a PR comment.


XRayRecorder currently exposes an NIO-based flush:

    public func flush(on eventLoop: EventLoop) -> EventLoopFuture<Void> {}

as well as a blocking wait without an NIO dependency:

    public func wait(_ callback: ((Error?) -> Void)? = nil) {}
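
For illustration, a minimal sketch of how these two entry points would typically be used (the wrapper function names below are made up for this example, not part of the recorder's API):

    import NIO

    // Inside NIO code (e.g. a request handler), chain the non-blocking flush:
    func flushOnEventLoop(_ recorder: XRayRecorder, on eventLoop: EventLoop) -> EventLoopFuture<Void> {
        recorder.flush(on: eventLoop)
    }

    // Outside NIO (e.g. a process shutdown hook), block until emitting has finished:
    func flushBlocking(_ recorder: XRayRecorder) {
        recorder.wait { error in
            if let error = error {
                print("flush failed: \(error)")
            }
        }
    }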

@ktoso

Could you handle this in the recorder's shutdown?

yes for "services" (Vapor, Kitura, ...)

However, in the case of a "lambda runtime", my understanding is that we do need to emit all the segments during event handling (even though it will make the request handling longer), because otherwise the Lambda may be suspended (and potentially never wake up again) before emitting finishes. Example:

// Imports assumed: AWSLambdaRuntime and AWSLambdaEvents from swift-aws-lambda-runtime,
// NIO from swift-nio, and the recorder module from aws-xray-sdk-swift
// (the exact X-Ray module name may differ).
import AWSLambdaEvents
import AWSLambdaRuntime
import AWSXRayRecorder
import NIO

private struct ExampleLambdaHandler: EventLoopLambdaHandler {
    typealias In = Cloudwatch.ScheduledEvent
    typealias Out = Void

    private let recorder = XRayRecorder()

    // Simulates some work on the event loop.
    private func doWork(on eventLoop: EventLoop) -> EventLoopFuture<Void> {
        eventLoop.submit { usleep(100_000) }.map { _ in }
    }

    func handle(context: Lambda.Context, event: In) -> EventLoopFuture<Void> {
        // Record the work in a segment, then explicitly flush before returning,
        // so that all segments are emitted before the Lambda may be suspended.
        recorder.segment(name: "ExampleLambdaHandler", context: context) {
            self.doWork(on: context.eventLoop)
        }.flatMap {
            self.recorder.flush(on: context.eventLoop)
        }
    }
}

Lambda.run(ExampleLambdaHandler())

To be honest, while my explanation above makes sense to me, I plan to check in more detail how this is implemented/enforced in the "official" X-Ray SDKs.

Also note that in the case of X-Ray, emitting means sending UDP packets on the local network, which should be comparatively inexpensive and may finish (subject to more testing) before the Lambda is suspended, even without explicit flushing. Emitting spans with a different TracingInstrument in a Lambda handler will typically be much more expensive and take more time.

pokryfka commented 4 years ago

To be honest, while my explanation above makes sense to me, I plan to check in more detail how this is implemented/enforced in the "official" X-Ray SDKs.

@ktoso is there anyone in "swift-server" who could help clarify this?

ktoso commented 4 years ago

We didn't add the forceFlush() API yet, and we should do so.

I very much agree with the wording they have there: normally it should not be necessary to invoke these, so the name "forceFlush" is very fitting. And it's just what X-Ray + Lambda will need here.

I think I previously mentioned adding a shutdown(), which we should also do.

Reference:

Shutdown()

Shuts down the processor. Called when SDK is shut down. This is an opportunity for processor to do any cleanup required.

Shutdown should be called only once for each Processor instance. After the call to shutdown subsequent calls to onStart, onEnd, or forceFlush are not allowed.

Shutdown should not block indefinitely. Language library authors can decide if they want to make the shutdown timeout configurable.

ForceFlush()

Export all ended spans to the configured Exporter that have not yet been exported.

ForceFlush should only be called in cases where it is absolutely necessary, such as when using some FaaS providers that may suspend the process after an invocation, but before the Processor exports the completed spans.

ForceFlush should not block indefinitely. Language library authors can decide if they want to make the flush timeout configurable.
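
For context, here is a rough sketch of how these two hooks could surface on a tracer type in Swift; the protocol below is illustrative only, mirroring the spec wording rather than an existing API:

    // Illustrative only: a possible shape for the two lifecycle hooks on an instrument.
    protocol TracerLifecycle {
        /// Export all ended spans that have not yet been exported.
        /// Only needed when a FaaS provider may suspend the process right after an invocation.
        func forceFlush()

        /// Shut the instrument down and release its resources.
        /// Must be called at most once; afterwards forceFlush and span recording are not allowed.
        func shutdown()
    }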

Yes, we have relationships with some folks at Amazon (e.g. @tachyonics, working on Smoke) whom we can ping if we need help. In this case, though, I think it's already clear what we should do :)

pokryfka commented 4 years ago

Shutdown should be called only once for each Processor instance. After the call to shutdown subsequent calls to onStart, onEnd, or forceFlush are not allowed.

What does it mean that they are not "allowed"? Do they have no effect, or do they throw?

ktoso commented 4 years ago

That's your call for how to treat this in a specific SDK implementation. I don't think the API itself should mandate anything here.

They cannot throw; the options are to ignore or to crash, IMO, with ignoring likely being just fine (it could be made configurable, I suppose).
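
For illustration, one way an SDK implementation could realize the "ignore after shutdown" policy; the type below is hypothetical and assumes SwiftNIO's NIOConcurrencyHelpers is already a dependency:

    import NIOConcurrencyHelpers

    // Hypothetical recorder demonstrating "ignore calls after shutdown".
    final class ExampleRecorder {
        private let isShutDown = NIOAtomic<Bool>.makeAtomic(value: false)

        func forceFlush() {
            // Silently ignore (rather than throw or crash) once shut down.
            guard !isShutDown.load() else { return }
            exportPendingSpans()
        }

        func shutdown() {
            // Only the first call takes effect; later calls are ignored.
            guard isShutDown.compareAndExchange(expected: false, desired: true) else { return }
            exportPendingSpans()
            // ... release any resources ...
        }

        private func exportPendingSpans() {
            // ... export all ended spans that have not yet been exported ...
        }
    }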

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/sdk.md#shutdown

Feel free to ask on the spec repo for clarification.

ktoso commented 4 years ago

It would be nice to get this in; I recently had a tracer where it would have been useful.

Proposed docs:

    /// Export all ended spans that have not yet been exported to the configured backend.
    ///
    /// This function should only be called in cases where it is absolutely necessary,
    /// such as when using some FaaS providers that may suspend the process after an invocation,
    /// but before the backend has exported the completed spans.
    ///
    /// This function should not block indefinitely; implementations should offer a configurable timeout for flush operations.
    func forceFlush()
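
As a usage note, an X-Ray-backed tracer could likely satisfy this by bridging to the blocking wait shown earlier; a minimal sketch, with a hypothetical wrapping type:

    // Hypothetical bridge from forceFlush() to XRayRecorder's blocking wait.
    struct XRayTracerExample {
        let recorder: XRayRecorder

        func forceFlush() {
            // Block until all ended segments have been emitted, so a FaaS provider
            // cannot suspend the process with segments still pending.
            recorder.wait { error in
                if let error = error {
                    // forceFlush() is non-throwing, so log instead of rethrowing.
                    print("forceFlush failed: \(error)")
                }
            }
        }
    }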