Question about span contexts/traces

cjohansen commented 2 months ago

We added a with-span! around our app's start function. Now all spans appear inside that single application start span. What we wanted was a single span that could give some insight about the startup, and then top-level traces for each top-level activity (e.g. message consumption).

Is it wrong to wrap the app start function with a span, or is there some way for us to break out of the context created by the first span? More generally: how do we explicitly start a new trace?

steffan-westcott commented 2 months ago

Each trace encapsulates a unit of work (transaction) performed in a system. For example, a trace captures all work performed in an HTTP server due to a single request.

Application startup is not considered a unit of work in this model, so a trace is not the correct telemetry signal to use here. To capture telemetry outside transaction processing, consider using signals other than traces, such as logs and metrics. In your case, adding log statements is likely the correct approach.

OpenTelemetry does not introduce a logging API. Instead, OpenTelemetry offers integration and adaptation of existing logging systems. clj-otel support for OpenTelemetry logs is planned for a future release.

OpenTelemetry has an experimental Events API designed for application developers to create named events that are added to logs. clj-otel will also support this in a future release.

cjohansen commented 2 months ago

I disagree with your narrow definition of a unit of work. The application startup is a unit of work that consists of multiple steps, and one in which I would like to detect anomalies and patterns over time. Let's say I were to use logs instead, as you suggest. Here's what I'd need to do:

1. Record the start time.
2. For each sub-system, record the start time, then log a message with the elapsed time and relevant configuration.
3. Log a "system started" message with the elapsed time and relevant configuration.

In other words, I'd have a homemade trace: manual timing code, with log messages standing in for spans. Not good.
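For illustration, the log-based approach in the list above might look roughly like the following plain-Clojure sketch. Nothing here is OpenTelemetry or clj-otel API: `start-subsystems!`, `elapsed-ms`, the shape of the subsystem maps and the `log!` callback are all hypothetical, and a real application would call its logging library instead.

```clojure
(defn- elapsed-ms
  "Milliseconds elapsed since `start-ns`, a System/nanoTime reading."
  [start-ns]
  (/ (- (System/nanoTime) start-ns) 1e6))

(defn start-subsystems!
  "Hypothetical sketch of the log-based approach. Starts each subsystem in
  order and logs its elapsed time and config, then logs a final
  \"system started\" entry. `subsystems` is a sequence of maps with
  :name, :config and :start-fn; `log!` is any one-argument logging function.
  Returns the vector of logged entries."
  [subsystems log!]
  (let [system-start (System/nanoTime)              ; step 1: record the start time
        entries      (atom [])
        emit!        (fn [entry]
                       (swap! entries conj entry)
                       (log! entry))]
    ;; Step 2: for each subsystem, record its start time, start it, then log
    ;; the elapsed time together with its configuration.
    (doseq [{sub-name :name :keys [config start-fn]} subsystems]
      (let [t0 (System/nanoTime)]
        (start-fn config)
        (emit! {:msg        "subsystem started"
                :subsystem  sub-name
                :elapsed-ms (elapsed-ms t0)
                :config     config})))
    ;; Step 3: log "system started" with the total elapsed time and config.
    (emit! {:msg        "system started"
            :elapsed-ms (elapsed-ms system-start)
            :config     (into {} (map (juxt :name :config)) subsystems)})
    @entries))
```

For example, `(start-subsystems! [{:name :db :config {:pool-size 4} :start-fn (fn [_] nil)}] println)` prints one entry for `:db` followed by a "system started" entry. Every timing and parent/child relationship is hand-rolled, which is the duplication of tracing being described.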

In any case, my original question is still unanswered: how do I start a new trace? There are several cases where this would be useful. For example, we have a persistent log in our application, and I would like to put a message on a queue with the trace id, then resume the same trace in the consumer. This would require me to start a trace with a predetermined trace id. I imagine the same is true for client/server flows. How can I achieve this?

steffan-westcott commented 2 months ago

When using with-span!, you start a new trace by setting the :parent option to nil. In the following example, the ::doing-bar-things span appears in a distinct trace:

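```clojure
;; Namespace as used by current clj-otel releases.
(ns example.spans
  (:require [steffan-westcott.clj-otel.api.trace.span :as span]))

(defn do-bar
  []
  ;; :parent nil makes this a root span, i.e. the start of a new trace,
  ;; even though do-bar is called from inside ::doing-foo-things.
  (span/with-span! {:name   ::doing-bar-things
                    :parent nil}
    (span/add-event! "Inside do-bar")))

(defn do-foo
  []
  (span/with-span! ::doing-foo-things
    (span/add-event! "Start of do-foo")
    (do-bar)
    (span/add-event! "End of do-foo")))
```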

The :parent option is a context value. Context propagation is a general concept in OpenTelemetry in which the context is transmitted across process boundaries, such as from client to server. opentelemetry-java provides text map propagators that read and write HTTP request headers to transmit the context. The W3C Trace Context propagation protocol is specifically for transmitting trace contexts between HTTP clients and servers. The OpenTelemetry instrumentation agent enables this propagator by default for supported HTTP servers and clients. For manual instrumentation of HTTP servers and clients, clj-otel supports this and other propagators with the functions context/->headers and context/headers->merged-context. You can see examples of these being used here and here.
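Applied to the queue scenario from the question, those two functions could be used roughly as follows. This is a hedged sketch, not a tested recipe: it assumes clj-otel's API module is on the classpath with the namespace names used by current releases, uses a `java.util.concurrent.LinkedBlockingQueue` as a stand-in for a real broker, and `publish!`, `consume!`, `message-queue` and the message shape (a map with `:headers` and `:body` keys) are hypothetical.

```clojure
(ns example.queue-propagation
  ;; Assumes clj-otel's API module; namespace names as in current releases.
  (:require [steffan-westcott.clj-otel.api.trace.span :as span]
            [steffan-westcott.clj-otel.context :as context])
  (:import (java.util.concurrent LinkedBlockingQueue)))

;; Hypothetical in-memory queue standing in for a real message broker.
(def message-queue (LinkedBlockingQueue.))

(defn publish!
  "Producer side: puts a message on the queue together with headers that
  carry the current context, as written by the configured propagators."
  [payload]
  (span/with-span! {:name ::publish :span-kind :producer}
    (.put ^LinkedBlockingQueue message-queue
          {:headers (context/->headers)   ; e.g. a W3C traceparent entry
           :body    payload})))

(defn consume!
  "Consumer side: takes one message and processes it in a span whose parent
  is the context extracted from the message headers, so the span continues
  the producer's trace. Returns the message body."
  []
  (let [{:keys [headers body]} (.take ^LinkedBlockingQueue message-queue)
        parent-ctx             (context/headers->merged-context headers)]
    (span/with-span! {:name      ::consume
                      :parent    parent-ctx
                      :span-kind :consumer}
      (span/add-event! "Processing message")
      body)))
```

The producer's span context travels inside the message, so the consumer's span joins the producer's trace instead of starting a new one. Whether any headers are actually written depends on the configured propagators; with no SDK or propagator configured, `->headers` may return an empty map, in which case the consumer span simply starts its own trace.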

OpenTelemetry trace support in messaging systems is growing, but I don't know how mature it is today. You can read further about the semantic conventions for messaging spans and context propagation here. The OpenTelemetry instrumentation agent supports messaging spans in many client libraries, such as RabbitMQ.

steffan-westcott / clj-otel, issue #18