open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Semantics of active Span (Latency vs. active duration, CPU times, ...) #330

Open Oberon00 opened 5 years ago

Oberon00 commented 5 years ago

As far as I know, there is currently no semantic meaning assigned to the active span, other than that it will become the parent of any new span that does not have an explicit parent set.

I think it would be very interesting to assign the meaning of "this Span is currently actively being processed on this thread" to that. This would allow additional timings on the span to be collected: in addition to the current measure of "latency" that results from end - start, we would have a "processing duration" that indicates how long the span (or any child of it) has been active in total. Currently, for example, the span with the longest latency may only be that slow because other operations unrelated to that Span were scheduled in between.
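To make the difference between latency and "processing duration" concrete, here is a minimal sketch. `TimedSpan`, `activate`, `deactivate`, and `finish` are hypothetical names for illustration, not part of any OpenTelemetry API, and the injected clock exists only to make the example deterministic.

```python
import time
from typing import Callable, Optional


class TimedSpan:
    """Hypothetical span that records latency and total active duration."""

    def __init__(self, name: str, clock: Callable[[], float] = time.monotonic) -> None:
        self.name = name
        self._clock = clock
        self.start = clock()
        self.end: Optional[float] = None
        self.active_total = 0.0
        self._activated_at: Optional[float] = None

    def activate(self) -> None:
        # Start an active interval unless one is already running.
        if self._activated_at is None:
            self._activated_at = self._clock()

    def deactivate(self) -> None:
        # Close the running active interval, if any, and accumulate it.
        if self._activated_at is not None:
            self.active_total += self._clock() - self._activated_at
            self._activated_at = None

    def finish(self) -> None:
        self.deactivate()
        self.end = self._clock()

    @property
    def latency(self) -> float:
        if self.end is None:
            raise RuntimeError("span has not finished")
        return self.end - self.start


def demo() -> TimedSpan:
    # Fake clock: each call returns the next timestamp in the list.
    ticks = iter([0.0, 1.0, 3.0, 8.0, 9.0, 10.0])
    span = TimedSpan("request", clock=lambda: next(ticks))  # start at 0.0
    span.activate()    # t = 1.0
    span.deactivate()  # t = 3.0, active for 2.0
    span.activate()    # t = 8.0, unrelated work ran in between
    span.finish()      # deactivated at t = 9.0, ended at t = 10.0
    return span
```

In this run the span has a latency of 10.0 but was active for only 3.0. That gap is exactly what end - start alone cannot show.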

I could also imagine using an API like http://man7.org/linux/man-pages/man2/getrusage.2.html or https://docs.oracle.com/javase/8/docs/platform/jvmti/jvmti.html#GetCurrentThreadCpuTime to measure the CPU time (user/sys) that has been consumed by a Span's operation.
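As a rough illustration of the idea in Python, the standard library's `time.thread_time()` returns the CPU time (user plus system) of the calling thread. The context manager and the `totals` dictionary below are hypothetical helpers, not an existing SDK feature, and the sketch assumes the span stays on a single thread while it is measured.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def measure_thread_cpu(totals: Dict[str, float], span_name: str) -> Iterator[None]:
    """Accumulate the current thread's CPU time and wall time for a span."""
    cpu_start = time.thread_time()      # CPU time of this thread only
    wall_start = time.perf_counter()    # wall-clock reference
    try:
        yield
    finally:
        cpu_key = span_name + ".cpu_s"
        wall_key = span_name + ".wall_s"
        totals[cpu_key] = totals.get(cpu_key, 0.0) + (time.thread_time() - cpu_start)
        totals[wall_key] = totals.get(wall_key, 0.0) + (time.perf_counter() - wall_start)


def demo() -> Dict[str, float]:
    totals: Dict[str, float] = {}
    with measure_thread_cpu(totals, "request"):
        sum(i * i for i in range(50_000))  # consumes CPU
        time.sleep(0.05)                   # adds wall time but almost no CPU time
    return totals
```

`time.thread_time()` does not split user and system time. On Linux, `resource.getrusage(resource.RUSAGE_THREAD)` reports them separately as `ru_utime` and `ru_stime`.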

tigrannajaryan commented 5 years ago

@Oberon00 this is an interesting idea and something that an SDK may choose to implement. However, I do not see why this needs to be part of the spec. Does being "active" in the sense that CPU times may be measured have any effect on how the span behaves otherwise from the API's or SDK's perspective?

> I think it would be very interesting to assign the meaning of "this Span is currently actively being processed on this thread" to that. This would allow additional timings on the span to be collected

How does the absence of that phrase prevent additional timings from being collected? Are you aiming to define a standard for collecting extra associated telemetry information (e.g. CPU times) for the active Span? That is, would the goal be to specify where exactly in the Span data model this information is recorded?

tsloughter commented 5 years ago

I think "active span" needs to be defined, and preferably not to mean the span whose body is currently getting CPU cycles. That feels too tied to low-level implementation details.

In Go or Erlang I would think "active span" would be per-goroutine/process and refer to the span that is in the context of that goroutine or process, regardless of whether the goroutine or process is currently running on a particular scheduler.
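The same per-execution-unit notion can be sketched in Python, using `contextvars` with asyncio tasks as a stand-in for goroutines or Erlang processes. The variable `_active_span` and the span names are hypothetical. The point is only that each task keeps its own "active span" while it is suspended.

```python
import asyncio
import contextvars
from typing import List, Optional, Tuple

# Hypothetical holder for the active span of the current task.
_active_span = contextvars.ContextVar("active_span", default=None)


async def worker(span_name: str, log: List[Tuple[str, Optional[str]]]) -> None:
    _active_span.set(span_name)   # visible only inside this task's context
    await asyncio.sleep(0)        # suspend; the other task gets to run
    # After resuming, this task still sees its own span.
    log.append((span_name, _active_span.get()))


async def main() -> Tuple[List[Tuple[str, Optional[str]]], Optional[str]]:
    log: List[Tuple[str, Optional[str]]] = []
    # gather() wraps each coroutine in a task with its own copy of the context.
    await asyncio.gather(worker("span-a", log), worker("span-b", log))
    return log, _active_span.get()  # the outer context was never modified
```

Running `asyncio.run(main())` shows each worker reading back its own span name, while the caller's context still has no active span. Here "active" means "attached to this task's context", not "currently on a CPU".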

Oberon00 commented 5 years ago

@tigrannajaryan

> I do not see why this needs to be part of the spec.

Because otherwise an SDK that tries to track CPU time based on the activeness of a Span relies on unspecified behavior. For example, an integration may choose not to set the span as active at all if it knows that there won't be any child spans. That SDK would then report something wrong like "100% suspension", and the SDK implementers couldn't even complain to the developers of the instrumentation.

What I suggest to be in the spec would be just something like:

> A span should be activated on a context whenever the operation that is traced is worked on in that context, even if no child spans are created.

Oberon00 commented 5 years ago

@tsloughter

> I think "active span" needs to be defined and preferably not to mean the span whose body is currently getting cpu cycles.

Fair enough. But what do you think about the vaguer statement I suggest above?

tsloughter commented 5 years ago

@Oberon00 it is too implementation specific. I can see the use in knowing how much CPU time a span gets; I just wouldn't want it to be a) tied to the idea of "active span" or b) part of the spec, or at least not a required part.

In Erlang/Elixir we have reductions: each process gets a set number of them to run before it is preempted. I could see it being interesting to have a reduction count per span, but it is very implementation specific, and I think it is the only option we'd have for any sort of "active duration".

Oberon00 commented 4 years ago

@tsloughter How is that last statement implementation specific?

> A span should be activated on a context whenever the operation that is traced is worked on in that context, even if no child spans are created.

I understand your concerns about the CPU timings, though, which is why they no longer appear in that statement.

carlosalberto commented 4 years ago

Please comment in case this is of interest.

Moving it to 0.4 in the meantime.

andrewhsu commented 4 years ago

@Oberon00 we talked about this at the spec SIG meeting today. Should this be closed in favor of https://github.com/open-telemetry/opentelemetry-specification/issues/591?

Oberon00 commented 4 years ago

#591 is certainly related, but only related.

After #591 is resolved, this will still be true (quoting from the description):

> There is currently no semantic meaning assigned to the active span, other than that it will become the parent of any new span that does not have an explicit parent set.

The current measure of "latency" that results from end - start is still the only measure we have. We do not know how much of this time was spent waiting for other operations (tracked by other spans), for example.
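For illustration, the most that start and end timestamps alone allow is a "self time" estimate: the span's latency minus the time covered by its direct children. The sketch below is a hypothetical helper, not an OpenTelemetry API. It assumes all timestamps share one clock and that each interval has start < end.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) on a shared clock


def self_time(parent: Interval, children: List[Interval]) -> float:
    """Parent latency minus the time covered by at least one child span."""
    p_start, p_end = parent
    # Keep only children that overlap the parent, clipped to the parent's bounds.
    clipped = sorted(
        (max(s, p_start), min(e, p_end))
        for s, e in children
        if s < p_end and e > p_start
    )
    covered = 0.0
    cursor = p_start  # everything before the cursor is already counted
    for s, e in clipped:
        if e <= cursor:
            continue  # fully inside an interval that was already counted
        covered += e - max(s, cursor)
        cursor = e
    return (p_end - p_start) - covered
```

For a parent spanning (0, 10) with children (1, 4), (3, 6), and (8, 9), the children cover 6 units, so `self_time` returns 4.0. Those 4 units still mix real work with time the operation spent suspended or waiting on anything that has no span of its own. Only an activation signal could separate the two.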

jkwatson commented 4 years ago

This was discussed briefly in today's spec meeting. Everyone agrees that this data is interesting, but also seems to agree that it is quite complex to get right 100% of the time. It would be helpful to start with a simple suggestion that makes it very obvious what instrumentation should do in 80% of the cases.