restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.
https://docs.restate.dev

Export Jaeger spans on behalf of services / service endpoints #547

Closed tillrohrmann closed 1 year ago

tillrohrmann commented 1 year ago

It could be a nice UX improvement to let the runtime export Jaeger spans on behalf of the service endpoints, because then the user wouldn't have to configure their service endpoints to do so. Moreover, the SDK wouldn't have to support exporting OTEL spans.

It is, however, currently not clear whether this is easily possible with the existing tracing and OTEL libraries.

Another aspect to consider is whether the emitted spans are correct if the runtime emits them on behalf of the service endpoint without exactly knowing what is happening on the service endpoint (e.g. the starting timestamps of active services would appear earlier than they really are, because the transmission and queuing latency is ignored).

slinkydeveloper commented 1 year ago

I assigned this to myself to investigate it, and collect our previous findings on the topic.

tillrohrmann commented 1 year ago

One answer to the problem could be that the Jaeger visualization is not the right tool for the job because it is intended for tracing physical activity whereas we are more interested in visualizing the logical activity. Maybe we could take the Jaeger data (traces and spans) and render it differently with our own visualization (e.g. the start and end span of an invocation could denote the length of a logical invocation).

slinkydeveloper commented 1 year ago

First and foremost, it is important to note that we already export all the available information we have in the runtime for the invocations, recording enough details to filter them out and to build an alternative presentation layer. From this point of view, no action needs to be taken IMO.

If I understand the issue correctly, to act on behalf of the service we would need to set the Resource's name accordingly. The first problem is that no OpenTelemetry SDK will ever allow us to change the resource.name on spans in flight or at setup time, as their spec states that the resource is immutable:

A Resource is an immutable representation of the entity producing telemetry as Attributes. For example, a process producing telemetry that is running in a container on Kubernetes has a Pod name, it is in a namespace and possibly is part of a Deployment which also has a name. When used with distributed tracing, a resource can be associated with the TracerProvider when the TracerProvider is created. That association cannot be changed later. When associated with a TracerProvider, all Spans produced by any Tracer from the provider MUST be associated with this Resource.

This means we would need to implement or fork a tracing exporter that allows us to change the resource.name field "in flight", so that when we export we can "behave like a service endpoint". In theory this is possible, as we just need to export a different batch of spans per service endpoint. In practice, implementing an exporter comes with a number of challenges: it requires understanding the internals of tokio's tracing and the OpenTelemetry SDK, and over time we would probably need to support different trace export solutions/protocols (agent UDP Thrift/Protobuf, OTEL Protobuf, just to name a few) to retain the same flexibility the OpenTelemetry SDK provides. The contracts of the various "exporter APIs" give a sense of the surface we would have to cover.

Because this solution would be non-standard, there is even a (very small) risk that in the future our exporter would no longer be compatible with new Jaeger versions.

On top of that, I second the sentiment expressed by @tillrohrmann about generating "data that don't match the reality". If we emit spans on behalf of the endpoint, timestamps and events won't match, it won't be possible to know which pod actually processed the invocation, and it won't be possible to integrate tracing with external systems through W3C trace context (see https://github.com/restatedev/sdk-java/issues/76). The last point is particularly important for our story of integrating with existing systems: if we stay within the "tracing standards", it is a piece of cake for a user to see a full trace, from the web UI to a Restate service to some old internal microservice.

I'm personally in favor of working towards a custom presentation layer. If we plan to have a Restate GUI/CLI anyway in the long term, it is probably a better time investment to build a visualization tool as part of that Restate GUI/CLI which takes data from Jaeger and visualizes it the way we want. This visualization might even just be a fork of the Jaeger UI, rather than something we start from scratch.

jackkleeman commented 1 year ago

This means we would need to implement or fork a tracing exporter that allows us to change the resource.name field "in flight", so that when we export we can "behave like a service endpoint"

Can we just have multiple TracerProviders, each set up for a given resource.name? Especially if we are using UDP delivery, this would not have too much cost, right?

EDIT: it seems that in Java land people are doing this (https://github.com/open-telemetry/opentelemetry-java/issues/1109), and you can still share exporters so that there is only one connection to the backend despite multiple resources.

Even if we are uncomfortable with the memory implications here, I think we need to assess whether the fork you describe to support 'in-flight' changes is actually difficult. I don't think we should presume that it's going to be hard to keep a flexible tracing library despite this change. And I also don't think we should introduce complexity now just because we might want to support other trace backends in the future - we need a simple tracing solution that fits users' intuitions, and this might be a way to get one very rapidly.
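For illustration, a minimal sketch of the "one TracerProvider per service name" idea against the Rust OpenTelemetry SDK. The builder calls follow the ~0.18-era API and may differ between versions; the shared exporter wiring is omitted and the `ProviderRegistry` type is hypothetical.

```rust
use std::collections::HashMap;

use opentelemetry::{
    sdk::{trace as sdktrace, Resource},
    KeyValue,
};

/// Build a provider whose Resource carries the endpoint's own service.name.
/// A span processor/exporter (ideally one shared connection to the backend)
/// would also be attached via the builder; it is omitted here because the
/// exact exporter wiring differs between opentelemetry versions.
fn provider_for(service_name: &str) -> sdktrace::TracerProvider {
    sdktrace::TracerProvider::builder()
        .with_config(sdktrace::config().with_resource(Resource::new(vec![
            KeyValue::new("service.name", service_name.to_string()),
        ])))
        .build()
}

/// Hypothetical bookkeeping the runtime would need: providers created lazily
/// as service endpoints are discovered.
struct ProviderRegistry {
    providers: HashMap<String, sdktrace::TracerProvider>,
}

impl ProviderRegistry {
    fn provider(&mut self, service_name: &str) -> &sdktrace::TracerProvider {
        self.providers
            .entry(service_name.to_string())
            .or_insert_with(|| provider_for(service_name))
    }
}
```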

jackkleeman commented 1 year ago

On top of that, I second the sentiment expressed by @tillrohrmann about generating "data that don't match the reality". If we emit spans on behalf of the endpoint, timestamps and events won't match

I'm still not sure I understand this. We can just report the runtime's timestamps - we report when we became aware of the service making a call, reading some state, sleeping, etc. The trace will match the journal. What reality does it not match? Only things the service did that weren't journaled - which is OK, right?

jackkleeman commented 1 year ago

Re the presentation layer - do we agree that it is possible for Jaeger to have the data so that it shows exactly what we want? If so, isn't the question simply how we get that data to Jaeger? Perhaps transformation steps are needed before it gets there - whether in our code or somewhere else. But I am not sure how this is a UI problem, when the Jaeger UI fundamentally shows whatever the data says.

jackkleeman commented 1 year ago

Looking into the opentelemetry library, the Resource data is just a Cow reference that gets read out of the TracerProvider config and then put into every SpanData. I don't see any reason why every SpanData couldn't have a different Resource - or why the batches need to be grouped by Resource. So I think it's a very minor change to allow the Span itself to override some Resource fields. I'll test this out - maybe that's all we need.

slinkydeveloper commented 1 year ago

One additional aspect I didn't include in my initial comment, and which makes me lean toward the "custom presentation layer", is the "end span problem". As shown in https://github.com/restatedev/restate/pull/211, we have for each invocation a service_invocation span and an end_invocation span that is a child of service_invocation. This is unfortunately needed because:

The reason we're doing the latter is that we want to show users the invocation's progress. In particular for long workflows, it is important to show the progress of an invocation before it ends.

tillrohrmann commented 1 year ago

With #552 we will switch from using UDP to gRPC for reporting OTLP trace data.

I'm still not sure I understand this. We can just report the runtime's timestamps - we report when we became aware of the service making a call, reading some state, sleeping, etc. The trace will match the journal. What reality does it not match? Only things the service did that weren't journaled - which is OK, right?

Assume for example that service A invokes service B and the runtime sees the invoke journal entry. If it creates a span for invoking service B when it sends the invocation (from within the invoker), then it records how things look from the runtime's perspective (e.g. when the call went out). The runtime does not necessarily know the exact machine on which the call will run, nor when it will actually start. If the service endpoint created the span instead, one could see how long it took from the runtime sending the message to the code actually executing.

Note that we are already creating spans for all kinds of journal actions/entries as they arrive at the runtime (just under the runtime process identity and not the respective service identity).

I guess one question is how important tracing the exact physical activity is vs. how sufficient it is to trace what the runtime thinks is currently happening. Do people use trace information to diagnose infrastructure problems like congestion by looking at queuing times, for example? Do people want their traces to tell them on which pod exactly a service was executed?

One can of course also opt in to more accurate tracing by enabling it on the SDK (assuming that the SDK has support for it at some point).

jackkleeman commented 1 year ago

I personally think that the superpower here is that you can get a really good view into what your application is doing by only looking at Restate. If you want more than that - add tracing to your application! Personally I use tracing to get an idea of the flow of what happened, not to get super accurate durations.

I think the opt-in aspect is critical here - some things, e.g. what pod it ran on, we can never provide without the user setting up tracing on their endpoint. But having a really amazing observability story with no such setup is a huge win for us!

jackkleeman commented 1 year ago

With https://github.com/restatedev/restate/issues/552 we will switch from using UDP to gRPC for reporting OTLP trace data.

In any case, the exporter and the TracerProvider are separate; there will still be one gRPC client, even if we had many trace providers.

But a fork might be simpler :)

jackkleeman commented 1 year ago

It looks like with the new OTLP protocol, the span Resource (including the service name) is no longer required to be the same for a given batch. When you send spans, you can group them under as many different resources as you like in one request. I've made a small change to pull service.name out into a Resource for each span, and it looks like this:

(screenshot of the resulting Jaeger trace)
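To make the OTLP point concrete, here is a simplified Rust model of the export request shape. These are illustrative stand-ins, not the generated protobuf types from opentelemetry-proto; the point is that one request carries several per-resource blocks, so spans for different service.names can travel in the same batch.

```rust
use std::collections::BTreeMap;

/// Stand-in for an OTLP span; the real message has many more fields.
struct Span {
    name: String,
}

/// One per-resource block inside the request.
struct ResourceSpans {
    /// Resource attributes, e.g. ("service.name", "checkout-service").
    resource_attributes: Vec<(String, String)>,
    spans: Vec<Span>,
}

/// Stand-in for the OTLP ExportTraceServiceRequest.
struct ExportTraceServiceRequest {
    resource_spans: Vec<ResourceSpans>,
}

/// Group runtime-collected spans by the service they should be attributed to,
/// producing one ResourceSpans block per service.name within a single request.
fn group_by_service(spans: Vec<(String, Span)>) -> ExportTraceServiceRequest {
    let mut by_service: BTreeMap<String, Vec<Span>> = BTreeMap::new();
    for (service_name, span) in spans {
        by_service.entry(service_name).or_default().push(span);
    }
    ExportTraceServiceRequest {
        resource_spans: by_service
            .into_iter()
            .map(|(service_name, spans)| ResourceSpans {
                resource_attributes: vec![("service.name".into(), service_name)],
                spans,
            })
            .collect(),
    }
}
```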

However, I think we would actually want something a little different. At info level, I think we just want a clear flow of what the service did; it was invoked, it called get, it called set, it called another service, etc. Currently there is a lot of restate infrastructural stuff in there which is fairly hard to parse. And I am not sure I can see any of get state, create timer etc; maybe these things are not tied to a particular invocation span at the moment?

tillrohrmann commented 1 year ago

That's nice of OTLP :-)

And I am not sure I can see any of get state, create timer etc; maybe these things are not tied to a particular invocation span at the moment?

I think the journal actions (get, set, clear state, invoke another service, etc.) are logged as events that get associated with their respective span. You should see them in the logs section if I am not mistaken (assuming that you enabled DEBUG logging for tracing; the default is info, I believe).

At info level, I think we just want a clear flow of what the service did; it was invoked, it called get, it called set, it called another service, etc.

That would indeed be nice. At the same time, it would be great if this information weren't logged to the log file at info level, because per-invocation logging can become quite verbose. Since logging and tracing both use the same infrastructure, we currently log the detailed invocation actions (get, set, invoke, etc.) at debug so that they don't appear in the logs by default.
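A rough sketch of how such a split can be wired with per-layer filters in tracing-subscriber. This is an illustration, not necessarily how the runtime actually configures it; the OTLP tracer construction is omitted and exact crate versions are assumed.

```rust
use tracing_subscriber::{fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter, Layer};

fn init_telemetry() {
    // The real runtime would build an OTLP tracer elsewhere and attach it with
    // `.with_tracer(...)`; `tracing_opentelemetry::layer()` defaults to a no-op
    // tracer, which keeps this sketch self-contained.
    let otel_layer = tracing_opentelemetry::layer();

    tracing_subscriber::registry()
        // Human-readable logs stay at `info`, so per-invocation detail does not
        // flood the log file.
        .with(fmt::layer().with_filter(EnvFilter::new("info")))
        // The OTEL/Jaeger layer is allowed down to `debug`, where the journal
        // actions (get, set, invoke, ...) are emitted as span events.
        .with(otel_layer.with_filter(EnvFilter::new("debug")))
        .init();
}
```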

jackkleeman commented 1 year ago

Yeah, I'm trying to find them but not succeeding at the moment. I think it would be cool if we could have spans for these things.

The reason we're doing the latter is that we want to show users the invocation's progress. In particular for long workflows, it is important to show the progress of an invocation before it ends.

How do we want suspending to work here? I would have expected that an invocation ends on suspension, so we will get an end span pretty regularly and we don't have to worry about in progress spans. Would that solve the end invocation problem? Then on replay, you see a new span showing just the new journal entries, showing that the service 'picked up where it left off'

tillrohrmann commented 1 year ago

Yeah, I'm trying to find them but not succeeding at the moment. I think it would be cool if we could have spans for these things.

You need to set the jaeger filter to debug, I believe. By default it is set to info.

tillrohrmann commented 1 year ago

How do we want suspending to work here? I would have expected that an invocation ends on suspension, so we will get an end span pretty regularly and we don't have to worry about in progress spans. Would that solve the end invocation problem? Then on replay, you see a new span showing just the new journal entries, showing that the service 'picked up where it left off'

I think we still need a parent span that shows the whole duration of an invocation. Suspension is a Restate detail. From the service's perspective it is still time that the service needs to complete its invocation.

jackkleeman commented 1 year ago

Fair enough! I feel that we are once again talking about the edge case of very long-running invocations... I think users' intuition will be that tracing shows only completed requests, and 95% of the time this will be within a few minutes of the request being invoked. I understand there might be demand for showing in-progress invocations, but that feels like a different observability tool really - let's remember that in production you normally have a trace sampling probability of 0.01 or lower (though often higher for failed requests), so you really can't bank on any given request being traced unless it failed (in which case, it has completed). You often use tracing to get an idea of the categories of requests that are being made and how they flow - so it's fine to only see completed requests, IMO.

tillrohrmann commented 1 year ago

I am not arguing in favor of the end span here. I think it can also be fine to only see the spans once the invocation is done. My point is that it would be nice to have a single span, or two markers, that describes the duration of an invocation. If we want, we can still show the periods when it ran on the service endpoint until it hit a suspension as child spans.

jackkleeman commented 1 year ago

Yes, I understand - I agree.

btw - it looks like Temporal is running into similar complexities with Jaeger spans as we are: https://github.com/temporalio/sdk-typescript/pull/278

jackkleeman commented 1 year ago

Turns out it's possible without forking the OTEL libs :). We can wrap the OTEL span processor with our own, which modifies spans right before they get sent. I think that alone is sufficient.
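A self-contained sketch of that wrapping idea. The trait and span type below are stand-ins that only mirror the general shape of the SDK's SpanProcessor/SpanData (exact signatures vary by OpenTelemetry version), and the attribute name is hypothetical.

```rust
use std::collections::HashMap;

/// Stand-in for the SDK's exported span record (`SpanData`).
#[derive(Debug, Clone)]
struct SpanData {
    name: String,
    attributes: HashMap<String, String>,
    /// Resource attributes, normally fixed per TracerProvider (`service.name`, ...).
    resource: HashMap<String, String>,
}

/// Stand-in for the SDK's span processor contract (only the part needed here).
trait SpanProcessor: Send + Sync {
    fn on_end(&self, span: SpanData);
}

/// Wraps an inner processor (e.g. the batch processor feeding the OTLP exporter)
/// and rewrites `service.name` from a span attribute right before forwarding.
struct ServiceNameRewriter<P: SpanProcessor> {
    inner: P,
    /// Hypothetical attribute the runtime would set on each invocation span.
    service_attribute: &'static str,
}

impl<P: SpanProcessor> SpanProcessor for ServiceNameRewriter<P> {
    fn on_end(&self, mut span: SpanData) {
        if let Some(service) = span.attributes.get(self.service_attribute).cloned() {
            // Export "on behalf of" the service endpoint instead of the runtime.
            span.resource.insert("service.name".to_string(), service);
        }
        self.inner.on_end(span);
    }
}
```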

tillrohrmann commented 1 year ago

Turns out it's possible without forking the OTEL libs :). We can wrap the OTEL span processor with our own, which modifies spans right before they get sent. I think that alone is sufficient.

Nice, that would be really great :-)

jackkleeman commented 1 year ago
(screenshot of the current trace output)

This is the current status: emitting the service_invocation span after completion. It requires storing the parent span ID of the invocation, whereas previously we just stored the span context (i.e. trace ID and span ID).
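Roughly what that means for the persisted metadata (hypothetical struct and field names, just to illustrate the extra field):

```rust
/// Hypothetical shape of the per-invocation tracing metadata the runtime persists.
/// Previously only the span context (trace id + span id) was needed; emitting the
/// service_invocation span after completion also requires remembering its parent.
#[derive(Debug, Clone, Copy)]
struct InvocationTracingInfo {
    trace_id: [u8; 16],
    /// Span id of the invocation's own service_invocation span.
    span_id: [u8; 8],
    /// Span id of the caller's span, needed to re-create the parent link when
    /// the span is finally emitted at completion time.
    parent_span_id: [u8; 8],
}
```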

With invocation and sleep durations, I feel we have arrived at a sensible checkpoint, where the next GitHub issue would be to set up durations for all operations that go via the log, to show the Raft overhead. I will start to clean up the code I have and get it into a usable state.