open-telemetry / opentelemetry.io

The OpenTelemetry website and documentation
https://opentelemetry.io
Creative Commons Attribution 4.0 International

Provide examples for Inter-Service-Tracing #1949

Open NobbZ opened 1 year ago

NobbZ commented 1 year ago

What needs to be changed?

Currently "distributed tracing" is mentioned as a major selling point, and it is also mentioned that this works "somehow" via Context Propagation and Baggage.

There is not a single word, though, about how to use this between two independent services.

Therefore it would be nice if there were some examples showing this.

This example could be implemented as a Docker Compose setup that provides a visualizer as well as two services that interact with each other.

An example would be a web server exposed as the "frontend", which can be curled like an echo server: it expects a name in the query string (or in the URL path) and forwards it to the other service. The other service reverses the string, and the reversed string is then returned to the client in the response body.

Ideally this example would exist per language.

Of course I see how the details might differ between transports, but the important thing is how to get the context serialized and deserialized. Getting it into and out of the transport is the easy part.
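To make it concrete, this is roughly the shape of what I would expect such an example to show (an Elixir sketch, not a working service pair; it assumes the opentelemetry_api package, its :otel_propagator_text_map inject/extract functions, and the default W3C propagators; all module and span names are made up):

```elixir
defmodule PropagationSketch do
  require OpenTelemetry.Tracer, as: Tracer

  # "Frontend" side: serialize the active span context into the
  # outgoing request headers.
  def client_headers do
    Tracer.with_span "frontend request" do
      # Adds e.g. a "traceparent" entry based on the active span and
      # the configured propagators.
      :otel_propagator_text_map.inject([{"accept", "text/plain"}])
    end
  end

  # "Backend" side: deserialize the incoming headers back into a
  # context, so spans started afterwards continue the client's trace.
  def handle_request(req_headers) do
    :otel_propagator_text_map.extract(req_headers)

    Tracer.with_span "backend reverse" do
      # This span now carries the client's trace ID, with the
      # client's span as its parent.
      :ok
    end
  end
end
```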

The main reason for this request: for a simple example implementation in Elixir, it took me ~4 hours of reading documentation (which is mostly empty in this area of the library) and source code before being able to properly see the client's trace ID in the server's parent trace ID.

(and this implementation is probably not as intended)

svrnm commented 1 year ago

I agree that it is something our docs should eventually have; see also the discussion in this ticket: https://github.com/open-telemetry/opentelemetry.io/issues/1862

A good starting point for learning is the demo, which gives you a complex environment to play with. Beyond that, as said in the other ticket as well, ideally you should not have to worry about context propagation yourself: instrumentation libraries can take care of that for you, so if you use any of the following you should be covered:

https://opentelemetry.io/registry/?language=erlang&component=instrumentation

If not, let the Erlang community know that you need a specific library that's not yet covered.

cc @open-telemetry/erlang-approvers
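(For completeness: the propagators themselves are just configuration. A minimal sketch for Elixir, assuming the SDK's text_map_propagators application setting; the defaults already include W3C trace context and baggage:)

```elixir
# config/runtime.exs
import Config

config :opentelemetry,
  text_map_propagators: [:trace_context, :baggage]
```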

tsloughter commented 1 year ago

I've wanted something similar to what is described. If every language implemented the same two client/server services, they could be swapped out in a docker-compose file and you would expect the same result.

I also want this to do actual testing of whether implementations match the spec, but that is a separate issue.

cartermp commented 1 year ago

@NobbZ where did you look when you expected to find an explanation + sample?

I agree, it is a big gap. We don't describe in general how context propagation works, nor do we have a dedicated section + example for each language.

svrnm commented 1 year ago

> I've wanted something similar to what is described. If every language implemented the same two client/server services, they could be swapped out in a docker-compose file and you would expect the same result.

ACK, here's how I see this in the future:

But there are a few things that need to be done to get there :)

NobbZ commented 1 year ago

> A good starting point for learning is the demo, which gives you a complex environment to play with.

The demo is well hidden. I searched the documentation and found no hints, asked Google with a couple of keyword combinations, and tried various things in the GitHub org repo search (but not "demo", obviously).

Perhaps linking it from the documentation instead of from "community" might help with discoverability.

Also, a big problem seems to be that I cannot make it work…

Something wants to bind [::]:8080, which is already in use, and I do not want to stop that service. I want to run the demo bound to another port, and not to :: either!

The system lacks documentation on how to change this.

By digging through the compose file I found ENVOY_PORT. Setting it to 8081 does make the services start and stay up, but even after 15 minutes of waiting I get a 503:

upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: Cannot assign requested address

No clue how to proceed…

> instrumentation libraries

I checked hex.pm, and the plug, cowboy, and phoenix (server-side) instrumentation libraries I found did not talk about taking a context from the request, just about starting a span for each request, so the system remains isolated.
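For HTTP, what my prototype does by hand is roughly the following (just a sketch; it assumes that conn.req_headers is an acceptable carrier for the configured text map propagator):

```elixir
defmodule MyApp.ExtractTraceContextPlug do
  @behaviour Plug

  @impl true
  def init(opts), do: opts

  @impl true
  def call(conn, _opts) do
    # Attach the remote context carried in the traceparent/baggage
    # headers to this process, so spans started later in the pipeline
    # become children of the caller's span instead of new root spans.
    :otel_propagator_text_map.extract(conn.req_headers)
    conn
  end
end
```

(plugged into the endpoint early, before anything that starts spans)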

I haven't found an instrumentation library for httpoison (HTTP client) at all.


From skimming the sources of the Elixir/Erlang-based service in the demo, it seems as if I need an interceptor, though this word again is not mentioned in the docs at all… I have not found any documentation about how to create one. In general, the sparse docs and the source code of the Elixir and Erlang libraries seem to use a lot of domain vocabulary that is not explained anywhere.

tsloughter commented 1 year ago

An interceptor is gRPC-specific. The propagation to the featureflag service is over gRPC, so it has the grpcbox interceptor enabled.

There is no httpoison instrumentation library that I am aware of. Taking a quick look, it appears httpoison supports a sort of middleware hook (handle_request_headers), so it would be simple enough to create one -- relatively speaking: I mean compared to hackney, which has no such helpers :), not in the sense that the docs are in a state that makes it simple to create a new library.
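Something along these lines should be all it takes (a sketch only; I'm assuming the overridable request-header hook on HTTPoison.Base, process_request_headers/1 in the versions I've looked at):

```elixir
defmodule TracedHTTPoison do
  use HTTPoison.Base

  # Called for every outgoing request: inject the current span
  # context (traceparent/baggage headers) into the request headers.
  def process_request_headers(headers) do
    :otel_propagator_text_map.inject(headers)
  end
end

# Usage: TracedHTTPoison.get("http://backend:4001/reverse?name=otel")
```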

tsloughter commented 1 year ago

Oops, I guess I should have googled first: https://github.com/primait/telepoison

I need to get them to submit that to the contrib repo :)

NobbZ commented 1 year ago

In this case httpoison was just the library I happened to use for the prototype.

And I wouldn't actually need to trace that one (unless I can extend the trace to include Elasticsearch processing the query).

It is basically the end of everything we can observe. I will, though, need to find ways to trace from a JS frontend across backends in JS, Ruby, and Elixir, which do some back and forth.

And despite the documentation situation, I still think OT.io is the correct tool for that job.

And this is something I will continue to prototype even in my free time, as it really would solve some urgent pain I have with the reliance on tribal knowledge within the team…

chalin commented 1 year ago

@svrnm - tag this as an enhancement request rather than a bug?