mecseid opened this issue 2 years ago
Zipkin has a "shared span" approach to RPC calls. In this case the client and server span share the same span IDs. We account for that on the query path, but I'm wondering if the service graph code handles it as well.
https://zipkin.io/zipkin/2.11.3/zipkin/zipkin2/Span.html#shared-- "When an RPC trace is client-originated, it will be sampled and the same span ID is used for the server side."
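To make the model concrete, a minimal hand-written Zipkin V2 payload for a single RPC might look like this (all IDs, names, and timestamps below are made up) — both sides report the same `id`, and the server side marks itself with `shared: true`:

```json
[
  {
    "traceId": "5af7183fb1d4cf5f",
    "id": "6b221d5bc9e6496c",
    "kind": "CLIENT",
    "name": "get /api/orders",
    "localEndpoint": { "serviceName": "api-gateway" },
    "timestamp": 1660000000000000,
    "duration": 120000
  },
  {
    "traceId": "5af7183fb1d4cf5f",
    "id": "6b221d5bc9e6496c",
    "kind": "SERVER",
    "shared": true,
    "name": "get /api/orders",
    "localEndpoint": { "serviceName": "sales-service" },
    "timestamp": 1660000010000000,
    "duration": 100000
  }
]
```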
At a glance, I can't really see why this would trip up the service graphs processor. I will have to write a test and dig a bit deeper into how Zipkin traces are processed internally.
@joe-elliott @kvrhdn Thank you so much! We produce the traces with Spring Cloud Sleuth, which has a stable integration only with Zipkin and, as far as I know, only experimental OTel support 😞.
If I can help in any way, please tell me.
Hey peeps! Is there anything I can help with on this?
@nicolastakashi we don't have any Zipkin-instrumented applications internally, so it's quite difficult for us to diagnose and fix this. If you are producing Zipkin traces (and feel up to the work), we'd love to see a PR to the metrics generator that makes service graphs work for Zipkin.
The basic problem is that the client and server spans share the same span id and the same parent id in Zipkin. The current code expects the server span to have a parent id equal to the client span's id. It uses this relationship to match up two spans and create an edge:
It would require some thought, but I'm hopeful we can create logic that would correctly process both types of spans.
Jaeger/OTel:
client span (id: 1, parent id: 0)
server span (id: 2, parent id: 1)
Zipkin:
client span (id: 1, parent id: 0)
server span (id: 1, parent id: 0)
We have code that does this on the query path: https://github.com/grafana/tempo/blob/main/modules/frontend/deduper.go#L125
but I'm not sure how applicable it is here.
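To make the two conventions above concrete, here is a rough sketch of a predicate that would accept both (simplified to plain string IDs; this is an illustration, not the actual processor code):

```go
package servicegraph

// isEdge reports whether a client span and a server span belong to the
// same RPC call. The first condition covers the Jaeger/OTel parent/child
// model, the second covers Zipkin's shared-span model where both sides
// reuse the same span id.
func isEdge(clientSpanID, serverSpanID, serverParentSpanID string) bool {
	// Jaeger/OTel: the server span is a child of the client span.
	if serverParentSpanID == clientSpanID {
		return true
	}
	// Zipkin shared span: client and server report the same span id.
	return serverSpanID == clientSpanID
}
```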
@joe-elliott I'll try my best to set it up locally. I have services sending Zipkin spans, so I think I can debug it quite easily.
Hey @joe-elliott, I was able to run it locally and debug it, and with some effort I think I understand the issue you've described.
Let me share a few tools that can help with testing Zipkin scenarios in the future:
1 - There is a repository with JSON examples of Zipkin traces that you can POST to the Tempo distributor, as you can see here (see the sketch after this list).
2 - If I'm not wrong, the synthetic-load-generator has a flag called --zipkinV2JsonUrl that we can also point at the distributor.
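As a concrete sketch of option 1 — assuming the distributor has the Zipkin receiver enabled and reachable on the default Zipkin port 9411, and that the export is saved as trace.json — a small Go helper could POST the file straight to it:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	// Read a Zipkin V2 JSON trace export, e.g. one of the example files
	// mentioned above or the trace attached to this issue.
	payload, err := os.ReadFile("trace.json")
	if err != nil {
		log.Fatal(err)
	}

	// Assumes the Tempo distributor's Zipkin receiver is enabled and
	// listening on the default Zipkin collector endpoint.
	resp, err := http.Post("http://localhost:9411/api/v2/spans",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("distributor responded:", resp.Status)
}
```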
Regarding the main issue, with a little bit of effort I think I can understand it, and I would like to check with you before moving forward.
I was thinking of using the same dedupe strategy mentioned here before executing the consume function in servicegraphs.go.
I've checked, and I think the contract is different, so we would either have to duplicate the code or try to abstract the deduper module.
WDYT?
The issue with using the frontend deduper is that it expects the entire trace to be present. Currently the service graph processor holds parent or child spans in memory and watches for the relevant match. The way a "match" is lined up is what will need to change.
Right now I believe it's looking for two spans with kind server and client, where the server span's parent id has to match the client span's id. Support for producer/consumer spans was also added recently, with the same span id requirements.
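Purely as a sketch of how that matching could be extended (simplified types, not the real processor code): the "downstream" half of an edge could be stored and looked up under two candidate keys instead of one:

```go
package servicegraph

// spanInfo is a simplified stand-in for the span fields the processor
// cares about; the real code works on OTLP spans, this is only for
// illustration.
type spanInfo struct {
	Kind         string // "client", "server", "producer", "consumer"
	SpanID       string
	ParentSpanID string
}

// candidateKeys returns the keys under which a span should be stored or
// looked up when pairing the two halves of an edge.
func candidateKeys(s spanInfo) []string {
	switch s.Kind {
	case "client", "producer":
		// The upstream half is always keyed by its own span id.
		return []string{s.SpanID}
	case "server", "consumer":
		// Jaeger/OTel: match on the parent span id.
		// Zipkin shared span: the server span reuses the client's span id,
		// so its own span id is a second candidate key.
		return []string{s.ParentSpanID, s.SpanID}
	default:
		return nil
	}
}
```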
@joe-elliott If I disable this shared span functionality, will it work as a workaround?
To make the question specific to Spring Sleuth, here is the corresponding property: https://docs.spring.io/spring-cloud-sleuth/docs/current/reference/html/appendix.html
spring.sleuth.supports-join (default true)
True means the tracing system supports sharing a span ID between a client and server.
I do know that "shared span" is what zipkin calls the model where the same span id is shared between client and server, but I have no idea what that setting does :).
Give it a try and let us know!
@joe-elliott Update for https://github.com/grafana/tempo/issues/1495#issuecomment-1214421927: it looks like if I disable this feature, then it works as expected, so whoever uses Spring Sleuth can disable it and use Tempo as expected. :)
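Concretely, for anyone else landing here, the workaround amounts to flipping the property quoted above, e.g. in application.properties (or the YAML equivalent):

```properties
# Stop Sleuth from sharing the client span id with the server side,
# so client and server spans get distinct ids like Jaeger/OTel.
spring.sleuth.supports-join=false
```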
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.
Describe the bug: I made some requests like the attached trace JSON, but the metrics_generator did not create any service graph metrics.
To Reproduce: I attached the JSON export of the trace: trace.json.txt
Expected behavior: A service graph is generated with 4 services (api-gateway, sales-service, crm-service, document-generator-service).
Environment:
Configuration: Attached tempo.yaml file.