userver-framework / userver

Production-ready C++ Asynchronous Framework with rich functionality
https://userver.tech
Apache License 2.0
2.37k stars 275 forks source link

feat traces: potentially incorrect start and end times of trace spans #677

Closed TertiumOrganum1 closed 2 weeks ago

TertiumOrganum1 commented 2 weeks ago

We've observed potentially incorrect startTime and endTime values for spans in distributed traces when using OTLP (and even before). This issue was identified after extracting spans from OTLP and analyzing their start-end times.

Comparing to traces produced by golang applications (router and gateway): spans_go

userver traces look weird: spans_userver

Observations:

  1. When calculating relative (to parent) and absolute (to root) time shifts in milliseconds, we noticed unusual database access times (around one second).

  2. Paradoxically, the start times of spans relative to the root span are in the millisecond range, even if span in 2nd service caused by operation in 1st service.

  3. This discrepancy might be caused by the coroutine engine, possibly recording the time when a task is queued rather than when it actually starts executing.

Steps to Reproduce:

  1. Extract spans from OTLP traces generated by a userver-based application.
  2. Calculate relative and absolute time shifts for each span.
  3. Analyze the resulting timings, paying attention to database access spans and their relation to the root span.

Additional Information:

Questions:

  1. Is this behavior expected or a potential bug in span timing calculation?
  2. If it's expected, could you provide clarification on how span timings are determined in the context of the coroutine engine?
  3. If it's not expected, what could be causing this discrepancy, and how can it be addressed?

Impact: This issue affects the accuracy of performance metrics and may lead to misinterpretation of trace data, particularly for database operations.

We would appreciate any insights or clarification on this matter. Thank you for your attention to this issue.