open-telemetry / opentelemetry-cpp

The OpenTelemetry C++ Client
https://opentelemetry.io/
Apache License 2.0

Parent-child discovery error in examples/simple when built as release and shared library #1014

Open Pupilsch opened 3 years ago

Pupilsch commented 3 years ago

Describe your environment: My system is Ubuntu 18.04.5 LTS, and the compiler is gcc 7.5.0.

Steps to reproduce: I am trying to use opentelemetry-cpp as a shared library for my project, but I found that when I build it as a release shared library, parent-child discovery breaks. The problem can be reproduced by just running examples/simple/example_simple. Building in Debug (the cmake default) or as a static library does not show this error.

I use the following commands to build:

cmake .. -DCMAKE_POSITION_INDEPENDENT_CODE=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
cmake --build . --target all

What is the expected behavior? In examples/simple, the parent_span_id of f1 should be the span_id of f2, and the parent_span_id of f2 should be the span_id of library. Instead, both show 0000000000000000.
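For reference, the example nests its spans through trace::Scope, roughly like this (a simplified sketch of examples/simple/foo_library, not the exact source). f1 is called twice, which is why two f1 spans appear in the output below.

#include "opentelemetry/trace/provider.h"
#include "opentelemetry/trace/scope.h"

namespace trace = opentelemetry::trace;
namespace nostd = opentelemetry::nostd;

// Simplified sketch of the example. Each trace::Scope makes its span the
// active one, so spans started while that scope is alive should pick it up
// as their parent.
nostd::shared_ptr<trace::Tracer> get_tracer()
{
  auto provider = trace::Provider::GetTracerProvider();
  return provider->GetTracer("foo_library");
}

void f1()
{
  auto scoped_span = trace::Scope(get_tracer()->StartSpan("f1"));  // parent should be f2
}

void f2()
{
  auto scoped_span = trace::Scope(get_tracer()->StartSpan("f2"));  // parent should be library
  f1();
  f1();
}

void foo_library()
{
  auto scoped_span = trace::Scope(get_tracer()->StartSpan("library"));  // root span
  f2();
}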

What is the actual behavior?

{
  name          : f1
  trace_id      : cbbb2e3813a548c408de4b484dc723a1
  span_id       : ce90a9eaf3214e19
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826620364
  duration      : 7665
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
    service.name: unknown_service
    telemetry.sdk.version: 1.0.0
    telemetry.sdk.language: cpp
    telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : f1
  trace_id      : 3dc7f1b4da6e2eedff960b136ab6e305
  span_id       : a145e9e7fca8f666
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826785278
  duration      : 1634
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
    service.name: unknown_service
    telemetry.sdk.version: 1.0.0
    telemetry.sdk.language: cpp
    telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : f2
  trace_id      : 94d5abc1a0839e21a97ab9bb3564cbea
  span_id       : 8c0d473664eeeb48
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826616048
  duration      : 296248
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
    service.name: unknown_service
    telemetry.sdk.version: 1.0.0
    telemetry.sdk.language: cpp
    telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}
{
  name          : library
  trace_id      : 0f0bc2eebeb389b448502ac1f015ed1e
  span_id       : 72fdbadc8e352b5d
  tracestate    : 
  parent_span_id: 0000000000000000
  start         : 1634260358826597245
  duration      : 460802
  description   : 
  span kind     : Internal
  status        : Unset
  attributes    : 
  events        : 
  links         : 
  resources     : 
    service.name: unknown_service
    telemetry.sdk.version: 1.0.0
    telemetry.sdk.language: cpp
    telemetry.sdk.name: opentelemetry
  instr-lib     : foo_library
}

There is also a failing test:

/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:130: Failure
Value of: span2.at(0)->GetParentSpanId().IsValid()
  Actual: false
Expected: true
/home/xxx/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:142: Failure
Expected equality of these values:
  span1.at(0)->GetTraceId()
    Which is: 16-byte object <11-E7 89-39 C6-F1 C7-94 00-2A 0C-55 77-86 CC-F1>
  span2.at(0)->GetTraceId()
    Which is: 16-byte object <7A-F8 0A-45 00-73 E5-4C 09-3E D9-58 8D-D0 60-F3>
/home/chunhao.shen/src/opentelemetry-cpp/sdk/test/trace/tracer_test.cc:143: Failure
Expected equality of these values:
  span2.at(0)->GetParentSpanId()
    Which is: 8-byte object <00-00 00-00 00-00 00-00>
  span1.at(0)->GetSpanId()
    Which is: 8-byte object <4D-6D D3-1B 08-D4 0C-9F>
lalitb commented 3 years ago

This is weird. It works fine with Ubuntu 18.04 + gcc-4.8 and Ubuntu 20.04 + gcc-9.3 in our GitHub CI environment, but I did a quick test on Ubuntu 18.04 and got the same behavior as yours. It needs further debugging. Thanks for reporting the issue.

Pupilsch commented 3 years ago

Note that the trace_id is also different. My guess is that when the compiler does its optimizations, the thread-local "Stack" in runtime_context behaves oddly. Just a thought.
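To illustrate the guess, here is a toy model of that thread-local stack (not the real runtime_context.h code, just the idea):

#include <vector>

// Toy model of the per-thread context stack (not the real implementation).
struct ToyContext
{
  bool has_parent = false;  // stands in for "carries a valid parent span"
};

static thread_local std::vector<ToyContext> tls_stack;

// Writer path: trace::Scope attaches the new span's context.
void Attach(const ToyContext &ctx) { tls_stack.push_back(ctx); }

// Reader path: StartSpan asks for the current context to find its parent.
ToyContext GetCurrent()
{
  return tls_stack.empty() ? ToyContext{} : tls_stack.back();
}

// If release-mode optimization made the writer and the reader effectively use
// two different copies of tls_stack, GetCurrent() would always see an empty
// stack: every span would become a root span with an all-zero parent_span_id
// and its own fresh trace_id, which matches the output above.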

cqhan777 commented 3 years ago

@lalitb This is our patch:

--- a/api/include/opentelemetry/context/runtime_context.h
+++ b/api/include/opentelemetry/context/runtime_context.h
@@ -300,7 +300,13 @@ class ThreadLocalContextStorage : public RuntimeContextStorage
     Context *base_;
   };

-  Stack &GetStack()
+  static Stack &GetStack() __attribute__((noinline, noclone))
   {
     static thread_local Stack stack_ = Stack();
     return stack_;

It seems to be a bug in gcc that somehow loses track of the static storage property of stack_ when it performs the ISRA clone. Disabling cloning seems to work around this, although it pays some performance cost. The additional static is not a must, but since the function does not use this, making it static saves a little parameter passing. I understand that direct use of the attribute is not recommended here; this is just for your reference. Hope this can help a little.
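If someone wants to carry this locally without affecting other compilers, the attribute can be wrapped in a small guard, something like this (untested sketch, the macro name is made up):

// Untested sketch: apply the workaround only for GCC builds.
#if defined(__GNUC__) && !defined(__clang__)
#  define WORKAROUND_NO_IPA_CLONE __attribute__((noinline, noclone))
#else
#  define WORKAROUND_NO_IPA_CLONE
#endif

  static Stack &GetStack() WORKAROUND_NO_IPA_CLONE
  {
    static thread_local Stack stack_ = Stack();
    return stack_;
  }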

lalitb commented 3 years ago

Thanks, @cqhan777, @Pupilsch, this is helpful.

github-actions[bot] commented 2 years ago

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no further activity occurs.

github-actions[bot] commented 2 years ago

Closed as inactive. Feel free to reopen if this is still an issue.

lalitb commented 2 years ago

@cqhan777 @Pupilsch - Is this issue still relevant, and reproducible?

cqhan777 commented 2 years ago

Problem still exists but no longer relevant to us since we have the workaround.

github-actions[bot] commented 2 years ago

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no further activity occurs.

github-actions[bot] commented 2 years ago

Closed as inactive. Feel free to reopen if this is still an issue.

sunyata0 commented 2 years ago

@Pupilsch Did you figure out how to solve the issue? I have exactly the same problem in my environment:

But it works fine on my Ubuntu 20.04 with gcc 9.4.0 (and also 8.4.0) and C++17.

lalitb commented 2 years ago

@Briac-nt If it is the same problem, the patch from @cqhan777 should work for you:

https://github.com/open-telemetry/opentelemetry-cpp/issues/1014#issuecomment-944821770

Please confirm.

sunyata0 commented 2 years ago

Hi @lalitb, unfortunately this patch doesn't work for me, and neither does the trick from issue 1223. I changed Stack &GetStack() to static Stack &GetStack() __attribute__((noinline, noclone)) and added a console log to be sure the modification was live. The problem is still there, very strange...
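A rough way to check whether the parent is visible right before the child span starts (hypothetical helper, called e.g. at the top of f1) would be:

#include "opentelemetry/context/runtime_context.h"
#include "opentelemetry/trace/context.h"
#include <iostream>

// Hypothetical debug helper (name is made up): prints whether the currently
// active span, i.e. the would-be parent, is valid at the point where a child
// span is about to be started.
void debug_active_span()
{
  auto ctx  = opentelemetry::context::RuntimeContext::GetCurrent();
  auto span = opentelemetry::trace::GetSpan(ctx);
  std::cout << "active span valid: " << span->GetContext().IsValid() << std::endl;
}

If it prints 0 inside f1 while f2's Scope is still alive, the writer and the reader really do see different context stacks; if it prints 1, the problem is somewhere else.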

Falmarri commented 8 months ago

I think I have a similar issue, but with a more complex setup. I'm compiling otel as a static library, then building a shared library out of that static lib. And it doesn't seem to always lose the stack, only sometimes. I'm not sure how to provide more info; it's super complex and I only see this happening in my production system.

lalitb commented 8 months ago

@Falmarri Thanks, can you also add the platform details here (Linux distro, gcc version, build flags, etc.)?

Falmarri commented 8 months ago

@lalitb This is built on Ubuntu 18.04.6 LTS using

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-pc-linux-gnu/10.5.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --disable-multilib --enable-default-pie
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.5.0 (GCC)

I'm not entirely sure this is the same issue now that I've looked a bit further. What I'm seeing is that the trace is actually being propagated correctly. That is, I do this:

    auto current_ctx = opentelemetry::context::RuntimeContext::GetCurrent();
    propagator->Inject(carrier, current_ctx);

and the remote service correctly gets the right trace/span, so I actually do have the right context. However, the span in question is never actually exported, so my span never shows up in my collector and the traces are pretty broken.

The reason I think it's related to this, though, is that it's always this one particular span, and this span is in a separate compilation unit from the rest of the shared lib. So how this works is that I have the following:

I realize this isn't a lot of information, but like I said, this only happens in our production system and I can't reproduce it on demand or figure out whether there's a pattern.
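One thing I can still try locally is ruling out sampling and export buffering; a rough sketch of that check (names are mine, and it assumes access to the SDK TracerProvider and an SDK version that has TracerProvider::ForceFlush) would be:

#include "opentelemetry/nostd/shared_ptr.h"
#include "opentelemetry/sdk/trace/tracer_provider.h"
#include "opentelemetry/trace/span.h"
#include <chrono>
#include <memory>

// Rough sketch: `provider` is the SDK TracerProvider installed at startup,
// `span` is the one that never reaches the collector.
bool check_span_and_flush(
    const std::shared_ptr<opentelemetry::sdk::trace::TracerProvider> &provider,
    const opentelemetry::nostd::shared_ptr<opentelemetry::trace::Span> &span)
{
  // A sampled-out or otherwise dropped span reports IsRecording() == false.
  bool recording = span->IsRecording();

  // Drain the processor/exporter pipeline so buffering can be ruled out.
  bool flushed = provider->ForceFlush(std::chrono::seconds{5});

  return recording && flushed;
}

If IsRecording() is false, the span was dropped before export ever happened; if a forced flush makes it appear in the collector, it was only stuck in buffering.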