Can the API avoid returning shared_ptr

bogdandrutu commented 3 years ago

See: https://github.com/open-telemetry/opentelemetry-cpp/blob/main/api/include/opentelemetry/trace/tracer.h#L39

By having this design an atomic ref-counting will always be needed. This will block in the future a possibly more performant "streaming" implementation where Span == SpanContext and all "record" apis will stream an event. The reason for "blocking" is because will still pay the refcounting and heap allocation cost even though is not needed.

The SDK can implement the Span by having an internal shared_ptr using a pimpl pattern.

bogdandrutu commented 3 years ago

Same comment should apply for "Baggage" as well. It should allow an alternative implementation (even though is not a proper implementation but could be a header replace) that does not require Baggage to be heap allocated and refcounted.

jsuereth commented 3 years ago

We (Google) would be a fan of this change, giving us freedom to control performance going forward with differing implementations

pyohannes commented 3 years ago

This should be doable.

I'm not sure if we can preserve the ability to automatically end a span when it is deallocated (or goes out of scope). We might end up requiring explicit calls to End for all spans that are not managed by a Scope.

bogdandrutu commented 3 years ago

You can achieve exactly the same behavior by having internally a shared_ptr<SpanImpl> and have the copy ctor/assignment do ref and destructor unref, and when out of scope you call End. It is just a variant that does not change the behavior but changes the ownership of the reference object, and allows alternative implementations that can avoid using reference counting.

lalitb commented 3 years ago

Had a quick look into span api code, the problem I see here is that api::Span is an abstract class, which means we can't return a copy of it, so we have to return a reference or pointer of it ( holding the instance of sdk::Span), and this will introduce the typical ownership/memory issues where application/instrumentation-code is still holding the reference/pointer to object which has been already deleted.

For the rest of the components like Baggage / TraceState, we can use the Pimpl idiom to get rid of returning shared_ptr as implemented here: #836

lalitb commented 3 years ago

Just to add on top of the above comment, while we can add a non-abstract Span at API surface which internally contains a pointer to sdk::Span, and return this instead of shared_ptr. But this will make it difficult to maintain ABI compatibility. Any changes in internal structure of this non-abstract Span can break ABI compatibility. Returning shared_ptr allows us more flexibility to make internal changes without breaking ABI. Same holds true for Baggage api.

ThomsonTan commented 3 years ago

Could we add a new API to support returning pointer/reference to Span? Even we don't guarantee backward compatibility, we could try to not break existing customer.

pyohannes commented 3 years ago

One possible approach for this, which can be implemented very easily:

Return api::Span as a unique_ptr when a span is started, only convert it into a shared_ptr if it is set as the currently active span in the context. Because only in this case, the ownership of the span is truly shared (one copy is owned by the context, another by the user). With this solution users who want to benefit from the more efficient approach mustn't use automated context management and parenting.

Anyway, if we decide to go without using shared_ptr we'd need to start doing some prototyping, as the devil is in the details.

bogdandrutu commented 3 years ago

@lalitb can you please explain a bit more

But this will make it difficult to maintain ABI compatibility. Any changes in internal structure of this non-abstract Span can break ABI compatibility. Returning shared_ptr allows us more flexibility to make internal changes without breaking ABI. Same holds true for Baggage api.

What is the difference in your baggage prototype from ABI compatibility? If you change the current Baggage class is it not the same?

lalitb commented 3 years ago

What is the difference in your baggage prototype from ABI compatibility? If you change the current Baggage class is it not the same?

Returning Baggage as an object exposes the memory layout of the object to the caller, and the caller can start having a dependency on this memory layout which can break ABI compatibility. As an example, if the caller is using sizeof(Baggage) in their code, and we add a new private data member in the Baggage class, this will break the caller code. Returning shared_ptr<Baggage> won't give caller this flexibility.

Though you are correct for the baggage prototype, as most of the implementation changes would happen in BaggageImpl, which will not break the ABI compatibility.

maxgolov commented 3 years ago

Can we have an implementation that performs the following:

using SpanValue = nostd::shared_ptr - to maintain backwards compatibility with v1.0.0-rc1 API. I already have customers using this API surface since v0.6... Then return that. And alternate implementation may decide how to override that implementation of SpanValue with a better performing implementation.
perf concern is somewhat exaggerated for spans with "reasonable" duration, i.e. not a synthetic benchmark that emits 0-microseconds spans, but spans that actually perform some measurable amount of work.. Otherwise for immeasurable amount of work less than 0ms, I'd argue the instrumentation is unnecessary. You'd want to monitor and optimize the hot-path that is consuming most CPU cycles, not overburden it with unnecessary span creation? It'd be hard to measure the perf difference on non-synthetic benchmark. I'd like to see the concrete numbers from Google Benchmark. Something like this, but in a loop that constitutes some reasonable amount of work.
then assuming we have SpanValue, as value object that possibly uses PIMPL to wrap a handle (I was always wondering why we use shared_ptr instead of returning something that is a handle), we could have allowed the alternate implementations that are better optimized for the streaming scenario.

By the way, ETW exporter is a "streaming" implementation. We can do measurements to tell how big the overhead is of the shared_ptr. ETW exporter / TracerProvider implementation does not use a batcher / background thread, it's a pass-thru of all spans and events to ETW layer, that allows to reassemble and/or filter the spans and events in batches by out-of-proc agent listener.

My biggest concern is refactoring a major piece something that reached v1.0.0-rc1. I'd like to see some smooth transition to a better model instead of a breaking change. If SpanValue is not a good name, let's come up with a better name. But at least tackle this in a non-breaking, incremental, fashion.

lalitb commented 3 years ago

The solution would depend on what we are trying to achieve here by removing ABI stable smart-pointer from API surface (sorry but I am still trying to get more clarity):

performance improvement - It would be interesting to see if we can achieve performance improvement by removing shared_ptr from the API surface. As Span object would be shared at multiple places, we would internally still need a way to manage the reference count of this object ( either using pimpl approach by storing shared_ptr<SpanImpl> as data member or some other way) In fact, this may add marginal cost for another layer of access redirection.
prevent unsafe smart pointer operations - If we want to prevent instrumentation libraries do unconscious unsafe operations with this smart pointer. Eg, trying to delete the managed raw pointer explicitly making the application crash, it makes sense to hide smart_pointer from the API surface. Though this will break backward compatibility.

lalitb commented 3 years ago

Ok, invested some time in prototyping span creation api to return unique_ptr instead of shared_ptr<Spam> here. Here are the benchmark results in identical load conditions:

Existing api ( using shared_ptr):

Run on (8 X 2112 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.09, 0.06, 0.02
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_SpanCreation                                        266 ns          266 ns      2665824
BM_SpanCreationWithScope                              2227 ns         2227 ns       310808
BM_NestedSpanCreationWithScope                        7332 ns         7332 ns        91774
BM_SpanCreationWithManualSpanContextPropagation       1110 ns         1110 ns       626839
BM_SpanCreationWitContextPropagation                  6608 ns         6608 ns       103034

new api returning unique_ptr and using span's raw pointer internally to manage span context:

Run on (8 X 2112 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.10, 0.07, 0.02
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_SpanCreation                                        468 ns          468 ns      1073888
BM_SpanCreationWithScope                              2270 ns         2270 ns       310252
BM_NestedSpanCreationWithScope                        7384 ns         7384 ns        92055
BM_SpanCreationWithManualSpanContextPropagation       1486 ns         1486 ns       463668
BM_SpanCreationWitContextPropagation                  6435 ns         6435 ns       103529

After doing multiple iterations, haven't seen a significant difference in performance in both scenarios. As we get rid of shared_ptr, there is extra overhead to manage the cleanup of Span from Context once Span gets deleted.

Another thing to notice is that there is a significant performance difference between Span creation with and without Scope (first two readings taken from either of the above scenarios), and it increased further with nested spans ( 3 reading above). This means we should revisit our existing linked-list / stack data structure used to manage context/spans. I will create a separate issue to investigate that, We can discuss this in the next community meeting.

ps. If you would like to review the changes done for conversion to unique_ptr: https://github.com/lalitb/opentelemetry-cpp/compare/main...lalitb:span-pointer?expand=1 ,relevant changes are confined to these files: api/include/opentelemetry/context/context_value.h api/include/opentelemetry/trace/scope.h api/include/opentelemetry/trace/span.h api/include/opentelemetry/trace/tracer.h

Code used for benchmarking: https://github.com/open-telemetry/opentelemetry-cpp/blob/main/api/test/trace/span_benchmark.cc

maxgolov commented 3 years ago

So if the new proposed approach degrades perf of span creation x1.75 times, and/or does not improve performance overall - for the sum across all tests (not a good metric though), is worse - because of BM_SpanCreation and BM_SpanCreationWithManualSpanContextPropagation , this kinda defeats the purpose, or disproves the actuality of original expectations?

The reason for "blocking" is because will still pay the refcounting and heap allocation cost even though is not needed.

Why would we be doing that again? I think shared_ptr semantics allow us to pass Span objects to worker threads, properly representing partial ownership and safe disposal of the objects if- and when- needed. Performance point is clearly moot to say the least. auto keyword addresses most inconveniences. And conversion to unique_ptr does not yield any perf improvements, while taking away the partial ownership semantics.. I think it's not good, esp. when v1.0 API is stable, working, and many people are using it / trying it out successfully.

lalitb commented 3 years ago

This was discussed in the last weekly meeting. After evaluating the current benchmark results and comparing it with that of unique_ptr based prototype, it is planned -

to put on hold the API changes related to shard_ptr for now while we are nearing the release.
revisit this post GA if any related performance issue is reported. Any new ABI changes can be added as per the Versioning guidelines here while maintaining ABI stability.
Evaluate our internal data structures used to manage Context for any performance issues as indicated by current benchmark results.

@bogdandrutu - please let us know if you see any concerns. We can keep this issue open and tag it for post-GA accordingly.

luxapana commented 3 years ago

@lalitb Could I know if above benchmark simulated thread contention? In our application also, thread un-safe implementation of the SDK would be required due to low latency requirements. So +1 for any initiative which aims that.

lalitb commented 3 years ago

@manushawijekoon - Not sure what you mean by the thread un-safe implementation of SDK as concurrency at API level is one of the objectives of specs ( https://github.com/open-telemetry/opentelemetry-specification/blob/b46bcab5fb709381f1fd52096a19541370c7d1b3/specification/trace/api.md#concurrency ).

Regarding shared_ptr removal or providing alternate methods which return Span's unique_ptr / pointer ( at the cost of removing the context management feature specifically for these methods ), this can be considered, and have kept the issue opened for the same.

luxapana commented 3 years ago

This will limit its application in low latency systems. However that spec didn't mention anything about API concurrent for metrics

but I guess eventually that will be the case also. Here is a nice blog about thread contention. https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

I am talking about level 0 (or even lower) type implementations based on the terminology of the above blog.

On Wed, Oct 13, 2021 at 3:43 AM Lalit Kumar Bhasin @.***> wrote:

@manushawijekoon https://github.com/manushawijekoon - Not sure what you mean by the thread un-safe implementation of SDK as concurrency at API level is one of the objectives of specs ( https://github.com/open-telemetry/opentelemetry-specification/blob/b46bcab5fb709381f1fd52096a19541370c7d1b3/specification/trace/api.md#concurrency ).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/open-telemetry/opentelemetry-cpp/issues/828#issuecomment-941642496, or unsubscribe https://github.com/notifications/unsubscribe-auth/AK4QRXQUV53XJ4JTNQHMIL3UGSXJJANCNFSM46IBDCEQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

lalitb commented 3 years ago

@manushawijekoon - Thanks for the link. Just to let you know,

the context management is done on thread-local storage https://github.com/open-telemetry/opentelemetry-cpp/blob/5aa73655b5bb1af0be22463255b62e6d2f7dd10f/api/include/opentelemetry/context/runtime_context.h#L310 to ensure that context is managed per thread level, which mitigates thread contention.
Mutual exclusion is achieved using mutex over atomic flags and spin-locks to ensure not blocking threads. More here: https://github.com/open-telemetry/opentelemetry-cpp/blob/5aa73655b5bb1af0be22463255b62e6d2f7dd10f/api/include/opentelemetry/common/spin_lock_mutex.h#L52 .
As mentioned earlier, providing alternate methods to create Span without shared_ptr is in plan with no deadlines for now.
The Metrics API is under development as per the latest specs, and so will change in near future ( we may get rid of shared_ptr wherever required ). The concurrency requirements for metrics is here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#concurrency-requirements.

github-actions[bot] commented 2 years ago

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.

github-actions[bot] commented 2 years ago

Closed as inactive. Feel free to reopen if this is still an issue.

lalitb commented 2 years ago

keeping this open.

yuvalif commented 2 years ago

the ceph project will also be interested in using non shared_ptr spans. we did some benchmarking of our code using no-op spans and found they have a detectable impact on perf.

lalitb commented 2 years ago

@yuvalif - Do you have more details, which part of the Span life cycle has a detectable impact - creation, populating, or export/end? How you did these non-shared span tests, it would be interesting to reproduce this locally. Meanwhile, there is ongoing work on adding benchmark tests as part of CI (#1170) to get more stats.

yuvalif commented 2 years ago

@yuvalif - Do you have more details, which part of the Span life cycle has a detectable impact - creation, populating, or export/end? How you did these non-shared span tests, it would be interesting to reproduce this locally. Meanwhile, there is ongoing work on adding benchmark tests as part of CI (#1170) to get more stats.

our main focus (for now...) is to test the no-op tracer and no-op span (we want to make sure that there is no performance degradation when the feature is disabled). this means that the only cost is with creating and accessing the spans (as these are no-op spans they don't do anything else).

our open telemetry based tracing framework is here: https://github.com/ceph/ceph/blob/master/src/common/tracer.h https://github.com/ceph/ceph/blob/master/src/common/tracer.cc we currently use it in 2 of our daemons, the radosgw (e.g. here) and the OSD (e.g. here)

below are links for our benchmark testing. we use an S3 client called hsbench that sends PUT and GET requests to our system. these requests are going through both the radosgw and OSD daemons.

results when tracing is compiled out: https://paste.sh/_2tcdFnX#-0niw2jRRi-x59YCj53kaCsG
results when tracing is compiled in, and disabled (no-op tracer/span): https://paste.sh/3sKNqlHd#PaQWkUmZZloWalMvA1O6Nd_c we see: 100*(2178 - 2119)/2178 = 2.7% degradation for the GET operations (which are more lightweight, hence more sensitive to overhead).

we also did deeper perf analysis (this is harder to share), where creating the spans and the "Get" operation on the shared_ptr stand out as the difference between the 2 runs. furthermore, we reduced the number of spans per operation on the OSD from ~15 to 2, and degradation dropped to 1%.

lalitb commented 2 years ago

Thanks, @yuvalif for the details. Just to clarify the results:

GET request when tracing compiled out: 2022/01/12 08:03:16 Loop: 0, Int: TOTAL, Dur(s): 91.8, Mode: GET, Ops: 200000, MB/s: 8.51, IO/s: 2178, Lat(ms): [ min: 1.9, avg: 7.2, 99%: 13.1, max: 22.7 ], Slowdowns: 0
GET request when tracing compiled in, and disabled (no-op tracer/span): 2022/01/13 02:58:10 Loop: 0, Int: TOTAL, Dur(s): 94.4, Mode: GET, Ops: 200000, MB/s: 8.28, IO/s: 2119, Lat(ms): [ min: 2.0, avg: 7.4, 99%: 13.2, max: 30.5 ], Slowdowns: 0

So on average, there is 7.4 - 7.2 = 0.2 ms latency added with no-op tracer/span for ~15 spans?

Have you looked into the span creation benchmark tests at api level (using NoopTracer and NoopSpan) here? Though we still need to integrate them in CI, a quick test on them in my local setup gives these results:

$ ./span_benchmark
2022-01-16 00:23:13
Running ./span_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.02, 0.08, 0.09
------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_SpanCreation                                        346 ns          346 ns      1961246
BM_SpanCreationWithScope                              1949 ns         1949 ns       357864
BM_NestedSpanCreationWithScope                        5936 ns         5936 ns       117577
BM_SpanCreationWithManualSpanContextPropagation       1043 ns         1043 ns       660873
BM_SpanCreationWitContextPropagation                  5557 ns         5557 ns       125827

Results for BM_SpanCreation (single span creation) and BM_SpanCreationWithManualSpanContextPropagation (three spans creation with parent-child relationship) should be interesting to see as they are close to what your setup looks like. Implicit context management (i.e, using Tracer::WithActiveSpan to store the parent context on TLS ) does take a few extra CPU cycles, but I don't see that used in your code. Though benchmark code doesn't call methods to populate the span (Span::AddEvent, Span::SetAttributes, etc), still these seem significantly lower numbers compared to what you are getting.

We have done a few tests returning span's unique_ptr and dropped it after not seeing substantial improvements. If you want to test that in your setup, I can send you the branch with the changes. Based on the results, we can start discussing adding the support?

yuvalif commented 2 years ago

thanks @lalitb !

We have done a few tests returning span's unique_ptr and dropped it after not seeing substantial improvements. If you want to test that in your setup, I can send you the branch with the changes.

this would be great. we would try it out in our system and let you know if we see a difference in our setup.

github-actions[bot] commented 2 years ago

This issue was marked as stale due to lack of activity. It will be closed in 7 days if no furthur activity occurs.

open-telemetry / opentelemetry-cpp

Can the API avoid returning shared_ptr<Span> #828