open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0
3.23k stars 765 forks source link

Migrate BatchExporterProcessor to async #5838

Closed verdie-g closed 3 weeks ago

verdie-g commented 2 months ago

Fixes #5778

Changes

This makes BatchExporterProcessor async to avoid having dedicated threads that are idle 99% of the time.

I first tried to make the implementation close to the original synchronous implementation where only the ExportProc method was allowed to call BaseExporter.ExportAsync but it was a unreviewable deadlock fest. This proposed implementation allows any method (Export, Flush, Shutdown) to also call ExportAsync but always limit a single concurrent export. This greatly simplifies the code and only took me 30 minutes to implement.

It is very important that we first discuss about how BaseExporter.ExportAsync will be implemented on all available exporters before merging this change, otherwise, this is going to push synchronous job on the thread pool which will make the situation worse than before.

Merge requirement checklist

cijothomas commented 2 months ago

Can you describe the overall idea on how this will be implemented? This a core functionality, and there cannot be any breaking changes at all. It is intentional that the current design spins up own thread and not rely on async tasks/thread-pool. And also the benefits of this change.

verdie-g commented 2 months ago

The exporter loop is kept but is now running on the thread pool using a simple Task.Run. The Export, Flush, and Shutdown operations were allowed to speed up the exporting by disabling the delay between two exports. In this change, it's done differently to simplify the code. The 3 operations are allowed to call the BaseExporter.ExportAsync method but only one concurrent export is allowed. BaseExporter.ExportAsync is a new method that by default does async-over-sync by calling Export just to avoid a breaking change. Though, it is important that this method gets overridden by the different implementations (zipkin, OTEL) to avoid pushing synchronous operations on the thread pool.

The idea of a thread pool is to avoid the cost of the creation/deletion of a thread as well the cost of the resources associated to the thread. But most importantly, it avoids having to perform a context switch to yield the execution to another task. In an ideal application, the thread pool has as many thread as CPU cores, and each thread never leaves its core. Each context switch will impact the throughput of the thread pool. If there is only a few extra threads used, it's fine, but if all libraries start spawning dedicated threads for their background jobs that can be an issue. Here, 3 threads are spawned, and maybe a 4th one will be added for the profiling signal. I won't have any figures to share but the OTEL library could become a better citizen in the .NET ecosystem by using the thread pool.

It is also important to note that currently, only on a part of the work done by the export is performed on the dedicated thread. When HttpClient is used, or any other network class, the thread pool will still be used.

Additionally, this will ease having an asynchronous flush method in the future, which I believe a user request will appear some day about that as you never want to block the thread pool and currently it does.

BONUS1: we get to remove the synchronousSendSupportedByCurrentPlatform. BONUS2: might fix https://github.com/open-telemetry/opentelemetry-dotnet/issues/2816

verdie-g commented 1 month ago

It is synchronous. And the specification says it must be so.

I don't think that's true. Here is a quote from the spec:

Depending on the implementation the result of the export may be returned to the Processor not in the return value of the call to Export() but in a language specific way for signaling completion of an asynchronous task.

Because of that we're always going to run into somewhere having to jump from sync to async back to sync.

I've started implementing the ExportAsync method around, and found cases where we need to synchronously wait on a Task but I don't think I found a case where a synchronous operation occurs after an asynchronous one. Actually, I'm not sure what you mean. Could you clarify that point?

The telemetry pipeline being synchronous is a good thing. We don't want to block real work being done on a thread because it tried to write some telemetry and the SDK has to wait on a busy thread pool.

When a context switch occurs to work on the OTEL thread, it will actually prevent a thread pool thread to do its job. Using the thread-pool and async exports would greatly reduce that overhead.

How will SimpleExporterProcessor work with ExportAsync? Are we expecting SimpleExporterProcessor to continue calling Export or will it call ExportAsync?

Either should be fine. If a network exporter is used with SimpleExporterProcessor, there is obviously no guarantees about the performance, so I don't think ExportAsync will change anything here.

It is also worth noting that because of the synchronous export API, some exporters need to do some sync-over-async on some platforms.

Depending on that answer, can we see some of the existing exporters (maybe start with Otlp) updated as a demonstration? We require Export be implemented via abstract. Authors will have to code something synchronous. If we offer ExportAsync how does that help with this? I don't want to see this...

For OTLP, the Export method will indeed be doing sync-over-async on the ExportAsync. But it does not matter because only ExportAsync would be called.

What I'm trying to figure out is, if I'm an exporter author, and I have to do a synchronous version anyway, why would I bother with an asynchronous version?

I think it's very similar to how the Stream API works in .NET. If we focus on the read API, by default, you only need to implement Read(byte[] buffer, int offset, int count). Other methods, such as ReadByte or ReadAsync will fallback on Read. The stream will work fine but you'll get a performance hit until you implement the other methods.

Here is a very concrete example https://github.com/dotutils/streamutils/pull/2 where some benchmarks showed that this Stream was a performance bottleneck, so extra Read methods were implemented to fix that.

In the case of ExportAsync it will be the same. Though, it is to expect to have way less implementations of BaseExporter than implementations of Stream in the wild. Also, most of the BaseExporter implementations are under our control in the OTEL repositories.

github-actions[bot] commented 1 month ago

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

verdie-g commented 1 month ago

Here is what the implementations of BaseExport.ExportAsync could look like: https://github.com/verdie-g/opentelemetry-dotnet/commit/02eb4503bcd5dfcb415351f1791ddc3b913eb5a7

github-actions[bot] commented 1 month ago

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] commented 1 month ago

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] commented 3 weeks ago

Closed as inactive. Feel free to reopen if this PR is still being worked on.

verdie-g commented 3 weeks ago

Kind of awkward that the PR gets automatically closed when it's waiting for a review 🤔 @CodeBlanch @cijothomas could you have a look at it again? The TL;DR is that you don't create threads in .NET. Some platforms don't even support it. And OTEL, which will be probably become the most downloaded nuget, creates 3-4 of them when it could use the thread pool.

CodeBlanch commented 3 weeks ago

Sorry @verdie-g the short version is this isn't a priority right now. Sorry! That doesn't mean we don't ❤️ the passion and the contribution. We really do, we've just got a lot to get done before we release 1.10.0 stable next month. This is a big change so I'm not going to consider it for inclusion this late in the cycle. We can come back to it for the next release though.

madskonradsen commented 16 hours ago

Could be amazing to have this merged at some point. A bit frustrating that OTEL doesn't even work with Blazor because of missing threading in the browser :)