launchdarkly / dotnet-core

LaunchDarkly monorepo for .NET packages
Other
1 stars 0 forks source link

.NET LaunchDarkly SDK incompatible with Orleans or other single-threaded TaskSchedulers #32

Open alexn-tinwell opened 8 months ago

alexn-tinwell commented 8 months ago

Describe the bug

We've found that unless we call Events(Components.NoEvents) on the SDK, all Orleans mesh activity is halted at app startup around the same time that an LdClient singleton is first instantiated.

For context, Microsoft Orleans is a virtual actor / service mesh framework. Grain (actor) instances are executed with a custom task scheduler that is intentionally single-threaded so that the core framework managing the instance mesh can reason about the resource impact of each grain activation and balance them across nodes as required.

Unfortunately, turning off LaunchDarkly analytics events also appears to disable auto-creation of new ContextKinds. We can reference User contexts just fine, but when trying to create a custom context kind for backend services, the context is never shown in the LD dashboard unless we turn events back on, but in that case the app hangs shortly after boot.

I theorise that it's related to the internal implementation of EventProcessorInternal, which starts the main message loop using Task.Factory.StartNew (which uses TaskScheduler.Current internally). In scenarios where LdClient is instantiated in the context of an Orleans Grain, this could block the only thread on the scheduler and prevent other work from ever occurring.

Furthermore, the implementation of RunMainEventLoop is inefficient because it pegs that thread with no slippage or sleeping between loop iterations whatsoever, so that thread is unlikely ever to relinquish context to other waiting tasks even when there is no work to perform.

For reasons I can't quite understand, EventProcessor implements no interfaces and is sealed, so we can't resolve this ourselves short of forking the SDK.

To reproduce Create any sample Orleans application with a couple of grains, declare LdClient as a singleton for the Silo, and observe communication timeouts between grains.

While I couldn't prove my theory by modifying the SDK code or overriding the implementation, I did find a correlating fix.

Spot the difference:

Hangs any app using a single-threaded task scheduler:

    public static void AddLaunchDarklyClient(this IServiceCollection collection, ISecrets secrets)
    {
        var launchDarklySdkKey = secrets.LaunchDarklySdkKey();
        if (launchDarklySdkKey.IsBlank())
            throw new ArgumentException(
                $"LaunchDarkly SDK key is blank. Please set the environment variable {Constants.LaunchDarkly.Secrets.SdkKey}.");

        collection.AddSingleton<ILdClient>(x =>
        {
            var builder = LaunchDarkly.Sdk.Server.Configuration
                .Builder(launchDarklySdkKey)
                .DiagnosticOptOut(true)
                .Build();

            return new LdClient(builder);
        });
    }

Does not hang the app because the client is instantiated on a threadpool thread using Task.Run:

    public static void AddLaunchDarklyClient(this IServiceCollection collection, ISecrets secrets)
    {
        var launchDarklySdkKey = secrets.LaunchDarklySdkKey();
        if (launchDarklySdkKey.IsBlank())
            throw new ArgumentException(
                $"LaunchDarkly SDK key is blank. Please set the environment variable {Constants.LaunchDarkly.Secrets.SdkKey}.");

        LaunchDarkly.Sdk.Server.Configuration? builder = null;
        LdClient? client = null;

        // This is a hack to get around the fact that LaunchDarkly expects to peg a thread on the current TaskScheduler, meaning
        // it starves Orleans
        Task.Run(() =>
        {
            builder = LaunchDarkly.Sdk.Server.Configuration
                .Builder(launchDarklySdkKey)
                .DiagnosticOptOut(true)
                .Build();

            client = new LdClient(builder);
        }).GetAwaiter().GetResult();

        collection.AddSingleton<ILdClient>(x =>
        {
            return client;
        });
    }

Expected behavior Merely instantiating the SDK should have no adverse affects on the hosting application nor peg a CPU thread.

Logs If applicable, add any log output related to your problem.

SDK version 8.0.0.0 but this was also present in 7.x. We ignored it until now because we weren't using events, but now want to reference custom ContextKinds and couldn't get them working until we turned events back on.

Language version, developer tools C# / .NET 7

OS/platform Any

Additional context If you must start your own long-running threads inside an SDK instead of exposing an IHostedService, always pass in TaskScheduler.Default (and ideally use Task.Run instead of Task.Factory.StartNew)

tanderson-ld commented 8 months ago

Thank you for bringing this to our attention @alexn-tinwell. We will discuss on team and get back to you later this week.

tanderson-ld commented 8 months ago

@alexn-tinwell, we investigated and were able to identify this change as having changed the code from using Task.Run to using Task.Factory.StartNew.

Documentation here seems to indicate we should use Task.Run. We will likely need to refactor the RunMainEventLoop to use an async mechanism to support usage of the SDK in a single threaded environment.

You mention steps to reproduce being to create a Orleans project. Are you able to provide the source code for a sample project in which you were able to reproduce the issue? That could help us iterate more quickly as we work to solve this. I am also unsure if there may be other logic/components in the SDK that don't operate well in a single threaded environment and can't guarantee any timeline for this refactoring to be completed.

We have long term goals of refactoring several of our SDKs to use native async support in each respect language, so this work may get lumped in to that.

Does the workaround you have of instantiating the client in Task.Run unblock you?

alexn-tinwell commented 8 months ago

(apologies for the notification noise; I fat-fingered this response while drafting)

@tanderson-ld Thanks for looking into this!

In the context of Orleans, it's not that it's single-threaded per se, it's that it overrides the Current TaskScheduler (in other words, code that references TaskScheduler.Current will essentially be using a threadpool of size 1.

Task.Factory.StartNew uses TaskScheduler.Current internally unless told otherwise.

An interim improvement on your end could be:

Task.Factory.StartNew(
            () => RunMainLoop(messageQueue, buffer),
            CancellationToken.None, // Or ideally introduce an actual cancellation token source as part of this change too
            TaskCreationOptions.LongRunning,
            TaskScheduler.Default
        );

I haven't tested as we're heads-down on something rn, but this should avoid the issue entirely because this would start your long-running process on the default threadpool rather than the one being used by the caller of the SDK.

We have been unblocked by the Task.Run bandaid yes. I think that's basically getting the SDK to start up using the regular threadpool (since Task.Run uses TaskScheduler.Default internally) so by the time our ILdClient is injected into a subsystem using Orlean's single-threaded TaskScheduler it has already spawned the long-running process.

You may prefer to replicate this by simply implementing a custom TaskScheduler with the same 1-thread limitation and then launching the SDK within a call to Task.Factory.StartNew passing in your custom scheduler. The current SDK will likely run on that single thread, and then by changing to TaskScheduler.Default within the SDK, you'll break off that single thread and run on the default threadpool. See here for more details https://chat.openai.com/share/6811c6fd-4a1f-4e2c-9fe9-927a64ee8aee

alexn-tinwell commented 2 weeks ago

👋 we're evaluating an SDK / API version upgrade at the moment and I was reminded about this issue. Has there been any further thought / progress on resolving it?