open-feature / spec

OpenFeature specification
https://openfeature.dev
Apache License 2.0
588 stars 35 forks source link

Consider the ability to opt-out from automatic functionality such as automatic provider ready events. #265

Open kinyoklion opened 2 weeks ago

kinyoklion commented 2 weeks ago

For providers which support changing their provider status at runtime the automatic events introduce additional complexity.

The provider needs to defer its handling of the status until after the automatic status provided by the OpenFeature API. Otherwise it would emit duplicate provider ready or error events. The handling can be somewhat complex for multi-threaded providers and SDKs.

It would be nice if the provider had meta-data, or other fields, which could allow for opting out of the automatic behavior. This may then be extended to other automatic functionality in the future.

Related discussion: https://github.com/open-feature/dotnet-sdk/pull/276#issuecomment-2163766166

toddbaert commented 2 weeks ago

I feel conflicted about this.

On the one hand, I see the value and the simplicity that can be gained here for underlying SDKs that emit events. On the other hand we're adding more complexity to the SDK, and cognitive load to provider authors.

One other question is I have is if this setting would also apply to if the init throws. Would we have some other indicator for that to prevent the same issue for duplicate error events?

kinyoklion commented 2 weeks ago

cognitive load to provider authors

I am not sure if this is true or not. I recently updated the LD node, python, java, and dotnet providers to handle this behavior and handling the mixed requirements of runtime events, initialization events, and the blocking behavior is by far the largest cognitive load in any of these provider implementations (aside from the node one, where we happen to have an underlying mechanism that matches well), and also very easy to get incorrect.

This optionality may or may not be the correct way to reduce the complexity. Is optimizing for the simple case the correct approach, or is it likely that this increases complexity more often than it decreases it?

Ultimately here the event channel is shared. The provider can produce events and the OF SDK can also produce events. There isn't a delineation of the types of events provided by both. So you have to encode into the provider that specific events are emitted automatically at specific points, and that those events also need to be emitted by the provider at different points.

toddbaert commented 2 weeks ago

I think I'm nearly convinced. What's your take on this?

One other question is I have is if this setting would also apply to if the init throws. Would we have some other indicator for that to prevent the same issue for duplicate error events?

Do you think we should have 2 settings? 1 for preventing emission of error and 1 for emission of ready? Or can we handle both with one?

Do you think this should be a field on ProviderMetadata? Or should we have a dedicated field on the provider interface itself for it?

cc @lukas-reining since I think some of his recent work with a config-cat provider runs into this exact issue.

lukas-reining commented 1 week ago

Having this data on the metadata feels pretty complicated to me @toddbaert @kinyoklion.

The initialization function is always async in JS and in other languages often too, or it is considered sync and blocking, so could we just prohibit the Ready event from being emitted by the init method? In this case we would document that a resolving init function always triggers the automatic dispatch of a ready event. Emitting it yourself during the init could the be considered a bug. And we could relatively simple (hope to not oversee something) even filter these events out during provider initialization and log that this should not be done.

I think this way we could still have the automatic event and prevent the double events. But I might be overseeing something.

kinyoklion commented 1 week ago

Having this data on the metadata feels pretty complicated to me @toddbaert @kinyoklion.

The initialization function is always async in JS and in other languages often too, or it is considered sync and blocking, so could we just prohibit the Ready event from being emitted by the init method? In this case we would document that a resolving init function always triggers the automatic dispatch of a ready event. Emitting it yourself during the init could the be considered a bug. And we could relatively simple (hope to not oversee something) even filter these events out during provider initialization and log that this should not be done.

I think this way we could still have the automatic event and prevent the double events. But I might be overseeing something.

The problem is that the underlying means of emitting the events is typically independent of knowing when initialization is complete for the underlying SDK, and that many languages have not had good async support for as long as things like JavaScript (or at all).

So you have blocking or non-blocking init (maybe by just setting a short timeout), and then you have a means to listen to events. In which case you have to use those events to orchestrate a return from the init method using threading events or other means (sometimes it can just be a task that gets resolved).

The python one I was using in the original example: https://github.com/launchdarkly/openfeature-python-server/blob/243a8dc515bcd16fe8c7d5a0913fbdfe6e073aa0/ld_openfeature/provider.py#L46

Another, but Java: https://github.com/launchdarkly/openfeature-java-server/blob/c6b2edff3f1b33cda0117aadde7aa52c67b61dbc/src/main/java/com/launchdarkly/openfeature/serverprovider/Provider.java#L140

It isn't just this behavior, but a number of things combine to make the SDK initialization much more complex than seems reasonable.

Emitting it yourself during the init could the be considered a bug.

When the events being emitted by the underlying SDK are threaded it is a bit tricky to say what happens during init. Which is demonstrated in the approach I took in the Java version.

Some of these things can be addressed by extending the functionality of the underlying vendor SDK. This will probably be what happens with the LaunchDarkly SDKs if the interface remains this way, but it does create additional barriers to adoption.

lukas-reining commented 5 days ago

So you have blocking or non-blocking init (maybe by just setting a short timeout), and then you have a means to listen to events. In which case you have to use those events to orchestrate a return from the init method using threading events or other means (sometimes it can just be a task that gets resolved).

Sure, but isn't this the typical way of handling it? Many SDKs have an async or blocking init function. For these, implementing an init function that only resolves when ready or blocks until it is ready, is trivial. If they only provide events, handlers can be attached before initialization and then the first "READY like" event can be used to resolve/unblock the init function. Which I would say is what you show in the examples.

Emitting it yourself during the init could the be considered a bug.

Then the events being emitted by the underlying SDK are threaded it is a bit tricky to say what happens during init. Which is demonstrated in the approach I took in the Java version.

If I am not overseeing anything this can be simply handled by only listening to the first READY event of the SDK. Every event after that could be replayed to the OpenFeatureEventEmitter of the respective provider.

I see the point about errors, but for most languages this might be handled by the OF SDK handling any error returned/thrown by the provider init method. This error can then be added to the ERROR event that a the SDK emits. Probably I am overseeing something but I think this is all relatively easy to handle without the complexity of more options and "behaviours".

Do you have an example where these do not work @kinyoklion?

kinyoklion commented 5 days ago

Everything can be made to work. So I have no examples where it doesn't work. My problem is with the level of complexity required. If generally this is considered an acceptable level of complexity, then there isn't much to add. (Basically I updated 3 of these in a row, and each one was very fiddly.)

I am having some trouble finding things to compare to because the majority of providers seem to not support init and shutdown or not support events or not support either, or they have not been updated to the latest SDKs. (Or they are JS SDKs which is much less of an issue, though I still need to update our node provider because I have not added runtime provider status, but it is very straightforward in a single-threaded environment.)

I do feel though that there are going to be a number of bugs for providers that do implement each of these functionalities. Because the most intuitive way to implement them, which is hook up an event handler and wait for initialization, isn't correct. In that it works fine, but does result in duplicate events.

lukas-reining commented 4 days ago

Everything can be made to work. So I have no examples where it doesn't work. My problem is with the level of complexity required. If generally this is considered an acceptable level of complexity, then there isn't much to add. (Basically I updated 3 of these in a row, and each one was very fiddly.)

I have seen in the examples you showed that these are also having the internal provider state still. In my view, having the "client managed" state and only this reacting to the init function and events after it is relatively low effort. The providers you showed are doing basically what I have seen in many applications using the SDKs directly.

Or they are JS SDKs which is much less of an issue, though I still need to update our node provider because I have not added runtime provider status, but it is very straightforward in a single-threaded environment.

I am having a hard time with this. JS is still concurrent and depending on what happens in the background, events might be emitted out of order at any time. If it really is a multi-threaded lang and things are happening at the same time we would have to protect it with a mutex or so, but I do not see a real difference between the concurrent and parallel case for this. But I could totally oversee anything so please correct me if I am overseeing something.

Because the most intuitive way to implement them, which is hook up an event handler and wait for initialization, isn't correct. In that it works fine, but does result in duplicate events.

READY events emitted by the provider before initialization has finished could simply be ignored by the SDK. I think this would be pretty solid.

I fear that if we have those options for opting out of automatic events, there will be many errors because it is just so easy to oversee.

kinyoklion commented 4 days ago

I am having a hard time with this. JS is still concurrent and depending on what happens in the background, events might be emitted out of order at any time.

In a concurrent language you can check a state and register a handler and there is no race condition as long as you do not await as a micro-task cannot be executed between those. In a multi-threaded system they can, and the state that needs locked is internal to the underlying SDK. The event dispatch itself may be many threads.

Which is why you have to introduce additional synchronization.

In regards to de-duplicating the event may be in a thread that has not gotten scheduled. So it actually emits after init. (Assuming the provider didn't use the event to complete the init function.)

So it may be a more general de-duplicate ready when already ready.

kinyoklion commented 4 days ago

The providers you showed are doing basically what I have seen in many applications using the SDKs directly.

The correct way to use the LD Java SDK directly is this:

 final LDClient client = new LDClient(SDK_KEY, config);
boolean flagValue = client.boolVariation(FEATURE_FLAG_KEY, context, false);

It has a blocking constructor that waits up to 10 seconds by default. You can make it longer if you want it to be longer. You can also independently track the current status, in case you need to know when it encounters problems. There is no need to use the status to wait for initialization. If you want non-blocking initialization you can set the time to wait to 0, and then handle things on flag changes or status changes.

I think I am going to change the LaunchDarkly providers back to just using the blocking constructor. I don't think it truly matches the spirit of the setProvider/setProviderAndAwait methods, but I know it is safe and predictable and keeps provider simplicity low.

lukas-reining commented 3 days ago

In a concurrent language you can check a state and register a handler and there is no race condition as long as you do not await as a micro-task cannot be executed between those.

But the point is that it is most likely that there will be an await happening so race conditions also have to be expected here.

In a multi-threaded system they can, and the state that needs locked is internal to the underlying SDK. The event dispatch itself may be many threads. Which is why you have to introduce additional synchronization.

Yes, but to me it would be fine if the SDK implementation had a critical path between receiving the first READY event and setting the provider state in the client, which can be synced with a mutex or so. I think shifting the responsibility to the SDK mitigates the problem in this case.

In regards to de-duplicating the event may be in a thread that has not gotten scheduled. So it actually emits after init. (Assuming the provider didn't use the event to complete the init function.)

This is a really good point. Maybe this can not really be solved elegantly. So we really have a problem if an SDK has a blocking/awaited async init method that is simply called in the provider init, but also provides events.

One potentially could in these cases always filter out READY events that happened before the init finished AND if none happened until the provider init finished, the first ready event of the provider regardless of the provider status. But maybe there is also an edge case I am overseeing. What do you think @kinyoklion?

If we can not solve it in i good way, I guess having the metadata options seems okay. I just don't like having flags changing provider and SDK providers that will definitely be confused or set wrong and also cause bugs. Maybe in this case we could even consider not having automatic events, but this brings whole new challenges.

I think I am going to change the LaunchDarkly providers back to just using the blocking constructor. I don't think it truly matches the spirit of the setProvider/setProviderAndAwait methods, but I know it is safe and predictable and keeps provider simplicity low.

Because listening to events is not reliable or what is the reasoning?

kinyoklion commented 1 day ago

Because listening to events is not reliable or what is the reasoning?

Regardless what the correct solution is here I don't think that the current solution has the correct level of complexity and safety. So I am considering using what I know is the safest path for someone consuming the LD SDK, which is also the simplest way of using it. (Safety here mainly being, initialization occurs within some known bounded time. Secondarily has logical events.)

Though I don't know that I can actually make this change upon review. I recently released 1.0 versions of our providers after adding init and shutdown. I would consider changing this a breaking behavior (In most cases the SDK is initialized by the time it hits the initialize method, but sometimes it wouldn't be, and in that case the behavior would be different).

To re-iterate, I think that having an option is one solution to this. But the problem I think is multiple owners of the same data and responsibilities. To which there could be other solutions.

The option allows it to be either fully controlled by the provider or fully controlled by the OF SDK (for simple providers). If it was always fully controlled by the OF SDK always, even for runtime changes, that would be fine as well. In that case the provider would be informing the SDK of changes in state and the SDK would decide which ones to emit based on the current state. Which, as it is also tracking the state, it has something to lock and compare. It also requires an ordering guarantee of the underlying SDK events, which most SDKs should have or it would make them hard to use.