open-feature / spec

OpenFeature specification
https://openfeature.dev
Apache License 2.0
588 stars 35 forks source link

Proposal to move `Provider Status` field from provider to SDK, refine semantics around ContextChange events #238

Closed toddbaert closed 4 months ago

toddbaert commented 5 months ago

Background

Currently, provider authors are required to maintain a field indicating their state: (NOT_READY/READY/STALE/ERROR). This burdens provider authors in a few ways that are not particularly ergonomic, and creates potential foot-guns. This is particularly concerning for SDKs which rely heavily on events (React, for example); it's quite possible for the "suspense" functionality to break and suspend indefinitely if a provider fails to correctly set their READY state after they are initialized. Additionally, we may need additional states/events to support emerging client use-cases.

Problem 1: State-management at Initialization

The SDK itself calls initialize() on any provider in state NOT_READY. The SDK then emits an READY or ERROR event based on the successful/unsuccessful termination of initialize(). However, it's the provider's responsibility to then set the state correctly. This results in confusion, and frequently provider authors neglect to properly set the providers status (this is often caught in PRs).

Problem 2: State-management at context change (client paradigm only)

The static context paradigm (client-side paradigm) defines a context change method that can be implemented by the provider, which performs whatever operations are needed to reconcile the evaluated flags with the new context. Similar to problem above, the burden is put on the provider author to correctly update state (as well as emit the STALE event) when this function is called: https://openfeature.dev/specification/sections/events/#condition-534. Additionally, the semantics of PROVIDER_STALE event may not be ideal or sufficient for this use case; STALE was not originally intended to indicate the providers flags needed to be resolved due to a context change.

Solution:

To resolve the both the above, the provider state should be maintained by the SDK, and become only notional to the application author, and the SDK should emit all events that are part of the provider lifecycle. This will make the provider lifecycle entirely the domain of the SDK; the provider would need only to implement the optional initialize and shutdown, and context change (client-only) methods, and the SDK should handle the rest. From a provider-author's perspective, providers would become stateless; the authors job is simply to implement the desired interfaces, while the SDK would manage the entire lifecycle as appropriate, based on the implemented methods. SDKs would run initialize/onContextChange/shutdown if they are defined, and maintain any state if necessary. Providers would only have to ensure initialize and shutdown are idempotent and re-entrant. For the onContextChange, the SDK would also emit a new event (PROVIDER_CONTEXT_CHANGE_PENDING?) when the context is updated as well. See the example below:

Current example of a compliant onContextChange:

  async onContextChange(oldContext: EvaluationContext, newContext: EvaluationContext): Promise<void> {
    this.status = ProviderStatus.STALE; // author must manually emit STALE
    this.events.emit(ProviderEvents.Stale); // author must manually set state
    await someAsyncWorkToUpdateFlags();
    this.status = ProviderStatus.READY; // semantics here are not well-defined
  }

Proposed simplification:

  // SDK will have already emit a PROVIDER_CONTEXT_CHANGE_PENDING event
  async onContextChange(oldContext: EvaluationContext, newContext: EvaluationContext): Promise<void> {
    await someAsyncWorkToUpdateFlags();
    // when this method returns/resolves, SDK emits PROVIDER_CONTEXT_CHANGED
  }

Considerations

Next steps

I've done a PoC in JS with this, which you can see here.

If we agree on this path forward, we can remove this requirement from the spec, which would entail the update and removal of a few points. Follow-up implementation in SDKs should be relatively simple and non-breaking.

@lukas-reining @kinyoklion @thomaspoignant @beeme1mr @liran2000 @luizgribeiro @jonathannorris @fabriziodemaria @vahidlazio if this sounds compelling I will open a spec PR.

kinyoklion commented 5 months ago

I think, reading this and the spec as it currently is, that stale may have morphed as single-context was added.

https://openfeature.dev/specification/types#provider-status

The provider's cached state is no longer valid and may not be up-to-date with the source of truth.

Which reads to me as different than a context transition for a single-context paradigm.

In most LaunchDarkly SDKs we have a data source status, which is largely similar to the provider state currently. Independent of that we have ready/initialized.

So, in the case where we were not getting updates, the stream was interrupted, I was considering that stale. Example: https://github.com/launchdarkly/openfeature-java-server/blob/86eb1e2d1b47f80ac1016f06c6edd868d2fff49b/src/main/java/com/launchdarkly/openfeature/serverprovider/Provider.java#L154

My general feelings are this:

  1. I am for the provider not exposing a state in the same way that it was.
  2. I have concerns that we are not exposing the correct actionable information currently or with this proposal.

Some situations: "I want to know that this provider is ready to use (has initialized)?" - Seems good, and this simplifies the implementation. "I want to know if the flag values are up to date?" - Potential loss of granularity here. "I want to know if an unrecoverable error has been encountered?" - Not sure.

Some other examples of stale versus fresh.

  1. Vendor SDK loaded cached values from device storage, but has not yet loaded fresh values. This can happen on an SDK startup, but maybe also changing contexts in single-context, because maybe the SDK keeps some number of cached contexts.
  2. Vendor SDK disconnects entering background in a mobile app and then reconnects when the app enters the foreground. This results in things temporarily being stale, but not an error.
  3. Vendor SDK has detected the device is offline and disconnects until it detects it is offline. Flags are again temporarily stale, but not because of an error.
toddbaert commented 5 months ago
  1. I am for the provider not exposing a state in the same way that it was.

  2. I have concerns that we are not exposing the correct actionable information currently or with this proposal.

I see your 2nd point, and I was also a bit concerned about the unclear semantics of STALE.

In the case of the react SDK (and likely any "front-end" SDKs) it's important to have some event that essentially means "context has changed, and my flags are being re-evaluated". During this time, it's desirable to have components "suspend", so that if application authors desire it, loading indicators can be shown while the new flag values are fetched (this is described in 5.3.4.x). Perhaps we're overloading STALE for this purpose. Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?

thomaspoignant commented 5 months ago

It would simplify the status management if we could manage that in the SDK rather than in the provider. It was a source for me to be lost when trying to write it, especially without waiting for the initialize response.

I think this would simplify when writing the providers.

lukas-reining commented 5 months ago

I like the proposal. I totally agree with your summary and remember some confusion around this when implementing it in the SDK.

Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?

This sounds like a good thing to have (even though this time might most times be pretty short).

I think this would simplify when writing the providers.

Totally think so too. This will be better than having complicated state handling in providers that is always "boilerplate".

None of the points you mentioned is a blocker for me here @toddbaert and I see the value so I would go with this change.

kinyoklion commented 5 months ago

Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?

Something like

flowchart
    NOT_READY -->|initialize| READY
    READY -->|shutdown| NOT_READY
    READY -->|flags out of date| STALE
    STALE -->|flags updated| READY
    READY -->|context changed| CONTEXT_CHANGE
    CONTEXT_CHANGE -->|flags updated| READY
    READY -->|some error condition| ERROR
    ERROR -->|condition clears| READY
    STALE -->|context changed| CONTEXT_CHANGE
    CONTEXT_CHANGE -->|fail to update flags| ERROR

I like it.

I guess then, potentially out of scope of this specific conversation, remains the terminal error case.

toddbaert commented 5 months ago

Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?

Something like

flowchart
    NOT_READY -->|initialize| READY
    READY -->|shutdown| NOT_READY
    READY -->|flags out of date| STALE
    STALE -->|flags updated| READY
    READY -->|context changed| CONTEXT_CHANGE
    CONTEXT_CHANGE -->|flags updated| READY
    READY -->|some error condition| ERROR
    ERROR -->|condition clears| READY
    STALE -->|context changed| CONTEXT_CHANGE
    CONTEXT_CHANGE -->|fail to update flags| ERROR

I like it.

Ya, I think this is how things would work notionally, though I'd hope that provider-authors and especially application-authors would have to worry less about understanding these transitions if the SDK maintains the necessary state.

One more thing probably worth mentioning is that CONTEXT_CHANGE is not a state, but an event. Below is (I believe) an equivalent state diagram illustrating the same flow, which denotes the events associated with state changes. Note it adds a new event as mentioned above, PROVIDER_CONTEXT_CHANGE_PENDING (this would correspond to the event which activates the suspense in the React SDK, in the place of PROVIDER_STALE).

---
title: Provider context reconciliation 
---
stateDiagram-v2
    direction LR
    NOT_READY --> READY:emit(PROVIDER_READY)
    READY --> NOT_READY
    READY --> STALE:emit(PROVIDER_STALE)
    STALE --> READY:emit(PROVIDER_CONFIGURATION_CHANGED)
    READY --> CONTEXT_CHANGE_PENDING:emit(PROVIDER_CONTEXT_CHANGE_PENDING)
    CONTEXT_CHANGE_PENDING --> READY:emit(PROVIDER_CONTEXT_CHANGED)
    READY --> ERROR:emit(PROVIDER_ERROR)
    ERROR --> READY:emit(PROVIDER_READY)
    STALE --> READY:emit(PROVIDER_CONTEXT_CHANGED)

I guess then, potentially out of scope of this specific conversation, remains the terminal error case.

I wonder if NOT_READY is sufficient? I'd hope we can avoid a terminal case if we fully init again.

kinyoklion commented 5 months ago

I wonder if NOT_READY is sufficient? I'd hope we can avoid a terminal case if we fully init again.

Would it go ERROR -> NOT_READY?

Or would it just specifically go to NOT_READY. What does it emit to get there?

toddbaert commented 5 months ago

I wonder if NOT_READY is sufficient? I'd hope we can avoid a terminal case if we fully init again.

Would it go ERROR -> NOT_READY?

Or would it just specifically go to NOT_READY. What does it emit to get there?

Thinking about it more, I think NOT_READY isn't right. I think NOT_READY makes sense for the state after a shutdown(), but not for terminal errors. I think I'm struggling to identify an example of a terminal error that would be applicable to a large set of providers. A config supplied in the provider constructor (like a credential or a URL) might not work for some time, but may work later (maybe a host or a DNS entry needs to be provisioned). Is there a good reason to discriminate between transient and terminal ERRORS here?

kinyoklion commented 5 months ago

Is there a good reason to discriminate between transient and terminal ERRORS here?

The simplest, and likely widely applicable, is that you rotate the credential. When that happens there isn't a reason for an SDK to continue to attempt requests with a bad credential.

For LD we have a few more, but they may not be as applicable.

toddbaert commented 5 months ago

Is there a good reason to discriminate between transient and terminal ERRORS here?

The simplest, and likely widely applicable, is that you rotate the credential. When that happens there isn't a reason for an SDK to continue to attempt requests with a bad credential.

For LD we have a few more, but they may not be as applicable.

I see. In that case, I'd expect an error is thrown during initialize(). That would result in error handlers being run, and (if the asynchronous provider-mutator is used, an error being thrown there). If we make the provider state notional as this issue proposes, and keep it abstracted away from the application author/integrator, then I think the material difference if we add specific support for terminal errors would be that we could add a new event type (perhaps PROVIDER_FATAL) which would signal a terminal error.

kinyoklion commented 5 months ago

Outside of initialize this happens:

1.) Initialize your SDK and it works for some time. 2.) You rotate your credential. 3.) It disconnects and attempts to reconnect learning the credential is not valid.

Or polling would be affected the same. It is outside initialize time. During initialization specifically it makes sense to handle it a little differently I agree.

PROVIDER_FATAL makes sense to me.

toddbaert commented 5 months ago

@kinyoklion That makes sense, thanks. OK I'll wait for some more feedback, but I think adding something equivalent to PROVIDER_FATAL makes sense. I'll include that in my spec PR if there's a consensus to open one.

liran2000 commented 5 months ago

In general looks good to me, thanks for this. Will have to see all the corner cases on providers implementation.

kinyoklion commented 5 months ago

Wondering, is it possible that there are providers which do not support re-initialization after shutdown ? in this case on the flow of init -> shutdown -> init, the second init call is not idempotent, since afterwards the provider is not initialized and ready.

So it is the case for LaunchDarkly providers currently, our SDKs are init once, and when you shut one down it is done. What this is going to mean for us from a provider level though is that we have to re-structure such that the relationship between the vendor SDK and the provider isn't 1:1 as we initially developed. We will have to make new vendor SDK instances when we encounter that situation.

If your vendor SDK was a singleton, and it also had this single direction behavior, then it could be somewhat harder to overcome that requirement.

toddbaert commented 5 months ago

In general looks good to me, thanks for this. Will have to see all the corner cases on providers implementation.

  • According to spec Requirement 2.4.5: "The provider SHOULD indicate an error if flag resolution is attempted before the provider is ready." currently, the state handling in providers that is always "boilerplate" is commonly at the provider. Example provider code:
if (ProviderState.NOT_READY.equals(state)) {
    throw new ProviderNotReadyError(PROVIDER_NOT_YET_INITIALIZED);
}

So to clarify, with this proposal, this check can move from the provider to the sdk.

Yes. In general, this proposal makes things easier for providers. This is another good example.

  • "Providers would only have to ensure initialize and shutdown are idempotent and re-entrant." Wondering, is it possible that there are providers which do not support re-initialization after shutdown ? in this case on the flow of init -> shutdown -> init, the second init call is not idempotent, since afterwards the provider is not initialized and ready.

I think we can encourage the idempotency in non-normative sections, and talk about that exception. I was mainly concerned about duplicated init calls. This is because since the state is no longer on the provider itself, it's possible the same provider instance might be set more than once, and therefore init could execute more than once. We could make sure the SDKs handle this situation though, by keeping track of which provider refs have already been initialized.

  • I would maybe update the proposal/follow-up PR title to something like "move Provider Status field from provider to sdk" to reduce confusion, initially without reading the description, I though the proposal meant to remove the state entirely.

Done, good idea.

Kavindu-Dodan commented 5 months ago

From a provider-author's perspective, providers would become stateless

As others had stated, this should benefit provider authors as this simplifies eventing implementation. From SDK view, we already maintain provider references, so storing the last known provider state should be straightforward. We may also keep 5.3.3 support.

re: init shutdown behavior & terminal errors

Even now SDK implementations avoid re-initialization based on provider state. So once state management get transferred to SDK, I think we could enforce idempotency on init easily through SDK.

However, I think we need to think more about shutdown behavior. Right now SDK shutdown propagates the shutdown to all registered providers. Along with that, I expect SDK internals to clear up making states and event handlers reset. So if a provider in shutdown state get registered, SDK won't be able to know its state. And I expect the provider to error when initialization is attempted 🤔

If we cannot differentiate terminal errors and errors (provider_error event), I don't think we need to introduce a new event.

toddbaert commented 4 months ago

As others had stated, this should benefit provider authors as this simplifies eventing implementation. From SDK view, we already maintain provider references, so storing the last known provider state should be straightforward. We may also keep 5.3.3 support.

Ya, I think agree. I think we can keep the 5.3.3 behavior, and this will make the change smaller and less disruptive. We'll just have to make the idea of provider state clear to application author (but this is already the case with 5.3.3).

If we cannot differentiate terminal errors and errors (provider_error event), I don't think we need to introduce a new event.

I'm starting to think that it might be best to add some kind of additional field to provider errors that indicates if they are fatal or not. This has the benefit of compatibility with what @Kavindu-Dodan described above, and it also resolves a concern I have where application authors might attach error handlers, thinking that will cover all error cases, and fail to attach fatal handlers.

Kavindu-Dodan commented 4 months ago

We'll just have to make the idea of provider state clear to application author

Yes, I think OF API should expose provider state. OF.getStatus() for default provider and then OF.getStatus(<NAME>) for scoped/named providers

application authors might attach error handlers, thinking that will cover all error cases, and fail to attach fatal handlers.

Agree. Unrecoverable errors could be built into the current error definition. We just need to figure out how

toddbaert commented 4 months ago

We'll just have to make the idea of provider state clear to application author

Yes, I think OF API should expose provider state. OF.getStatus() for default provider and then OF.getStatus(<NAME>) for scoped/named providers

We could do this, but honestly, I think it's more important to expose client.getProviderStatus() which is already associated with a particular provider.

Agree. Unrecoverable errors could be built into the current error definition. We just need to figure out how

:+1: I have proposed a new error code for this.

toddbaert commented 4 months ago

I've opened https://github.com/open-feature/spec/pull/241

jonathannorris commented 4 months ago

I'm in support of these changes as well.

One question I had thinking through this in the context of our web provider is recently, while implementing support for the new React SDK, we had to fix our implementation of PROVIDER_CONFIGURATION_CHANGED events in our provider to inform the React SDK when changes are received from our SSE real-time updates system.

However, our real-time update system is a two-step process, where the SSE event is received by our SDK, which then triggers an API call to fetch updated flag values. In this new status model, my understanding is that our SDK should emit a PROVIDER_STALE event once we receive a SSE event that our configuration is stale and a PROVIDER_CONFIGURATION_CHANGED event once flag values have been updated.

toddbaert commented 4 months ago

I'm in support of these changes as well.

One question I had thinking through this in the context of our web provider is recently, while implementing support for the new React SDK, we had to fix our implementation of PROVIDER_CONFIGURATION_CHANGED events in our provider to inform the React SDK when changes are received from our SSE real-time updates system.

However, our real-time update system is a two-step process, where the SSE event is received by our SDK, which then triggers an API call to fetch updated flag values. In this new status model, my understanding is that our SDK should emit a PROVIDER_STALE event once we receive a SSE event that our configuration is stale and a PROVIDER_CONFIGURATION_CHANGED event once flag values have been updated.

Yes I think that's in line with that's proposed.