Closed toddbaert closed 4 months ago
I think, reading this and the spec as it currently is, that stale may have morphed as single-context was added.
https://openfeature.dev/specification/types#provider-status
The provider's cached state is no longer valid and may not be up-to-date with the source of truth.
Which reads to me as different than a context transition for a single-context paradigm.
In most LaunchDarkly SDKs we have a data source status, which is largely similar to the provider state currently. Independent of that we have ready/initialized
.
So, in the case where we were not getting updates, the stream was interrupted, I was considering that stale. Example: https://github.com/launchdarkly/openfeature-java-server/blob/86eb1e2d1b47f80ac1016f06c6edd868d2fff49b/src/main/java/com/launchdarkly/openfeature/serverprovider/Provider.java#L154
My general feelings are this:
Some situations: "I want to know that this provider is ready to use (has initialized)?" - Seems good, and this simplifies the implementation. "I want to know if the flag values are up to date?" - Potential loss of granularity here. "I want to know if an unrecoverable error has been encountered?" - Not sure.
Some other examples of stale versus fresh.
I am for the provider not exposing a state in the same way that it was.
I have concerns that we are not exposing the correct actionable information currently or with this proposal.
I see your 2nd point, and I was also a bit concerned about the unclear semantics of STALE.
In the case of the react SDK (and likely any "front-end" SDKs) it's important to have some event that essentially means "context has changed, and my flags are being re-evaluated". During this time, it's desirable to have components "suspend", so that if application authors desire it, loading indicators can be shown while the new flag values are fetched (this is described in 5.3.4.x). Perhaps we're overloading STALE for this purpose. Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED
? Something that indicates a context change is pending?
It would simplify the status management if we could manage that in the SDK rather than in the provider. It was a source for me to be lost when trying to write it, especially without waiting for the initialize response.
I think this would simplify when writing the providers.
I like the proposal. I totally agree with your summary and remember some confusion around this when implementing it in the SDK.
Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?
This sounds like a good thing to have (even though this time might most times be pretty short).
I think this would simplify when writing the providers.
Totally think so too. This will be better than having complicated state handling in providers that is always "boilerplate".
None of the points you mentioned is a blocker for me here @toddbaert and I see the value so I would go with this change.
Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?
Something like
flowchart
NOT_READY -->|initialize| READY
READY -->|shutdown| NOT_READY
READY -->|flags out of date| STALE
STALE -->|flags updated| READY
READY -->|context changed| CONTEXT_CHANGE
CONTEXT_CHANGE -->|flags updated| READY
READY -->|some error condition| ERROR
ERROR -->|condition clears| READY
STALE -->|context changed| CONTEXT_CHANGE
CONTEXT_CHANGE -->|fail to update flags| ERROR
I like it.
I guess then, potentially out of scope of this specific conversation, remains the terminal error case.
Might it be better to have a client-specific event type for this, in addition to CONTEXT_CHANGED? Something that indicates a context change is pending?
Something like
flowchart NOT_READY -->|initialize| READY READY -->|shutdown| NOT_READY READY -->|flags out of date| STALE STALE -->|flags updated| READY READY -->|context changed| CONTEXT_CHANGE CONTEXT_CHANGE -->|flags updated| READY READY -->|some error condition| ERROR ERROR -->|condition clears| READY STALE -->|context changed| CONTEXT_CHANGE CONTEXT_CHANGE -->|fail to update flags| ERROR
I like it.
Ya, I think this is how things would work notionally, though I'd hope that provider-authors and especially application-authors would have to worry less about understanding these transitions if the SDK maintains the necessary state.
One more thing probably worth mentioning is that CONTEXT_CHANGE
is not a state, but an event. Below is (I believe) an equivalent state diagram illustrating the same flow, which denotes the events associated with state changes. Note it adds a new event as mentioned above, PROVIDER_CONTEXT_CHANGE_PENDING
(this would correspond to the event which activates the suspense in the React SDK, in the place of PROVIDER_STALE
).
---
title: Provider context reconciliation
---
stateDiagram-v2
direction LR
NOT_READY --> READY:emit(PROVIDER_READY)
READY --> NOT_READY
READY --> STALE:emit(PROVIDER_STALE)
STALE --> READY:emit(PROVIDER_CONFIGURATION_CHANGED)
READY --> CONTEXT_CHANGE_PENDING:emit(PROVIDER_CONTEXT_CHANGE_PENDING)
CONTEXT_CHANGE_PENDING --> READY:emit(PROVIDER_CONTEXT_CHANGED)
READY --> ERROR:emit(PROVIDER_ERROR)
ERROR --> READY:emit(PROVIDER_READY)
STALE --> READY:emit(PROVIDER_CONTEXT_CHANGED)
I guess then, potentially out of scope of this specific conversation, remains the terminal error case.
I wonder if NOT_READY
is sufficient? I'd hope we can avoid a terminal case if we fully init again.
I wonder if NOT_READY is sufficient? I'd hope we can avoid a terminal case if we fully init again.
Would it go ERROR
-> NOT_READY
?
Or would it just specifically go to NOT_READY
.
What does it emit to get there?
I wonder if NOT_READY is sufficient? I'd hope we can avoid a terminal case if we fully init again.
Would it go
ERROR
->NOT_READY
?Or would it just specifically go to
NOT_READY
. What does it emit to get there?
Thinking about it more, I think NOT_READY
isn't right. I think NOT_READY
makes sense for the state after a shutdown()
, but not for terminal errors. I think I'm struggling to identify an example of a terminal error that would be applicable to a large set of providers. A config supplied in the provider constructor (like a credential or a URL) might not work for some time, but may work later (maybe a host or a DNS entry needs to be provisioned). Is there a good reason to discriminate between transient and terminal ERRORS
here?
Is there a good reason to discriminate between transient and terminal
ERRORS
here?
The simplest, and likely widely applicable, is that you rotate the credential. When that happens there isn't a reason for an SDK to continue to attempt requests with a bad credential.
For LD we have a few more, but they may not be as applicable.
Is there a good reason to discriminate between transient and terminal
ERRORS
here?The simplest, and likely widely applicable, is that you rotate the credential. When that happens there isn't a reason for an SDK to continue to attempt requests with a bad credential.
For LD we have a few more, but they may not be as applicable.
I see. In that case, I'd expect an error is thrown during initialize()
. That would result in error handlers being run, and (if the asynchronous provider-mutator is used, an error being thrown there). If we make the provider state notional as this issue proposes, and keep it abstracted away from the application author/integrator, then I think the material difference if we add specific support for terminal errors would be that we could add a new event type (perhaps PROVIDER_FATAL
) which would signal a terminal error.
Outside of initialize this happens:
1.) Initialize your SDK and it works for some time. 2.) You rotate your credential. 3.) It disconnects and attempts to reconnect learning the credential is not valid.
Or polling would be affected the same. It is outside initialize time. During initialization specifically it makes sense to handle it a little differently I agree.
PROVIDER_FATAL
makes sense to me.
@kinyoklion That makes sense, thanks. OK I'll wait for some more feedback, but I think adding something equivalent to PROVIDER_FATAL
makes sense. I'll include that in my spec PR if there's a consensus to open one.
In general looks good to me, thanks for this. Will have to see all the corner cases on providers implementation.
if (ProviderState.NOT_READY.equals(state)) {
throw new ProviderNotReadyError(PROVIDER_NOT_YET_INITIALIZED);
}
So to clarify, with this proposal, this check can move from the provider to the sdk.
Wondering, is it possible that there are providers which do not support re-initialization after shutdown ? in this case on the flow of init -> shutdown -> init, the second init call is not idempotent, since afterwards the provider is not initialized and ready.
So it is the case for LaunchDarkly providers currently, our SDKs are init once, and when you shut one down it is done. What this is going to mean for us from a provider level though is that we have to re-structure such that the relationship between the vendor SDK and the provider isn't 1:1 as we initially developed. We will have to make new vendor SDK instances when we encounter that situation.
If your vendor SDK was a singleton, and it also had this single direction behavior, then it could be somewhat harder to overcome that requirement.
In general looks good to me, thanks for this. Will have to see all the corner cases on providers implementation.
- According to spec Requirement 2.4.5: "The provider SHOULD indicate an error if flag resolution is attempted before the provider is ready." currently, the state handling in providers that is always "boilerplate" is commonly at the provider. Example provider code:
if (ProviderState.NOT_READY.equals(state)) { throw new ProviderNotReadyError(PROVIDER_NOT_YET_INITIALIZED); }
So to clarify, with this proposal, this check can move from the provider to the sdk.
Yes. In general, this proposal makes things easier for providers. This is another good example.
- "Providers would only have to ensure initialize and shutdown are idempotent and re-entrant." Wondering, is it possible that there are providers which do not support re-initialization after shutdown ? in this case on the flow of init -> shutdown -> init, the second init call is not idempotent, since afterwards the provider is not initialized and ready.
I think we can encourage the idempotency in non-normative sections, and talk about that exception. I was mainly concerned about duplicated init calls. This is because since the state is no longer on the provider itself, it's possible the same provider instance might be set more than once, and therefore init could execute more than once. We could make sure the SDKs handle this situation though, by keeping track of which provider refs have already been initialized.
- I would maybe update the proposal/follow-up PR title to something like "move Provider Status field from provider to sdk" to reduce confusion, initially without reading the description, I though the proposal meant to remove the state entirely.
Done, good idea.
From a provider-author's perspective, providers would become stateless
As others had stated, this should benefit provider authors as this simplifies eventing implementation. From SDK view, we already maintain provider references, so storing the last known provider state should be straightforward. We may also keep 5.3.3 support.
re: init shutdown behavior & terminal errors
Even now SDK implementations avoid re-initialization based on provider state. So once state management get transferred to SDK, I think we could enforce idempotency on init easily through SDK.
However, I think we need to think more about shutdown behavior. Right now SDK shutdown propagates the shutdown to all registered providers. Along with that, I expect SDK internals to clear up making states and event handlers reset. So if a provider in shutdown state get registered, SDK won't be able to know its state. And I expect the provider to error when initialization is attempted 🤔
If we cannot differentiate terminal errors and errors (provider_error event), I don't think we need to introduce a new event.
As others had stated, this should benefit provider authors as this simplifies eventing implementation. From SDK view, we already maintain provider references, so storing the last known provider state should be straightforward. We may also keep 5.3.3 support.
Ya, I think agree. I think we can keep the 5.3.3 behavior, and this will make the change smaller and less disruptive. We'll just have to make the idea of provider state
clear to application author (but this is already the case with 5.3.3).
If we cannot differentiate terminal errors and errors (provider_error event), I don't think we need to introduce a new event.
I'm starting to think that it might be best to add some kind of additional field to provider errors that indicates if they are fatal or not. This has the benefit of compatibility with what @Kavindu-Dodan described above, and it also resolves a concern I have where application authors might attach error
handlers, thinking that will cover all error cases, and fail to attach fatal
handlers.
We'll just have to make the idea of provider state clear to application author
Yes, I think OF API should expose provider state. OF.getStatus()
for default provider and then OF.getStatus(<NAME>)
for scoped/named providers
application authors might attach error handlers, thinking that will cover all error cases, and fail to attach fatal handlers.
Agree. Unrecoverable errors could be built into the current error definition. We just need to figure out how
We'll just have to make the idea of provider state clear to application author
Yes, I think OF API should expose provider state.
OF.getStatus()
for default provider and thenOF.getStatus(<NAME>)
for scoped/named providers
We could do this, but honestly, I think it's more important to expose client.getProviderStatus()
which is already associated with a particular provider.
Agree. Unrecoverable errors could be built into the current error definition. We just need to figure out how
:+1: I have proposed a new error code for this.
I've opened https://github.com/open-feature/spec/pull/241
I'm in support of these changes as well.
One question I had thinking through this in the context of our web provider is recently, while implementing support for the new React SDK, we had to fix our implementation of PROVIDER_CONFIGURATION_CHANGED
events in our provider to inform the React SDK when changes are received from our SSE real-time updates system.
However, our real-time update system is a two-step process, where the SSE event is received by our SDK, which then triggers an API call to fetch updated flag values. In this new status model, my understanding is that our SDK should emit a PROVIDER_STALE
event once we receive a SSE event that our configuration is stale and a PROVIDER_CONFIGURATION_CHANGED
event once flag values have been updated.
I'm in support of these changes as well.
One question I had thinking through this in the context of our web provider is recently, while implementing support for the new React SDK, we had to fix our implementation of
PROVIDER_CONFIGURATION_CHANGED
events in our provider to inform the React SDK when changes are received from our SSE real-time updates system.However, our real-time update system is a two-step process, where the SSE event is received by our SDK, which then triggers an API call to fetch updated flag values. In this new status model, my understanding is that our SDK should emit a
PROVIDER_STALE
event once we receive a SSE event that our configuration is stale and aPROVIDER_CONFIGURATION_CHANGED
event once flag values have been updated.
Yes I think that's in line with that's proposed.
Background
Currently, provider authors are required to maintain a field indicating their state: (
NOT_READY
/READY
/STALE
/ERROR
). This burdens provider authors in a few ways that are not particularly ergonomic, and creates potential foot-guns. This is particularly concerning for SDKs which rely heavily on events (React, for example); it's quite possible for the "suspense" functionality to break and suspend indefinitely if a provider fails to correctly set their READY state after they are initialized. Additionally, we may need additional states/events to support emerging client use-cases.Problem 1: State-management at Initialization
The SDK itself calls
initialize()
on any provider in stateNOT_READY
. The SDK then emits anREADY
orERROR
event based on the successful/unsuccessful termination ofinitialize()
. However, it's the provider's responsibility to then set thestate
correctly. This results in confusion, and frequently provider authors neglect to properly set the providers status (this is often caught in PRs).Problem 2: State-management at context change (client paradigm only)
The static context paradigm (client-side paradigm) defines a
context change
method that can be implemented by the provider, which performs whatever operations are needed to reconcile the evaluated flags with the new context. Similar to problem above, the burden is put on the provider author to correctly update state (as well as emit the STALE event) when this function is called: https://openfeature.dev/specification/sections/events/#condition-534. Additionally, the semantics of PROVIDER_STALE event may not be ideal or sufficient for this use case; STALE was not originally intended to indicate the providers flags needed to be resolved due to a context change.Solution:
To resolve the both the above, the provider state should be maintained by the SDK, and become only notional to the application author, and the SDK should emit all events that are part of the provider lifecycle. This will make the provider lifecycle entirely the domain of the SDK; the provider would need only to implement the optional
initialize
andshutdown
, andcontext change
(client-only) methods, and the SDK should handle the rest. From a provider-author's perspective, providers would become stateless; the authors job is simply to implement the desired interfaces, while the SDK would manage the entire lifecycle as appropriate, based on the implemented methods. SDKs would run initialize/onContextChange/shutdown if they are defined, and maintain any state if necessary. Providers would only have to ensure initialize and shutdown are idempotent and re-entrant. For theonContextChange
, the SDK would also emit a new event (PROVIDER_CONTEXT_CHANGE_PENDING
?) when the context is updated as well. See the example below:Current example of a compliant onContextChange:
Proposed simplification:
Considerations
ERROR
/READY
events when they disconnect/recover; these may still be "manually" emitted from providers since they originate with state changes specific to the provider implementation, and not lifecycle changes originating from the SDK.PROVIDER_STALE
, to indicate the specific circumstance that a provider is pending an update after a context change.Next steps
I've done a PoC in JS with this, which you can see here.
If we agree on this path forward, we can remove this requirement from the spec, which would entail the update and removal of a few points. Follow-up implementation in SDKs should be relatively simple and non-breaking.
@lukas-reining @kinyoklion @thomaspoignant @beeme1mr @liran2000 @luizgribeiro @jonathannorris @fabriziodemaria @vahidlazio if this sounds compelling I will open a spec PR.