Closed: sdalezman closed this issue 4 years ago
There certainly is logic in the SDK to retry after stream connection failures, and it is not supposed to ever give up retrying (unless it gets a server response that specifically says the SDK key is invalid, which wouldn't have been the case here). There is backoff logic so the retries will be delayed by progressively longer intervals, but the retry interval should never be longer than 30 seconds. None of this logic was intentionally changed in Go SDK 5.0, and there is test coverage of retry scenarios but it's possible that the coverage has missed some edge case; we'll investigate. We do maintain long-running test app instances to catch exactly this sort of instability, and we have not seen the test apps fail to reconnect after a known service outage.
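For illustration only (this is not the SDK's actual code), the retry policy described above amounts to roughly the following: exponential backoff between attempts, a 30-second cap on the delay, and no give-up condition other than an invalid-key response.

```go
package sketch

import "time"

// reconnectLoop is a sketch of the retry policy described above, not the
// SDK's real implementation. connect blocks while the stream is healthy and
// returns an error when the connection fails; isInvalidKey reports whether
// the server said the SDK key is invalid.
func reconnectLoop(connect func() error, isInvalidKey func(error) bool) {
	delay := time.Second // assumed initial delay, for illustration only
	for {
		err := connect()
		if err == nil {
			delay = time.Second // a clean reconnect resets the backoff
			continue
		}
		if isInvalidKey(err) {
			return // the only case in which retrying is supposed to stop
		}
		time.Sleep(delay)
		delay *= 2
		if delay > 30*time.Second {
			delay = 30 * time.Second // the retry interval never exceeds 30s
		}
	}
}
```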
To your other point, about existing flag data being "wiped out of the SDK" (I assume that when you wrote "Expected behavior: existing flags are now wiped out of the SDK", that was a typo and you meant the expected behavior is that they're not wiped out)— the failure of a stream connection is not supposed to ever cause any data in the SDK to be purged, and I can't see a mechanism for it to do so. The flag data store is only modified when the SDK receives an update from the LaunchDarkly stream; if the stream isn't working, it receives no updates and the data store is unchanged.
If you're using a database such as Redis for the data store, and you've configured it with a finite cache TTL, then once the TTL expires, the flag data in the in-memory cache does get discarded— but that just means it will reread the last known data from Redis. In that scenario, if Redis were not working at that point, then it would indeed behave as if the flag didn't exist and it would return default values for evaluations. But that has nothing to do with whether the stream is active or not. So, regardless of whether you're using a database or not, the behavior you're describing is mysterious.
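For reference, a Redis-backed configuration with a finite cache TTL looks roughly like this with the v5 Go SDK; the Redis integration's package path and the exact builder method names may differ depending on your SDK version, so treat them as assumptions and check the docs before copying.

```go
package sketch

import (
	ldredis "github.com/launchdarkly/go-server-sdk-redis-redigo" // assumed integration package path
	ld "gopkg.in/launchdarkly/go-server-sdk.v5"
	"gopkg.in/launchdarkly/go-server-sdk.v5/ldcomponents"
)

// redisConfig sketches a persistent-store setup with a 30-second in-memory
// cache. Only after that TTL expires AND Redis itself is unreachable could a
// previously received flag appear to be missing.
func redisConfig() ld.Config {
	return ld.Config{
		DataStore: ldcomponents.PersistentDataStore(
			ldredis.DataStore().URL("redis://localhost:6379"), // hypothetical Redis URL
		).CacheSeconds(30),
	}
}
```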
Can I assume that there was no other ERROR- or WARN-level log output from the SDK besides the message you quoted?
Hi @eli-darkly - we're not currently using a data-store-backed engine; however, that has never caused an issue for us before. Do you recommend that we set up a Redis-backed store instead of relying on in-memory checks?
Yes, it was a typo and I meant to say they should not be wiped out.
I'm still fairly convinced this is at the SDK level rather than the LD level, because we have other versions of the SDK running that never had these issues, but we have been seeing it with v5.
There were a bunch more WARN and ERROR messages, but that seemed the most pertinent to the issue that we saw. Happy to share them directly if you want to shoot me an email - shlomo at intello.io
we're not currently using a data-store-backed engine; however, that has never caused an issue for us before. Do you recommend that we set up a Redis-backed store instead of relying on in-memory checks?
No, that's not what I'm saying.
I only meant that if you were using Redis, there is a potential situation where the SDK could forget about flag data that it had previously received— but that would only happen if you were using a finite cache TTL and it expired while the Redis server was not working. And I mentioned that only because, as far as I'm aware, that is the only situation where the SDK could ever forget any flag data.
It's certainly possible that this is an SDK bug, and as I said, we will investigate it. I'm just trying to clarify that this would indeed be a bug, not how it is designed to work. And I am still having difficulty trying to imagine any way that the loss of a stream connection could cause flag data to be lost. Again, when a stream is active, any updates that it receives will be used to update the last known set of flag data— and whenever the stream is not active, it simply isn't receiving updates, so the last known flag data should remain unchanged.
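To make that concrete, the in-memory store behaves conceptually like a map that is only written when an update arrives; a dead stream means no writes, not deletions. This is a sketch of those semantics, not the SDK's actual data structures:

```go
package sketch

import "sync"

// flagStore sketches the in-memory store semantics described above; it is
// not the SDK's real implementation.
type flagStore struct {
	mu    sync.RWMutex
	flags map[string]interface{} // last known flag configurations
}

// upsert is called only when the stream delivers an update. While the stream
// is down it is simply never called, so existing entries are left untouched.
func (s *flagStore) upsert(key string, flag interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.flags == nil {
		s.flags = make(map[string]interface{})
	}
	s.flags[key] = flag
}

// get returns the last known data for a flag regardless of stream health.
func (s *flagStore) get(key string) (interface{}, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	f, ok := s.flags[key]
	return f, ok
}
```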
There were a bunch more WARN and ERROR messages, but that seemed the most pertinent to the issue that we saw
It's best to assume that all warning/error-level logging is pertinent to a bug report. You can either email me at eli@launchdarkly.com, or pursue this through our usual support process instead of on GitHub.
Generally, if diagnosing a problem is likely to require access to any potentially sensitive information about your operating environment, such as your logs and your SDK configuration— or knowing what LaunchDarkly account and environment you're using, to help us try to reproduce the issue— then that is more of a support issue. GitHub bug reports are best for things that are unambiguously an SDK code issue and that are straightforward to reproduce in any environment. It's still unclear to me whether this is, as far as you know, a reproducible issue, or if you're only assuming that it is reproducible for any stream interruption because it happened once during a stream interruption.
Is this a support request? No, this is a bug in the SDK
Describe the bug On October 29th, we saw this error from LaunchDarkly:
[LaunchDarkly] 2020/10/29 19:04:51 ERROR: LaunchDarkly data source outage - updates have been unavailable for at least 1m0s with the following errors: ERROR_RESPONSE(503) (2 times), ERROR_RESPONSE(504) (2 times), NETWORK_ERROR (1 time)
(Time is EST.) At the same time, we know there was a streaming issue with LD that day, and we stopped seeing LD flag evaluations until we restarted the service. You also updated your certificates; I'm not 100% sure which of these might have caused the error.
The error and the occasional inability to connect make sense to me; that's not my issue. However, not having the resiliency in the library to recover from this is pretty bad, given how heavily we rely on LD to serve our product.
In addition to not reporting data to LD, it almost seems like the library wiped out the ability to do real-time flag evaluation, as no flag evaluation returned true. To my understanding, the LD SDK is responsible for all evaluations at the SDK level rather than going through the LD API - this is what keeps LD from becoming a performance bottleneck.
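For reference, a minimal local evaluation with the v5 API looks roughly like the following (the flag key and user key are placeholders); if the SDK has no data for the flag, the supplied default is what comes back:

```go
package sketch

import (
	"log"

	"gopkg.in/launchdarkly/go-sdk-common.v2/lduser"
	ld "gopkg.in/launchdarkly/go-server-sdk.v5"
)

// evaluate shows a local flag evaluation; nothing here calls the LD API at
// evaluation time. The flag key and user key are placeholders.
func evaluate(client *ld.LDClient) bool {
	user := lduser.NewUser("example-user-key")
	// If the SDK has no stored data for this flag, the default value (false)
	// is returned along with an error describing why.
	value, err := client.BoolVariation("example-flag-key", user, false)
	if err != nil {
		log.Println("evaluation fell back to the default:", err)
	}
	return value
}
```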
My expectation is the following:
When the library encounters a streaming error:
Maybe I'm mistaken that this can be met, but I know we have other services that have older versions of the SDK where we have seen the same errors and they have successfully recovered. So I believe this is a bug in the current version of the SDK.
To reproduce Create a streaming issue; flag evaluation will not continue to work.
Expected behavior When the library encounters a streaming error:
Logs See above
SDK version v5.0.0
Language version, developer tools 1.13.3
OS/platform go binary from a scratch docker file
Additional context n/a