khepin opened this issue 3 years ago
In the current implementation, there is no way for the Relay Proxy to detect reliably that Redis has restarted.
It can detect that Redis is unavailable only at a time when it's actually trying to do something with Redis: if it receives a flag update from LaunchDarkly and tries to save it, or if it gets a request for flags from an SDK after the cache has expired and has to read from Redis, and Redis is unavailable at that moment, it registers a database error. In that case it enters a "database is unavailable" state, keeps checking until the database is available again, and then re-saves all the flags.
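Roughly, that recovery path looks something like the sketch below. The type and function names here are made up for illustration only, not the actual Relay internals; the point is that this path only ever starts after an operation has already failed.

package relaysketch

import "time"

// flagStore is a stand-in for whatever interface Relay uses to talk to Redis;
// these names are hypothetical, purely for illustration.
type flagStore interface {
	Available() bool                        // e.g. a PING succeeds
	WriteAll(flags map[string][]byte) error // overwrite all stored flag data
}

// recoverAfterError is only entered after a read or write has already failed,
// which is why a Redis restart during an idle period goes unnoticed.
func recoverAfterError(store flagStore, lastKnownFlags map[string][]byte) error {
	for !store.Available() {
		time.Sleep(time.Second) // keep checking until the database is back
	}
	return store.WriteAll(lastKnownFlags) // then re-save everything
}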
But if the Redis connection goes away briefly during a time when it wasn't active anyway, the Relay Proxy can't detect that: the Redis client doesn't (as far as I know) make such interruptions visible to the application; it handles them transparently. There is no signal that says "the database was available last time you checked, and it is available now, but all the data was lost in between."
In any case, the current situation is that if you know you are going to restart Redis and lose your data, you need to also restart the Relay Proxy.
Are there any plans (or avenues to be explored) to move away from the current situation towards something more reliable?
This is as far as I know the first time the issue has been raised, so there haven't been any plans up till now and it's too soon for me to comment on how we might handle this. Again, at present it's not clear to what extent the Redis client library allows us to even detect this condition, so I'm not sure if this is an issue that's specific to the Relay Proxy or a more general issue inherent in using Redis without a persistent backing store.
From an initial look at the code, I believe you're using the connection pool for Redis, which indeed would hide connection failures. That package also offers the ability to make a direct connection, over which a periodic healthcheck could be sent to check whether the connection is still alive.
The following code, for example, detects connection errors by sending a periodic PING. If I then restart Redis, it'll detect an EOF on the connection on the next attempt.
package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

func main() {
	// Block until the healthcheck goroutine reports a connection failure.
	err := <-signalOnConnectionFailure()
	fmt.Println(err)
}

// signalOnConnectionFailure opens a direct (non-pooled) connection to Redis
// and sends a PING every 30 seconds. The first error encountered is delivered
// on the returned channel, which is then closed.
func signalOnConnectionFailure() <-chan error {
	healthchecks := make(chan error)
	go func() {
		defer close(healthchecks)
		c, err := redis.Dial("tcp", "redis:6379")
		if err != nil {
			healthchecks <- err
			return
		}
		defer c.Close()
		for {
			time.Sleep(30 * time.Second)
			if _, err := c.Do("PING"); err != nil {
				// A restarted or unreachable Redis shows up here as an
				// error (typically EOF) on the next PING.
				healthchecks <- err
				return
			}
			// connection healthy; keep polling
		}
	}()
	return healthchecks
}
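If something like this were adopted, Relay could watch that channel in the background and trigger a full re-save of its in-memory flag data whenever it fires, assuming there's a way to force that repopulation outside the normal update path.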
ld-relay could just maintain/verify a marker entry in Redis. If the entry is gone, that means Redis needs to be repopulated.
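As a rough illustration of that idea, using the same redigo client as the example above (the marker key name here is made up; whatever marker Relay already writes would be used instead):

package main

import (
	"fmt"
	"time"

	"github.com/gomodule/redigo/redis"
)

// markerKey is a hypothetical name for this sketch, not Relay's real key layout.
const markerKey = "ld:relay:initialized"

func main() {
	c, err := redis.Dial("tcp", "redis:6379")
	if err != nil {
		panic(err)
	}
	defer c.Close()

	for {
		time.Sleep(30 * time.Second)
		exists, err := redis.Bool(c.Do("EXISTS", markerKey))
		if err != nil {
			fmt.Println("redis error:", err) // connection-level failure
			continue
		}
		if !exists {
			fmt.Println("marker missing: Redis was flushed, repopulate flag data")
			// here Relay would rewrite all flags and re-create the marker
		}
	}
}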
Yeah that seems like a very valid approach as well.
Well, there already is a marker entry; it's just that up till now the data store implementation has not done any kind of constant polling of the database (except in the case I mentioned, after it has already run into a server error), only accessing it when there is an immediate need to read or write data. We were trying to avoid creating more server traffic than was strictly necessary for applications to use LD.
We may need to re-evaluate that, but we'll need to do it carefully - there are some design decisions to be made. This whole thing is not just part of ld-relay - the entire data store implementation comes from the LD Go SDK, which Relay uses. In applications other than Relay there can be different concerns; for instance, a Go application might only be reading flags from the database that were put there by Relay and should never populate the database itself. So we would need to build in some extra context-awareness that there isn't a place for in the current internal API. There's also the question of how much of this should be specific to Redis versus other data store integrations - Redis is somewhat unusual in not necessarily being a persistent store (although many of our customers do use it with persistent storage enabled), and with other databases like DynamoDB, where even a small query is subject to throttling, it would generally be undesirable to poll in this way.
So, for now I can only say that this is a thing we're considering.
@christianladam Yes, that's the marker entry I mentioned, but due to the current internal design of the Go SDK we can't quite do what you're suggesting - Relay does not currently have direct access to the object that's implemented in redis_impl.go; it sees only a higher layer that adds caching around those queries, and caching would defeat the purpose of polling in this case. So, again, whatever we do would require some internal API rethinking.
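To illustrate why the caching layer gets in the way, here is a hypothetical sketch (not the actual SDK code): within the cache TTL the wrapper answers from memory, so a marker that has vanished from Redis still looks present.

package cachesketch

import "time"

// markerChecker stands in for the lower-level store (the redis_impl.go layer);
// the names are made up for illustration.
type markerChecker interface {
	MarkerExists() (bool, error)
}

// cachedStore mimics the higher caching layer that Relay actually talks to.
type cachedStore struct {
	underlying markerChecker
	cachedOK   bool
	cachedAt   time.Time
	ttl        time.Duration
}

func (c *cachedStore) MarkerExists() (bool, error) {
	if time.Since(c.cachedAt) < c.ttl {
		// Stale answer: Redis may have restarted and lost the marker,
		// but within the TTL the cache still reports it as present.
		return c.cachedOK, nil
	}
	ok, err := c.underlying.MarkerExists()
	if err == nil {
		c.cachedOK, c.cachedAt = ok, time.Now()
	}
	return ok, err
}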
I deleted my comment before your response, but yeah, there's additional work required on the relay side itself to expose this, since you definitely don't want the LDClients outside the relay messing with the data population, as you already mentioned.
you definitely don't want the LDClients outside the relay messing with the data population, as you already mentioned
It's not quite as simple as that either. An SDK instance that's configured to use Redis could be using it in either of two ways: in a read-only way, where it expects that Relay (or some equivalent process) is responsible for populating the database, or as a backing store that belongs to the SDK (i.e. much the same way Relay would use it: the app puts flags in the database, so if the app has to restart and can't reach LD right away, the last known data is in there). In the latter case it's not read-only, and we would probably want the SDK to have the same behavior that's been suggested here. Currently the SDK does have some ability to know which of those situations it's in, but that knowledge isn't quite available at the level where it would need to be for implementing this.
An update - Recently the SDK team has been looking at the persistent store design, and we will keep this use case in mind.
Describe the bug
We use LD and LD Relay for PHP apps, where the recommended approach is to have LD Relay persist the data in Redis and have PHP connect to Redis.
When the Redis instance is restarted, it loses data and the PHP apps will then return false for every evaluation. After Redis has been restarted, LD Relay does not repopulate the data in Redis, and we had our flags all return false for 1 hour.
A restart of LD relay fixed the issue.
To reproduce
Expected behavior
Logs
No useful logs were found.
SDK version
Language version, developer tools
PHP 7.4
OS/platform
Additional context
Our cache TTL for Redis was set at 30s, but we were missing flag definitions for almost an hour until we rebooted LD Relay, which fixed the issue.