launchdarkly / java-server-sdk

LaunchDarkly Server-Side SDK for Java
Other
83 stars 56 forks source link

Add stream connection established log #261

Closed JunfeiZhang closed 1 year ago

JunfeiZhang commented 2 years ago

Is your feature request related to a problem? Please describe. Hi team, I am using launchdarkly relay and java SDK on my k8s cluster. I found massive repeating logs caused by the repeating reconnection attempts. E.g 2022/02/09 03:37:34.394381 WARN: [env: ...63c2] Error in stream connection (will retry): EOF 2022/02/09 03:37:34.394429 INFO: [env: ...63c2] Reconnecting in 0.7156 secs 2022/02/09 04:47:47.087964 WARN: [env: ...63c2] Error in stream connection (will retry): EOF 2022/02/09 04:47:47.087994 INFO: [env: ...63c2] Reconnecting in 0.5839 secs This kind of logging happens multiple times a week whenever LD server is unstable.

The issue I am facing is when I set up the central logging system (splunk) and alerts based on it, the stream connection error logs are spamming from relay and SDK. I am wondering if we can add a "Connection successful" log as a green flag so my alert can be cleared based on it.

Describe the solution you'd like A "stream connection established" message is logged from LD relay and SDK would be sufficient. I created https://github.com/launchdarkly/eventsource/pull/34 but I am not sure if that is the best place for the message.

Describe alternatives you've considered N/A

Additional context N/A

eli-darkly commented 2 years ago

I'm unclear on a couple of points:

First: you say that "the stream connection logs are spamming from relay and SDK" when "LD server is unstable". What you've posted here look like Relay Proxy logs, not Java SDK logs; it is saying that the Relay Proxy lost its connection to LaunchDarkly and had to reconnect. Based on how the Relay Proxy is normally used, I would not expect any Java SDK instances that are connected to the Relay Proxy to even notice this condition at all; the SDK has no knowledge of Relay's connection to LD, it has its own connection to Relay which would not be affected by this. So, if you are also seeing some kind of log output from the Java SDK, could you please post what that is as well? Otherwise, it sounds like this issue has nothing to do with the Java SDK at all and would have been more appropriately posted in https://github.com/launchdarkly/ld-relay (or submitted as a support request at https://support.launchdarkly.com/).

Second: have you considered not having an alert in Splunk for this type of message? Stream disconnections are deliberately logged at WARN level rather than ERROR because they are not especially unusual conditions; a reconnection doesn't necessarily mean the "LD server is unstable", it can be caused by random network issues. The Go SDK logic that is used inside of Relay Proxy to connect to LD is designed to handle this transparently. If at least one minute elapses without being able to get a connection, then it will log a message at ERROR level, on the assumption that customers would generally be setting alerts for ERROR rather than WARN. Also, the Relay Proxy status endpoint described here will tell you if it has lost its connection to LD— again, temporary reconnects are not reported as a problem, but if it stays disconnected for at least a minute (that value is configurable, see "disconnectedStatusTime" here) then the status will change to "disconnected" for that environment. So, a better option for your monitoring might be either to monitor only for ERROR-level logging from the Relay Proxy, or to poll the status endpoint, or both.

devmage commented 2 years ago

@eli-darkly Hi there, I'm working with Junfei on this.

On your first point, you are correct and this may have just been a misunderstanding, apologies there. The logs in question are indeed coming from ld-relay, not from an implemented server SDK client.

We tracked down what was creating the log messages, and landed on the eventsource code that Junfei originally sought to amend. That seemed like a simple addition to us.

On your second point, the genesis of this GH issue and the related eventsource PR is that, while we can see in logs when reconnecting is taking place, we do not see a signal in logs when reconnection is successful. As operators supporting an ld-relay deployment, we'd like to have that information for monitoring.

At a higher level though, I agree that the /status endpoint is probably a better place to get information like this. One challenge there however is that the /status endpoint seems to always return 200 OK, regardless of whether the status body reflects "healthy" or "degraded". That means we have to secondarily process the response body to get at the actual status, rather than simply reading the HTTP status code. Not a show-stopping problem, but enough of a hassle that it prompted us to look next at logs to help answer the question.

The ld-relay code for that is seen here, with an implicit 200 status code: https://github.com/launchdarkly/ld-relay/blob/v6/internal/core/relay_core_endpoints_status.go#L225-L233

So, perhaps instead of changing logging, we could propose a change on ld-relay /status, like:

if !healthy {
  w.WriteHeader(500)
}
_, _ = w.Write(data)

This would be a breaking change however for users that expect the endpoint to always return 200. I would expect that you would seek to gate this behavior with a configuration flag, to have it be an optional behavior change. We can go ahead and propose such a flag, however you are likely better equipped than we are for that. 😉

Let us know how you'd like to proceed here. We can open an issue on ld-relay directly, and/or open a support ticket as you suggest, if you would recommend we do such. Thanks for your help up to now. 👍

eli-darkly commented 2 years ago

@devmage For our own issue tracking, as well as for visibility by other customers who might have a similar concern, I do think the ld-relay repo would be the best place to file this.

We can continue discussion over there, but briefly regarding your last comment: while some change in the behavior of the status resource might be helpful, I don't think an HTTP 500 status would make sense for this. Normally 500 is a catch-all for "the server is unexpectedly broken so that we cannot handle this request at all". In the scenario you're describing, the status endpoint is able to handle the request, and Relay is not broken— it is reporting the status of its connections as intended, and the problem is either on the LD service side or in the network. It's not a 503 situation either, because again, the service you're making the request to is Relay, and Relay is available. In such a situation, Relay is perfectly capable of accepting connections from SDKs and giving them the last known flag data; it's just not able to receive updates from LD at the moment. Also, it's common to have Relay configured to service multiple environments, and it could easily be the case that the connections for 9 environments are just fine while 1 of them is interrupted; in that case, the overall status of "degraded" just indicates that something is not perfect, the details in the status response tell you where the problem is, and using a 5xx status for the overall response would be inappropriate.

devmage commented 2 years ago

@eli-darkly Thank you for your guidance! The nuance on ld-relay's ability to continue serving the last known flag data is definitely something to consider.

Based on your reply, we've brainstormed some more monitoring options that we can pursue with ld-relay as-is. We can go ahead and close this issue, and I don't believe we need to open a subsequent issue on ld-relay at this time. Sorry again for the noise here. We'll see about opening a support ticket in the future, too. Take care!