launchdarkly / ld-relay

LaunchDarkly Relay Proxy

POST requests with identify events 502 starting on or around v6.7.8 #193

Closed joshuaeilers closed 2 years ago

joshuaeilers commented 2 years ago

Is this a support request? This issue tracker is maintained by LaunchDarkly SDK developers and is intended for feedback on the SDK code. If you're not sure whether the problem you are having is specifically related to the SDK, or to the LaunchDarkly service overall, it may be more appropriate to contact the LaunchDarkly support team; they can help to investigate the problem and will consult the SDK team if necessary. You can submit a support request by going here or by emailing support@launchdarkly.com.

Note that issues filed on this issue tracker are publicly accessible. Do not provide any private account information on your issues. If your problem is specific to your account, you should submit a support request as described above.

Describe the bug A clear and concise description of what the bug is.

We noticed that on or around v6.7.7 or v6.7.8, the POST requests began failing for us. v6.7.6 appears to work just fine.

To reproduce Steps to reproduce the behavior.

Install v6.7.8 and watch the POST requests 502.

Expected behavior Should not 502. Maybe it's from the eventsource bump in v6.7.7?

Logs If applicable, add any log output related to your problem.

SDK version The version of this SDK that you are using.

^2.22.1

Language version, developer tools For instance, Go 1.11 or Ruby 2.5.3. If you are using a language that requires a separate compiler, such as C, please include the name and version of the compiler too.

OS/platform For instance, Ubuntu 16.04, Windows 10, or Android 4.0.3. If your code is running in a browser, please also include the browser type and version.

Additional context Add any other context about the problem here.

joshuaeilers commented 2 years ago

This might be an environment-specific issue for us, actually. Still investigating.

eli-darkly commented 2 years ago

@joshuaeilers I can still try to reproduce this in the meantime, but could you clarify about the steps to reproduce: are you talking about POST requests to the Relay Proxy's event endpoints (from SDKs) getting a 502 error, or are you talking about the Relay Proxy making POST requests to deliver the event data to LaunchDarkly and getting a 502 error?

joshuaeilers commented 2 years ago

From the SDK to the Relay Proxy; we get a 502 error from our load balancer (at least I suspect the LB, but I have not proven that theory yet). Only in prod; our non-prod environments are fine. GET requests on prod also work fine. I don't think we're getting any response back from the LD server itself, but that is just a hunch based on not seeing anything indicating such in the logs.

eli-darkly commented 2 years ago

Hmm, well, it sounds to me like an LB issue is the most likely explanation. I did try to reproduce this using v6.7.8 of the Relay Proxy, and couldn't. I used a minimal config file—

[Main]
logLevel=debug

[Events]
sendEvents=true

[Environment "eli"]
sdkKey=<my test SDK key>

—and then sent some valid event data to it. I got a 202 Accepted status on that request, as expected, and then a second later the Relay Proxy sent the same event data to LD and again got a 202 Accepted.
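(For illustration only, and not part of the original comment: sending event data by hand might look roughly like the sketch below. The port 8030, the /bulk path, the Authorization and X-LaunchDarkly-Event-Schema headers, and the identify payload are all assumptions based on Relay Proxy defaults and the debug log shown further down, not values confirmed in this thread.)

// Minimal sketch of posting analytics events to a locally running Relay Proxy.
// Runnable with Node 18+ (global fetch). All values below are placeholders.
const sdkKey = "sdk-xxxxxxxx"; // placeholder SDK key
const relayEventsUrl = "http://localhost:8030/bulk";

// An illustrative identify event; real SDKs build these payloads internally.
const events = [
  {
    kind: "identify",
    creationDate: Date.now(),
    key: "example-user",
    user: { key: "example-user" },
  },
];

async function postEvents(): Promise<void> {
  const res = await fetch(relayEventsUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: sdkKey,
      "X-LaunchDarkly-Event-Schema": "3",
    },
    body: JSON.stringify(events),
  });
  // A healthy Relay Proxy with sendEvents=true should answer 202 Accepted here.
  console.log(`Relay responded with status ${res.status}`);
}

postEvents().catch(console.error);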

Also, looking at the code, I don't see any condition in the Relay Proxy logic that would cause it to return a 502. It can return a 503 if event proxying is not turned on, but we don't use 502 for anything. And the code changes in v6.7.6 and v6.7.7 are not related to analytics event posts at all; the name "eventsource" refers to a different thing.

So if there's any possibility that your load balancer configuration may have changed, I think that is a more likely thing to look at. If the load balancer is returning a 502, then it is presumably not forwarding the POST to the Relay Proxy at all. One way to verify that, if you can do it conveniently, would be to turn on debug logging in the Relay Proxy temporarily (it is very verbose): when debug logging is enabled, Relay will log something like this every time it receives a POST:

2022/06/22 12:42:25.537962 DEBUG: [env: ...0a91] Received 2 events (v3) to be proxied to /bulk
2022/06/22 12:42:25.538235 DEBUG: Request: method=POST url=/bulk auth=*a0a91 status=202 bytes=0

So if requests are happening and getting 502 errors, but Relay doesn't log anything like that, that would indicate that the requests are being blocked before they get to it.
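(An editorial aside, not from the original comment: another quick way to separate "the LB returned the 502" from "the Relay Proxy returned the 502" is to hit the service both through the load balancer and directly. A rough sketch follows; the hostnames are placeholders, and it assumes the Relay Proxy's /status health endpoint is reachable.)

// Sketch of comparing a request through the load balancer with a direct request
// to the Relay Proxy. Hostnames and port are placeholders, not values from this thread.
const targets = [
  "https://relay-via-lb.example.com/status",        // through the load balancer
  "http://relay-internal.example.com:8030/status",  // straight to the Relay Proxy
];

async function probe(): Promise<void> {
  for (const url of targets) {
    try {
      const res = await fetch(url);
      console.log(`${url} -> ${res.status}`);
    } catch (err) {
      console.log(`${url} -> request failed: ${err}`);
    }
  }
}

// If the direct request succeeds while the LB path returns 502, the problem is
// in front of the Relay Proxy rather than inside it.
probe();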

joshuaeilers commented 2 years ago

We just re-deployed and the issue resolved itself; any of these versions works now. Maybe something got into a funk while it was left running for an extended period of time? It definitely doesn't reproduce for me now, but I will circle back if it resurfaces.

joshuaeilers commented 2 years ago

This is a graph of the Relay Proxy's memory usage over time from when we started noticing this problem. It all started around June 15/16th, which is when you can see the memory begin sawtoothing.

[image: graph of Relay Proxy memory usage over time]

eli-darkly commented 2 years ago

Well, it's normal to see some transient memory spikes if you have a fairly large amount of flag data, because some buffering happens whenever Relay receives that data from LaunchDarkly. I'm not sure what you mean by "begin sawtoothing", because I don't see any spikes on that graph after June 15, just a gradual smooth increase; whether that is normal would depend on whether the size of your data or the number of connected clients had increased.

But, again, I'm not sure how to account for a 502 being returned by the Relay Proxy under any circumstances; the code never explicitly sets such a response status, and it doesn't look to me as if we are using any kind of server framework that would ever set such a status for us.

joshuaeilers commented 2 years ago

I guess what I mean is that on June 16 the used memory began to climb, at the same time our LD events dropped off into the abyss. The timing is just very suspicious. When we re-rolled the Relay Proxy we didn't touch the LB, so the only variable here was restarting the proxy.

We should have very little flag data, like 4 flags (we're a new LD customer). We're also setting streaming: false (sketched below), so we're not making any persistent client connections, at least not intentionally. We'll keep an eye on this and see if it resurfaces.

[image: second graph of Relay Proxy memory usage]

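(Again for illustration, not from the original comment: a client configured this way might look roughly like the sketch below. It assumes the JavaScript browser SDK, launchdarkly-js-client-sdk, which the ^2.22.1 version string suggests but the thread does not confirm; the client-side ID, user key, flag key, and relay URL are placeholders.)

import * as LDClient from "launchdarkly-js-client-sdk";

// Sketch of an SDK client with streaming disabled, pointed at a Relay Proxy.
// All identifiers and URLs here are placeholders.
const client = LDClient.initialize(
  "my-client-side-id",
  { key: "example-user" },
  {
    baseUrl: "https://relay.example.com",   // flag evaluation GETs
    eventsUrl: "https://relay.example.com", // the analytics event POSTs that were 502ing
    streaming: false,                       // no persistent streaming connection
  }
);

client.on("ready", () => {
  console.log("example-flag:", client.variation("example-flag", false));
});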
eli-darkly commented 2 years ago

@joshuaeilers Hmm, yeah, that is odd. I'm still having trouble thinking of any condition that could result in a 502 from this service itself, rather than from some other layer in front of the service, but I'll keep thinking about it.