launchdarkly / node-server-sdk

LaunchDarkly Server-side SDK for Node
Other
79 stars 65 forks source link

Upgrade from 6.2.1 to 6.2.2 results in process hanging #264

Closed sprak3000 closed 1 year ago

sprak3000 commented 2 years ago

I was initialing attempting to upgrade from 6.2.1 of the SDK up to 6.4.3 but found it would hang my service after making a successful call to LD. Gradually backed down versions to find an upgrade to 6.2.2 results in the same behavior. A successful run under 6.2.1 results in these logs:

info: [LaunchDarkly] Initializing stream processor to receive feature flag updates
{"http":{"status_code":202},"level":"info","message":"POST events.launchdarkly.com","request":{"body":{"configuration":{"allAttributesPrivate":false,"connectTimeoutMillis":5000,"customBaseURI":false,"customEventsURI":false,"customStreamURI":false,"dataStoreType":"custom","diagnosticRecordingIntervalMillis":900000,"eventsCapacity":10000,"eventsFlushIntervalMillis":5000,"inlineUsersInEvents":false,"offline":false,"pollingIntervalMillis":30000,"reconnectTimeMillis":1000,"socketTimeoutMillis":5000,"streamingDisabled":false,"userKeysCapacity":1000,"userKeysFlushIntervalMillis":300000,"usingProxy":false,"usingProxyAuthenticator":false,"usingRelayDaemon":false},"creationDate":1668879868734,"id":{"diagnosticId":"<redacted>","sdkKeySuffix":"<redacted>"},"kind":"diagnostic-init","platform":{"name":"Node","nodeVersion":"18.12.1","osArch":"x64","osName":"MacOS","osVersion":"22.1.0"},"sdk":{"name":"node-server-sdk","version":"6.2.1"}},"headers":{"authorization":"*****","content-type":"application/json","host":"events.launchdarkly.com","user-agent":"NodeJSClient/6.2.1","x-datadog-parent-id":"<redacted>","x-datadog-sampling-priority":"1","x-datadog-tags":"_dd.p.dm=-0","x-datadog-trace-id":"<redacted>"}},"response":{"body":"","headers":{"connection":"close","content-length":"0","content-type":"application/json","date":"Sat, 19 Nov 2022 17:44:28 GMT","strict-transport-security":"max-age=31536000"}},"tags":{"graphql_node":"","method":"POST","operation":"REST","protocol":"https:","type":"outbound request","url":"/diagnostic"},"timestamp":"2022-11-19T17:44:28.775Z"}
info: [LaunchDarkly] Opened LaunchDarkly stream connection

When I run under 6.2.2, I receive the first two logs saying it is initializing the stream and the successful response but do not see the opened stream connection log ever appear. The request to my service just hangs until I cancel it. I have a different service that is using 6.4.3 of the service with no issue. Only difference I can find between them is the 6.4.3 service also uses the Redis feature store from launchdarkly-node-server-sdk-redis. Otherwise, I can't find any difference between them causing the discrepancy in behavior.

Feels like something subtle, maybe another dependency or something interfering. Hoping someone might have some insight here.

eli-darkly commented 2 years ago

Based on other reports we've seen, the most likely explanation is that another dependency is interfering. I would look first at any kind of middleware product that intercepts HTTP requests. For instance, some versions of Google Cloud Trace are known to have this issue. Briefly, the problem (if this is the same problem) is when a third-party library monkey-patches http.request or https.request, but does not always correctly pass the original parameters through to the original Node function, a mistake that quite a few developers have made because those functions support several different parameter types.

One thing that is surprising to me, if that is the explanation, is that you're not seeing any errors logged— the other reports produced visible errors. But it's possible that different conflicting libraries have different failure modes.

eli-darkly commented 2 years ago

Oh, I forgot to say: one reason that I immediately suspect a problem of this kind is that the v6.2.2 release was specifically about changing the parameters that our code passes to https.request. This was originally motivated by a different third-party incompatibility as described here— but one where the other code was behaving more correctly than ours was. Unfortunately, fixing that in v6.2.2 revealed flaws in other products that happened to be incompatible with the correct parameters, and whenever we've had reports of any kind of problem with network requests that only showed up in 6.2.2, it's turned out to be that.

sprak3000 commented 2 years ago

@eli-darkly

🤔 We are using a Google library. I'll start there and see if anything shakes loose. Generally, what parameters might I look for to see if they are missing?

eli-darkly commented 2 years ago

Generally, what parameters might I look for to see if they are missing?

I can't give a very specific answer to that, because there are so many ways a monkey-patching of https.request could have been implemented.

Basically what you would be looking for is... if you take a look at how Node defines possible parameters to https.request here, any code that is intercepting that call has to allow for any of those possible parameter combinations, and correctly pass them through to the real Node request. So it has to work for just (options), or for (options, callback), or for (url), or for (url, options), or for (url, callback), or for (url, options, callback); and, it has to allow for url being either a string or a URL, and for options being either an object or a string or a URL.

eli-darkly commented 2 years ago

So for instance, if they had defined their interceptor to take only (options, callback) or (url, callback), and they were then calling https.request with those same two parameters in all cases, they would break the LaunchDarkly SDK because in some cases we need to pass (url, options, callback).

eli-darkly commented 2 years ago

Also, I meant to mention that the specific Google library we know to be problematic is @google-cloud/trace-agent— and more specifically, versions earlier than 4.0.0, as reported here. So if you're using that, I'd say that's your answer.

sprak3000 commented 2 years ago

@eli-darkly

We weren't using that Google library, and it wasn't the issue. Still looking into it, but your explanation is exceedingly helpful. I have a better picture now of what to look for through our own code and what we bring in. Much appreciated!

eturino commented 1 year ago

We have the same issue here with Raygun http https://www.npmjs.com/package/raygun-apm When running with node -r raygun-apm/http it hangs and waitForInitialization never resolves.

It works with sdk 6.2.1 but fails with any more recent version.

eli-darkly commented 1 year ago

I'd like to emphasize that if anyone is seeing an error in the general category of "when I updated the LaunchDarkly Node SDK from 6.2.1 to 6.2.2, it started failing"— especially if that failure is "initialization hangs", and especially if there is any accompanying error message along the lines of "The 'listener' argument must be of type function. Received an instance of Object", when it tries to do an HTTP request— it is just about certain that the issue is a badly behaved third-party HTTP middleware package. The reason we're able to be so confident about this is 1. we have investigated many such reports, and 2. the changes from 6.2.1 to 6.2.2 were very localized and basically just boil down to us calling https.request with different parameters than before— valid parameters, but the difference is likely to bring out any problems that exist in HTTP middleware. So if you are seeing such issues, we strongly advise that you take a close look at your dependencies, and for any kind of middleware/metrics/tracing package that could possibly affect HTTP requests, make sure you are using an up-to-date version and make sure there are no known incompatibilities with other packages.

In the most recent report we had like this, the problem turned out to be that the customer was using an obsolete version of the New Relic agent that used an obsolete version of agent-base along with a version of the AppDynamics agent that did not support that version of agent-base. We were able to help them track down that problem, but it was not a LaunchDarkly problem and the customer could have avoided a very long support process by taking a closer look at their dependencies.

eli-darkly commented 1 year ago

I don't say that to be dismissive of the possibility that some other kind of problem could happen that really is our problem, but just to raise awareness of this general issue to save everyone time in troubleshooting.

matt-dutchie commented 1 year ago

@eli-darkly Do ya'll have a shortlist of the most likely dependency culprits? I'm having a hell of a time narrowing things down in our node API codebase...

matt-dutchie commented 1 year ago

Ah -- and of course right after I ask that question I realize we are also using an old version of agent-base through the deprecated @google/maps package. Removing that solved the issue... I'll see myself out... 💃🏻

kinyoklion commented 1 year ago

Looks like this issue was resolved. Closing it.