launchdarkly / js-core

LaunchDarkly monorepo for JavaScript SDKs

allFlagsState spends more time resolving results when under heavy load #230

Closed randing89 closed 1 year ago

randing89 commented 1 year ago

Describe the bug We found that under heavy load, allFlagsState turns async requests into a sort of waterfall. This significantly reduces server performance, causing timeouts.

Assume we have 50 requests coming at the same time. They all need to get all the flags.

Based on the test in https://github.com/launchdarkly/js-core/issues/229, and despite the poor performance, let's say 700ms is the "normal" latency for a single function call. If we fire 50 requests at the same time, I expect each request to complete in around 700ms, maybe with some additional overhead.

However, when running the following test code, each request took 35s+ to complete, which is roughly 700ms * 50. It seems the client has some sort of internal state that makes requests wait on each other.

In addition, the SDK uses a lot of RSS memory but not much heap, which is unusual.

const ld = require("@launchdarkly/node-server-sdk");

function byteToMB(bytes) {
  return `${Math.floor(bytes / 1024 / 1024)}MB`;
}

async function fetchFlags() {
  const client = ld.init("{sdk_key}");
  client.once("ready", async () => {
    console.log(
      `heap: ${byteToMB(process.memoryUsage().heapUsed)}, rss: ${byteToMB(
        process.memoryUsage().rss
      )}`
    );
    // Fire 50 allFlagsState calls concurrently and log how long each one takes.
    Array(50)
      .fill()
      .forEach(async (_, i) => {
        const start = performance.now();
        await client.allFlagsState({
          key: String(Math.random()), // context keys must be strings
        });
        console.log(
          `${i + 1} call took ${performance.now() - start}, heap: ${byteToMB(
            process.memoryUsage().heapUsed
          )}, rss: ${byteToMB(process.memoryUsage().rss)}`
        );
      });
  });
}

fetchFlags();
[screenshot: console output from the test above]

Expected behavior allFlagsState calls should be independent of each other.

SDK version The issue happens in @launchdarkly/node-server-sdk@8.1.1, launchdarkly-node-server-sdk@7.0.2, and launchdarkly-node-server-sdk@7.0.0

It was working fine in launchdarkly-node-server-sdk@6.4.3

Language version, developer tools Node 16 & 18

kinyoklion commented 1 year ago

I've put most of my comments in the other issue, but adding some here.

The largest part of the performance difference seems to come from the polyfills introduced by how we were compiling TypeScript. In my test cases, after this change, I was seeing about a third less RSS usage for the newer SDK than for the old one.

This behavior is impacted by that, but the "spreading" effect will be similar. The code is actually more concurrent now, so when you start many things they interleave heavily because the Node scheduler has many opportunities to switch between them. If they were sequential, each would take its full execution time while blocking every other task during that period.

It wouldn't be exactly like this, but this is the gist of what happens: if you have 2 tasks with 1 second of work each and they execute sequentially, each takes 1 second and the total execution time is 2 seconds. If the same 2 tasks execute concurrently (basically alternating between them until they complete), the total execution time is still 2 seconds, but each individual task now also takes 2 seconds.
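
To make that concrete, here is a small self-contained sketch (not SDK code) that simulates two tasks of roughly one second of chunked CPU work, each yielding to the event loop between chunks:

import { performance } from 'node:perf_hooks';

// Roughly one second of CPU-bound work, split into chunks that yield between them
// so that other in-flight tasks can interleave at each yield point.
async function oneSecondOfWork(label: string): Promise<void> {
  const start = performance.now();
  for (let chunk = 0; chunk < 100; chunk += 1) {
    const chunkEnd = performance.now() + 10;
    while (performance.now() < chunkEnd) {
      // Busy-wait ~10ms to stand in for evaluation work.
    }
    await Promise.resolve(); // Yield point.
  }
  console.log(`${label}: ${Math.round(performance.now() - start)}ms`);
}

async function main(): Promise<void> {
  // Sequential: each task reports ~1000ms; the pair takes ~2000ms in total.
  await oneSecondOfWork('sequential A');
  await oneSecondOfWork('sequential B');

  // Concurrent: the pair still takes ~2000ms in total, but each individual task
  // also reports ~2000ms because the two alternate at every yield point.
  await Promise.all([oneSecondOfWork('concurrent A'), oneSecondOfWork('concurrent B')]);
}

main();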

The old code was less concurrent and more sequential: because it used callbacks internally, it would keep executing synchronously whenever it didn't hit an IO path.

I am looking at some minor adjustments to avoid async paths when they are not needed, but chances are that the two versions will never behave exactly the same. The best goal is to increase the per-evaluation performance, which decreases the overall time in both situations.

I did see this same behavior even with the older SDKs, depending on the payload, but they were much faster before the TypeScript change.

Thanks, Ryan

randing89 commented 1 year ago

Hi,

Thanks for looking into this.

If you have 2 tasks with 1 second of work each and they execute sequentially, each takes 1 second and the total execution time is 2 seconds. If the same 2 tasks execute concurrently (basically alternating between them until they complete), the total execution time is still 2 seconds, but each individual task now also takes 2 seconds.

That makes sense. I have updated the issue title to reflect that.

In our code we have logic like this. We need to send the flags over to another service.

async function requestExternalService() {
  // ...
  const flags = (await client.allFlagsState()).toJSON();
  await fetch(/* ... send the flags over to the other service ... */);
  // ...
}

Because of this interleaving behavior, requestExternalService now takes a significant time to finish under heavy load. In our production environment the execution can sometimes exceed our timeout setting of 5 seconds.

kinyoklion commented 1 year ago

@randing89,

It should already be much faster with the update, but I am still looking into the code. I am investigating the performance of a callback-based conversion and will see how that goes, or maybe we can find some middle ground.

The problem with plain callbacks, of course, is that they make the code much harder to follow and maintain than async/await.

Thank you, Ryan

randing89 commented 1 year ago

@kinyoklion Thanks!

On a side note, the RSS memory usage in @launchdarkly/node-server-sdk@8.2.0-beta.1 looks much better than in all the previous versions! It's down from 1221MB to 280MB compared with 6.4.3 when running the demo above.

This is how it looks in production (using 6.4.3). The blue line is RSS memory; it was occasionally causing kernel OOM kills.

[screenshot: production RSS memory chart]
kinyoklion commented 1 year ago

@randing89 I've pushed a new pre-release, 8.1.3-beta.1. The first set of changes was released in 8.1.2 (instead of 8.2.0, as my pre-release had been numbered).

The changes in 8.1.3-beta.1 are not yet fully validated, but I want to see how they work for you with regards to performance and memory usage.

The major change is to make the main path as synchronous as possible using callbacks. This does make performance much better; locally my memory usage was similar, but using callbacks has the potential to increase the memory footprint. The performance numbers I am getting are quite a bit different from yours at baseline, but for me this was another 2x improvement, which is basically the same as 6.4.3. However, you were seeing better performance from 6.4.3 than I am. In regards to memory, this may be somewhat of a memory-versus-performance trade-off.
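
As a rough sketch of that idea (hypothetical names, not the SDK's actual internals): the evaluation path takes a callback and completes synchronously whenever the flag data is already in memory, only deferring when it genuinely has to hit IO, such as a persistent store:

// Hypothetical "synchronous when possible" callback path.
interface FlagSource {
  inMemory: Map<string, unknown>;
  // Stands in for a persistent store lookup (redis/dynamo/etc.).
  fetchFromStore(key: string): Promise<unknown | undefined>;
}

function getFlag(
  source: FlagSource,
  key: string,
  cb: (flag: unknown | undefined) => void,
): void {
  const cached = source.inMemory.get(key);
  if (cached !== undefined) {
    // Common case: the data is already in memory, so invoke the callback
    // synchronously instead of scheduling a microtask for every evaluation.
    cb(cached);
    return;
  }
  // Only this path defers to the event loop.
  source.fetchFromStore(key).then(cb, () => cb(undefined));
}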

The other change is to segment handling. Your payload involves relatively large segments, so this release contains an optimization for lookups in segments with more than 100 items.

In testing, the concurrent use case is still quite a bit different. For me, 6.4.3 does handle the individual requests in less time, but its overall time is 2x the overall time of the new version.

I would appreciate any information about how 8.1.3-beta.1 runs for you.

Thanks, Ryan

kinyoklion commented 1 year ago

I am seeing some interesting additional behavior with regard to node.

With 8.1.3-beta.1 I get about 50% of the performance in Node 16 compared to Node 18. On Node 16, 8.1.3-beta.1 and 8.1.2 perform basically the same. On Node 18, 8.1.3-beta.1 has greater than 2x the performance of 8.1.2.

So it is possible there is something that is problematic specifically in Node 16.

kinyoklion commented 1 year ago

I've identified the Node 16 problem. Internally we have a discriminated union implemented using classes; changing it to plain interfaces substantially improves the performance in Node 16.
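
Roughly, the shape of that change looks like this (illustrative type names, not the SDK's actual ones):

// Class-based discriminated union: every result allocates a class instance,
// which turned out to be noticeably slower on Node 16.
class ErrorResult {
  readonly kind: 'error' = 'error';
  constructor(public readonly message: string) {}
}
class SuccessResult<T> {
  readonly kind: 'success' = 'success';
  constructor(public readonly value: T) {}
}
type ClassBasedResult<T> = ErrorResult | SuccessResult<T>;

// Interface-based discriminated union: plain object literals with a tag field.
interface ErrorResultRecord {
  kind: 'error';
  message: string;
}
interface SuccessResultRecord<T> {
  kind: 'success';
  value: T;
}
type InterfaceBasedResult<T> = ErrorResultRecord | SuccessResultRecord<T>;

function unwrap<T>(result: InterfaceBasedResult<T>, fallback: T): T {
  // Narrowing on the tag works the same either way; only the allocation differs.
  return result.kind === 'success' ? result.value : fallback;
}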

randing89 commented 1 year ago

Thanks for the update.

With Node 18, 8.1.3-beta.1 shows very similar results to 6.4.3 in the individual test. It solved the issue in #229. The concurrent test is interesting: in 6.4.3 the latency decreases with the number of requests fired. Is that because some sort of cache kicked in? I don't see such behavior in 8.1.3-beta.1.

Here are my numbers:

8.1.3-beta.1

[screenshot: timing results]

6.4.3

[screenshot: timing results]
kinyoklion commented 1 year ago

@randing89

I think that in 6.4.3 the interruptions for the individual requests decrease as more of them complete, whereas in 8.1.3-beta.1 more of them remain in flight and are serviced concurrently.

So, with this particular test it would also be of interest to time the entire set, starting a timer before the batch and stopping it after. I suspect that the two numbers will be substantially different.
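
For example, the repro above could be extended along these lines so that it reports both the per-call times and the wall-clock time for the whole batch (`client` here is the initialized client from that repro):

import { performance } from 'node:perf_hooks';

// Wrap the 50 concurrent allFlagsState calls so the total batch time is measured too.
async function timeBatch(client: {
  allFlagsState: (context: { key: string }) => Promise<unknown>;
}): Promise<void> {
  const batchStart = performance.now();
  await Promise.all(
    Array.from({ length: 50 }, async (_, i) => {
      const start = performance.now();
      await client.allFlagsState({ key: `user-${i}` });
      console.log(`call ${i + 1}: ${Math.round(performance.now() - start)}ms`);
    }),
  );
  console.log(`entire batch: ${Math.round(performance.now() - batchStart)}ms`);
}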

As far as the SDK goes, there isn't any cache that isn't already primed as soon as initialization is complete. As far as Node itself goes, there are probably many things that change as calls are made: lazy compilation and so forth.

The new code is now very close to the level of sequential execution of the earlier SDKs, but Node still seems to be interleaving the operations more.

It could be related to some stack overflow mitigation that both SDKs have, but the old SDK was using an external library for part of that execution. Basically, when everything is callbacks we have to use callback-based loops (because, if the SDK is using redis/dynamo/etc., the flag may not be in memory). These would be prone to stack overflows, so when the loop is over a certain size we instead defer execution. We used to defer execution using setImmediate; now we do so by resolving a promise (for broadest compatibility with edge environments and other more restricted JS environments). Theoretically they should have similar behavior.
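
For reference, a bare-bones sketch of that pattern (hypothetical helpers, not the SDK's actual code): a callback-based loop that runs synchronously most of the time, but defers its continuation every so many iterations so the stack cannot grow without bound, using either setImmediate or an already-resolved promise.

// Visit `items` with a callback-based loop, deferring every `batchSize` iterations
// so that deep callback chains cannot overflow the stack.
function callbackLoop<T>(
  items: T[],
  visit: (item: T) => void,
  done: () => void,
  defer: (continuation: () => void) => void,
  batchSize = 1000,
): void {
  let index = 0;
  const step = (): void => {
    const end = Math.min(index + batchSize, items.length);
    while (index < end) {
      visit(items[index]);
      index += 1;
    }
    if (index >= items.length) {
      done();
    } else {
      defer(step); // Break the synchronous chain here.
    }
  };
  step();
}

// The two deferral strategies being compared: a macrotask versus a microtask.
const deferWithSetImmediate = (fn: () => void): void => {
  setImmediate(fn);
};
const deferWithResolvedPromise = (fn: () => void): void => {
  Promise.resolve().then(fn);
};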

So I can check into both of those.

Thank you, Ryan

kinyoklion commented 1 year ago

I've discovered how to re-create this behavior. While several allFlags calls were executing concurrently, the individual flag evaluations within each call were sequential, where they used to be concurrent as well. Making them concurrent again produces similar behavior to the old SDK. I'm not certain that is the best or directly equivalent approach yet, because the worst time is worse and the best time is better.

Initial result: [screenshot]

8.1.3-beta.1 result: [screenshot]

Note how 8.1.3-beta.1 takes less total time.

I think the amount of concurrency basically saturates the available capacity, which makes things execute somewhat more sequentially.

I'll see if there is some better balance that can keep individual times lower.

Thank you, Ryan

kinyoklion commented 1 year ago

This was excluding the set-based optimization for segments that was in 8.1.3-beta.1.

kinyoklion commented 1 year ago

@randing89 Hey Ran,

Hopefully this is the last pre-release before we get things merged and resolve these two issues.

I've released 8.2.1-beta.1. This includes all the changes to the async behavior.

It doesn't contain the additional optimization for large segments; that will come in a follow-up, and hopefully this release addresses the remaining concerns. That change will bring a little more performance as well, but I don't want to combine too many changes, and I would like some extra validation time on it.

In testing I have noticed that the run-to-run variance for 50 concurrent requests is high. If I loop doing 50 at a time, the performance improves for the first few iterations, likely as Node optimizes things.

Thanks, Ryan

randing89 commented 1 year ago

@kinyoklion Thanks, we will give it a try and post back the results later.

randing89 commented 1 year ago

@kinyoklion Upgrading from 6.4.3 to 8.2.1-beta.1, we saw a 50ms to 100ms increase in p99 when calling allFlagsState, depending on the volume of requests. Note that it's declining in the chart below because our traffic gets smaller off-peak.

The lines in the screenshot are p50, p90, and p99 latency.

[screenshot: latency percentiles]

The RSS memory looks much more stable; before the change it constantly spiked over 2GB per process.

[screenshot: RSS memory]
kinyoklion commented 1 year ago

@randing89 Thank you for the feedback. Good to see the memory usage remained low at least.

You were also seeing degraded performance for 7.0 compared to 6.4.3, so I must now consider that the algorithmic changes we added in 7.0 may be contributing.

I will let you know what I find.

Thank you, Ryan

kinyoklion commented 1 year ago

@randing89 8.2.1 has been released and should provide similar performance to 6.4.3.

I am still working on additional changes that should result in greater performance, but it may take a little while for those to be validated.

Thanks, Ryan

randing89 commented 1 year ago

@kinyoklion Since bumping to 8.2.1-beta.1 and 8.2.1 we are getting millions of log lines saying Variation called before LaunchDarkly client initialization completed (did you wait for the 'ready' event?) - using last known values from feature store. I can't find any place that's calling variation before init, because we only start listening for new requests once LD is ready. Also, I assume that even if that happens it shouldn't persist, since the init will eventually finish?

When I tested locally, calling before init produced a different message: Variation called before LaunchDarkly client initialization completed (did you wait for the 'ready' event?) - using default value. Do you know why this may happen?

randing89 commented 1 year ago

Update: according to https://docs.launchdarkly.com/sdk/concepts/data-stores#initialization-when-working-with-persistent-feature-stores, it looks like the message comes from feature stores. But in our setup the LD client is not configured with a persistent feature store.

kinyoklion commented 1 year ago

@randing89 I will look into it. It could be a problem with the conversion to callbacks. I will get it patched if it is.

kinyoklion commented 1 year ago

@randing89 Do you see any other messages?

I double checked locally that I do not see this for either single variation or all flags evaluations. I need to determine what is different.

Do you have any special configuration?

Thank you, Ryan

kinyoklion commented 1 year ago

@randing89 I did find a minor problem in the code around this message. I am fixing that. I do not see why it would cause that to be logged in volume.

When doing individual variation calls do you use callbacks or promises?

Thank you, Ryan

kinyoklion commented 1 year ago

I fixed the bug I was able to find and released that in 8.2.2. It would only have had an effect for a flag that was evaluated before initialization, though. In that scenario it would have executed the callback twice.

So I am looking for anything that would cause early execution of the ready callback.

Thank you, and sorry about that.

kinyoklion commented 1 year ago

I found 2 additional situations where this warning could be produced in error, neither of them specific to 8.2.1. I don't think they would have been the cause of the condition for you either.

The first is when an event handler is registered before the client is ready: it can get a change event before the ready event. If an evaluation was then done in that handler, the store would be initialized, but the client would not yet know it was ready.
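
For context, that first situation corresponds roughly to a pattern like this (the flag key and context values are made up; in the server-side Node SDK the flag-change event is 'update'):

import * as ld from '@launchdarkly/node-server-sdk';

const client = ld.init('{sdk_key}');

// A flag-change handler registered before the client is ready can fire before the
// 'ready' event. An evaluation made inside it sees an initialized store while the
// client does not yet consider itself initialized, so the "Variation called before
// LaunchDarkly client initialization completed" warning is logged in error.
client.on('update', async () => {
  const value = await client.variation('example-flag', { key: 'example-user' }, false);
  console.log('flag changed, new value:', value);
});

client.once('ready', () => {
  console.log('client ready');
});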

The second was much less likely: if the SDK received an invalid payload after it was initialized, it would become un-initialized.

I am holding those for release until I have done more investigation.

kinyoklion commented 1 year ago

I would be especially interested if you encountered any messages containing "Stream received invalid data in".

randing89 commented 1 year ago

I would be especially interested if you encountered any messages containing "Stream received invalid data in".

Yep, we received about 12k Stream received invalid data in "patch" messages in the past 24 hours.

randing89 commented 1 year ago

When doing individual variation calls do you use callbacks or promises?

We use promise.

kinyoklion commented 1 year ago

I would be especially interested if you encountered any messages containing "Stream received invalid data in".

Yep, we received about 12k Stream received invalid data in "patch" messages in the past 24 hours.

Well, then that is likely the problem. The changes I have would likely prevent the message, but we need to understand why the patches are not parsing.

It should log some additional data. It would be good if I could find out whether it was a flag or a segment, and its name.

kinyoklion commented 1 year ago

The entire message would be best, but I realize you may need to provide that through support instead of here.

randing89 commented 1 year ago

The entire message would be best, but I realize you may need to provide that through support instead of here.

I didn't see any additional data; that message is the entire message. There are also a few hundred Received I/O error ([object Object]) for streaming request - will retry messages.

kinyoklion commented 1 year ago

@randing89 Thank you for that information.

kinyoklion commented 1 year ago

@randing89 I was able to capture an offending payload with the debugger, so I have something to work from. Again, thank you for your patience.

kinyoklion commented 1 year ago

@randing89 The source of the invalid patch was a problem with the allocation algorithm used in the event source. Basically, if we received a chunk that contained the last part of one message and the start of another (but not that entire second message), we would not handle the next part of that message correctly. That in turn caused the client to think it was no longer initialized.
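
To illustrate the class of bug (a generic sketch, not the SDK's actual event-source code): a chunk parser has to keep any trailing partial line in a buffer and prepend it to the next chunk; dropping it, or parsing it too early, is what turns a message split across chunks into an invalid patch.

// Minimal sketch of carrying a partial SSE line across chunk boundaries.
class LineBuffer {
  private remainder = '';

  // Returns only complete lines; anything after the last newline is kept
  // and prepended to the next chunk instead of being parsed too early.
  push(chunk: string): string[] {
    const data = this.remainder + chunk;
    const lines = data.split('\n');
    this.remainder = lines.pop() ?? '';
    return lines;
  }
}

// A "patch" event whose data line is split across two chunks is only surfaced
// once the full line has been reassembled.
const buffer = new LineBuffer();
const lines = [
  ...buffer.push('event: patch\ndata: {"pa'),
  ...buffer.push('th": "/flags/example", "data": {}}\n\n'),
];
console.log(lines);
// ['event: patch', 'data: {"path": "/flags/example", "data": {}}', '']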

Generally this didn't happen very often, but it could have been amplified by certain patterns, like automated updates of large segments resulting in many back-to-back updates.

Both of these issues are resolved in 8.2.3. I let a test soak over the weekend and didn't encounter any invalid patches.

In regards to the I/O error, that happens occasionally and should not be an issue. It is most often associated with our back-end deploys: a server goes away and a new connection must be established.

Thank you, Ryan

randing89 commented 1 year ago

Generally this didn't happen very often, but it could have been amplified by certain patterns, like automated updates of large segments resulting in many back-to-back updates.

Yes, we have a system on top of LD to automate rollouts.

@kinyoklion Thanks, we will try it and post back the results.

randing89 commented 1 year ago

@kinyoklion 8.2.3 has been deployed to production for 2 days. We haven't seen the issue come back. Thanks!

kinyoklion commented 1 year ago

@randing89 That is great news. Are we good to close these issues?

randing89 commented 1 year ago

@kinyoklion All good now. I can close the issues. Thanks!