grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

Question: node grpc versus grpc-js performance #1797

Open hardboiled opened 3 years ago

hardboiled commented 3 years ago

My team just updated our internal tools to use grpc-js in lieu of grpc. A primary motivation for this is that we want to be able to use later versions of node and no longer use a library that's deprecated.

After a couple of weeks, we've identified that the Node 12 + grpc to Node 14 + grpc-js upgrade has resulted in significant performance degradation for one of our endpoints. This endpoint makes a protobuf network call for a repeated type that returns 100+ objects. Locally we're seeing a 3x latency increase between the two versions of the endpoint, but we haven't debugged the issue beyond swapping the libraries so far, so there is a chance it could be specific to our app.

Is this a known tradeoff between the two libraries, potentially for example that serialization takes longer with the raw JS version? Are there recommended optimizations or configurations that I might not be following that would be contributing to our problem? I would really prefer to keep grpc-js and try to optimize code rather than doing a rollback if possible.

murgatroid99 commented 3 years ago

Yes, grpc-js is known to be slower than the older grpc. I don't have any application-level optimizations to suggest to you; I believe the increased latency comes from inefficiencies in the library itself, and/or in the runtime.

hardboiled commented 3 years ago

OK, thanks for confirming!

Based on your comment here,

The intended and recommended usage of gRPC is to create a connection once and then use it to make many requests, so in our own benchmarks we generally focus on the request time, and consider the connection establishment time to be amortized over many requests.

we're investigating our connection pooling to see if we can make some improvements that will mitigate the performance impact. Will report back here with findings.

murgatroid99 commented 3 years ago

Right, that is a recommendation that applies to both implementations, though the exact performance impact may differ between the two. In general, we do not recommend that you do your own connection pooling. Rather, you should usually create a single Client object and use it to make all requests. The exception is if you are running into streams per connection limits on the server, in which case you will need to make multiple clients and force them to use different connections by setting the grpc.use_local_subchannel_pool channel option to 1.
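
As a rough sketch, forcing multiple clients onto separate connections with that option looks like this (ExampleServiceClient, the target address, and the credentials are placeholders for your own setup):

```ts
import * as grpc from '@grpc/grpc-js';
// ExampleServiceClient is a placeholder for whatever client class your proto
// tooling generates; the channel option is the part that matters here.
import { ExampleServiceClient } from './generated/example_grpc_pb';

const target = 'my-service.example.com:443';

// Each client created with this option gets its own subchannel pool, so each
// one opens and keeps its own HTTP/2 connection instead of sharing the
// process-wide pool (and therefore a single connection).
const clients = Array.from(
  { length: 4 },
  () =>
    new ExampleServiceClient(target, grpc.credentials.createSsl(), {
      'grpc.use_local_subchannel_pool': 1,
    })
);

// Spread requests across the clients however you like; simple rotation shown.
let next = 0;
function pickClient() {
  const client = clients[next];
  next = (next + 1) % clients.length;
  return client;
}
```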

If you want to distribute requests among multiple servers running the same service, you can use the round robin load balancing config to have the client do that, as described in https://github.com/grpc/grpc-node/issues/1307#issuecomment-601312421.
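
For illustration, a sketch of that configuration based on the linked comment (the generated client class and target are placeholders; double-check the exact service config shape against the current grpc-js docs):

```ts
import * as grpc from '@grpc/grpc-js';
// Placeholder generated client, as above.
import { ExampleServiceClient } from './generated/example_grpc_pb';

// Ask the client to resolve every address behind the DNS name and rotate
// requests across the resulting connections.
const serviceConfig = {
  loadBalancingConfig: [{ round_robin: {} }],
};

const client = new ExampleServiceClient(
  'dns:///my-service.example.com:443', // the name must resolve to multiple addresses
  grpc.credentials.createSsl(),
  { 'grpc.service_config': JSON.stringify(serviceConfig) }
);
```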

hardboiled commented 3 years ago

I spent the last day debugging this, and here are my findings so far... For background, I have a Go-based gRPC server running in the cloud, and previously I was using a Node-based grpc client, also running in the cloud, to make requests to this server.

Upgrading to grpc-js from the legacy grpc resulted in a 4-5x performance degradation on my grpc invocations. Initially I suspected data serialization to be the issue due to our large response bodies, but from running various profilers, it seems like the http2 connections are sitting in async time much longer. I confirmed that data serialization wasn't the problem by running both the node client and the go server locally (without TLS, not sure if that will make a difference here), and each call dropped to a 100ms-ish time, which is dramatically lower than the typical call time.

Are there any connection settings that I should be looking into to improve the connection responsiveness?

Seems odd to have connections hanging so much longer than with the legacy package.

Just in case this info might be helpful, I ran my client with GRPC_TRACE=call_stream and GRPC_VERBOSITY=DEBUG in both a remote server and local server scenario, and the data frames received from the remote server are much smaller:

```
# remote
2021-06-16T10:43:11.965Z | call_stream | [0] receive HTTP/2 data frame of length 9
2021-06-16T10:43:11.965Z | call_stream | [0] receive HTTP/2 data frame of length 4087
# ...
# local
2021-06-16T10:45:56.470Z | call_stream | [0] receive HTTP/2 data frame of length 16384
2021-06-16T10:45:56.471Z | call_stream | [0] receive HTTP/2 data frame of length 16384
# ...
```

Not sure if this is a symptom of the behavior I mentioned or could help root cause some potential improvements. Thanks!
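
For anyone reproducing this, the same tracing can also be enabled from inside the client script; a minimal sketch, assuming grpc-js reads these variables when the module is first loaded:

```ts
// Equivalent to GRPC_TRACE=call_stream GRPC_VERBOSITY=DEBUG node client.js.
// grpc-js appears to read these variables when the module is loaded, so they
// are set before the (dynamic) import; trace output goes to stderr.
process.env.GRPC_TRACE = 'call_stream'; // or 'all' for everything
process.env.GRPC_VERBOSITY = 'DEBUG';

const grpc = await import('@grpc/grpc-js'); // top-level await: run as an ES module

// ...create the client and make calls as usual.
```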

murgatroid99 commented 3 years ago

Can you clarify what you mean by "it seems like the http2 connections are sitting in async time much longer"?

hardboiled commented 3 years ago

With grpc-js: (bubbleprof screenshot: node-14-16-2-w-grpc-js-library)

With legacy grpc library: (bubbleprof screenshot: node-14-16-2-w-grpc-legacy-library)

What I mean is that I wrote a program that makes ten gRPC calls from the Node grpc client to the grpc-go server endpoint, and I ran it with both the grpc-js library and the legacy grpc library.

I then used bubbleprof to profile the calls, and it's saying that the time waiting on asynchronous calls is ~4x when using the grpc-js library (I included this in the screenshots).

In terms of why I don't think serialization is the problem: the network calls in the screenshots above are against a remote grpc-go server running in AWS. When I boot this grpc-go server locally on my laptop and run the grpc-js client against it, the individual grpc calls only take 100ms+ to succeed instead of multiple seconds per call. Thus, the data serialization time involved in hydrating the message objects is likely insignificant relative to the actual call time.

Let me know if you need any additional clarification.

murgatroid99 commented 3 years ago

When it says "async operations were pending", is it talking about time in which no code was running and the Node process was simply sleeping waiting for something to happen, or is it talking about time in which code may or may not be running, but at least one async operation was also pending? Or something else?

In fact, if each request is taking multiple seconds, what makes you think 84 ms of async waiting time is a significant factor? And what is the latency for running the legacy grpc client against the local server? How does that compare to the "100ms+" with grpc-js? Do you still see the same "4-5x performance degradation" as in the other scenario?

That large difference in latency when talking to a local vs remote server makes me think there's something going on here other than just the raw performance of the two libraries. I think it would be helpful to see a full trace log of your main benchmark (grpc-js connecting to a remote server). You can get that by setting the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.

I do agree that serialization time is likely not the problem, primarily because the serialization code is probably essentially the same in the two tests.

hardboiled commented 3 years ago

"is it talking about time in which no code was running and the Node process was simply sleeping waiting for something to happen, or is it talking about time in which code may or may not be running, but at least one async operation was also pending? Or something else?" I believe it's "sleeping waiting for something to happen" but I can double-check Monday when i get into the office.

"In fact, if each request is taking multiple seconds, what makes you think 84 ms of async waiting time is a significant factor?"

It's 84,044ms if I'm reading the image I posted right, which is almost the entire runtime of the program. Are you referring to something else?

"And what is the latency is for running the legacy grpc client against the local server?" I haven't run the legacy grpc client yet because the results for the grpc-js were reasonable. I'm happy to run it Monday if you think that would be helpful.

"That large difference in latency when talking to a local vs remote server makes me think there's something going on here other than just the raw performance of the two libraries."

That's what I'm hoping as well, but I've tried to isolate the changes to just the two libraries, and I can't explain why the code seems to be idling for longer. I did compare the output in debug mode yesterday with the flags you suggested between grpc-js and grpc, but the logging info was pretty different between the two. I didn't see anything obvious other than that with grpc-js the HTTP/2 frames were 1/4 of the size when running against the remote server as opposed to locally. I can post the results of the differences between the legacy grpc client and the grpc-js client against the remote server Monday if you're interested.


murgatroid99 commented 3 years ago

I guess I misread the numbers in that image. The difference there might just come down to differences in how async operations are represented internally in the two libraries. I don't think you can conclude anything solid from it.

I don't want the differences between the logs. I want to see the grpc-js log. This is about latency; the most important thing is the relative timing of all of the different log lines.

hardboiled commented 3 years ago

OK, will get the grpc-js logs to you Monday then. Thanks for following up!


hardboiled commented 3 years ago

Here is the grpc-js log output.

murgatroid99 commented 3 years ago

I assume that is the grpc-js client connecting to the remote server. Can you also share an equivalent log with the grpc-js client connecting to a local server, for comparison?

hardboiled commented 3 years ago

Sure, here is the localhost output.

murgatroid99 commented 3 years ago

That's interesting. The client connecting to the remote server receives about 200 data frames per second, while the client connecting to the local server receives 170 data frames in about 0.07 seconds. That tells me that the bottleneck is the network communication, not any of the local processing.

In the first log, the way the 4KB chunks are further split into pairs or triples of smaller data frames feels indicative of something, but I'm not sure what it means yet.

I also have to mention that we recommend using the same client when making multiple requests, instead of creating a new one every time. It doesn't have much of an impact on the benchmark in this case because they're all using the same underlying connection and it looks like it's using cached DNS lookups.
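
As a minimal sketch of that pattern (the client class and method name are placeholders for whatever your proto tooling generates):

```ts
import * as grpc from '@grpc/grpc-js';
// Placeholder generated client class.
import { ExampleServiceClient } from './generated/example_grpc_pb';

// Create the client once, at module scope, and reuse it for every request.
// The underlying channel reconnects on its own when needed, so there is no
// reason to construct a new client per call.
const client = new ExampleServiceClient(
  'my-service.example.com:443',
  grpc.credentials.createSsl()
);

export function listThings(request: object): Promise<unknown> {
  return new Promise((resolve, reject) => {
    // listThings is a placeholder for whatever unary method you call.
    client.listThings(request, (err: grpc.ServiceError | null, response: unknown) => {
      if (err) {
        reject(err);
      } else {
        resolve(response);
      }
    });
  });
}
```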

hardboiled commented 3 years ago

Yeah, the chunk difference was the only thing I noticed between the logs as something that might be some sort of clue. Keep in mind that when running against localhost, I have to use credentials.createInsecure() since I don't have valid certs locally, but I figure this detail isn't actually affecting the local performance.

I tried to see if the chunking behavior was different with the legacy library, but I found those logs a bit cryptic, so I wasn't able to determine whether it's sending more bytes per frame. That being said, when comparing the call time between the two across hundreds of invocations, the difference is 4-5x in latency against the remote server, and this is why we had to roll back our deployment to the legacy library to avoid production issues.
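
For reference, the only client-side difference between the local and remote setups described above is the channel credentials; a minimal sketch of the two (the custom-CA variant is purely illustrative):

```ts
import * as grpc from '@grpc/grpc-js';

// Local testing: plaintext, no certificates involved at all.
const localCreds = grpc.credentials.createInsecure();

// Remote testing: TLS using the default trusted roots; a private CA (and,
// if needed, a client key/cert pair) can be passed in as Buffers instead.
const remoteCreds = grpc.credentials.createSsl();
// const remoteCreds = grpc.credentials.createSsl(fs.readFileSync('ca.pem'));
```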

hardboiled commented 3 years ago

```proto
message WeK8sGenericService {
  string cluster = 1;
  string namespace = 2;
  string name = 3;
  map<string, string> annotations = 4;
  repeated string hosts = 5;
}

message ListWeK8sGenericServicesResponse {
  repeated WeK8sGenericService services = 1;
}

service Cluster {
  rpc ListWeK8sGenericServices(google.protobuf.Empty) returns (ListWeK8sGenericServicesResponse) {}
}
```

If it helps, this is the message definition.
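
For reference, a rough sketch of how this service could be loaded and the call timed with @grpc/proto-loader (the file path, server address, and the google/protobuf/empty.proto import are assumptions about the full .proto file, which the snippet above omits):

```ts
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// 'cluster.proto' is an assumed file name containing the definition above,
// plus the import of google/protobuf/empty.proto that the snippet omits.
const packageDefinition = protoLoader.loadSync('cluster.proto', {
  keepCase: true,
  longs: String,
  enums: String,
  defaults: true,
  oneofs: true,
});
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Assumed server address; swap in createInsecure() for the local test.
const client = new proto.Cluster(
  'my-server.example.com:443',
  grpc.credentials.createSsl()
);

// Time a single ListWeK8sGenericServices call, roughly mirroring the benchmark.
const start = process.hrtime.bigint();
client.ListWeK8sGenericServices({}, (err: Error | null, response: any) => {
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  if (err) {
    console.error('call failed:', err);
  } else {
    console.log(`received ${response.services.length} services in ${elapsedMs.toFixed(1)} ms`);
  }
});
```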

murgatroid99 commented 3 years ago

I think what we're seeing there is actually related to TLS, at least partially. When I run my own local large message test with TLS enabled, I see the same pattern of 9 bytes being separated from each chunk into separate chunks. It's hard to tell if that's actually the cause of the performance issues, or if it's incidental.

hardboiled commented 3 years ago

Update on our current solution: if you reference my protobuf definitions above, we were adding k8s object annotations (map<string, string> annotations) into the repeated WeK8sGenericService services. We included these for developer-velocity purposes, so that lower-level services wouldn't need extra protobuf field definitions every time our higher-level graphql server wanted something from the annotations.

When I removed this field and replaced it with one-to-one field definitions for what we actually were pulling, the call time dropped dramatically on local testing to around 100-200ms (Note that all other protobuf calls we benchmarked for grpc-js against grpc legacy were comparable, so we were only concerned about this endpoint).

Notes: Why did this help? Normally I would think that the amount of information being returned in the annotations block was simply dwarfing everything else. However, when I wrote the results out to a file, the file size didn't seem proportional to the performance gained by excluding the field. I'd have to run more tests to be sure, though.

I also tested just using a raw string instead of a map type, and that didn't improve performance at all.

So I'm not sure exactly why this was causing performance problems for my use case. I would also say that legacy grpc was affected as well: it was slow without these changes, just not as slow as our grpc-js setup.

Thus, for now, we're happy with just reducing our message size with these changes and leaving it there, even though it's a bit of an inconvenience to have to update field definitions in multiple places every time we want to expose something from annotations to our graphql server.

If you want me to investigate further, I could pull some hard metrics comparing with and without the annotations block for grpc legacy and grpc-js, but for now, I'll regard this issue as resolved. Thanks for your help!

murgatroid99 commented 3 years ago

I am not surprised that changing the encoding didn't help. The chunking behavior we saw seems more indicative of a problem with how the serialized messages are sent over the wire than a problem with how the messages are serialized. Hopefully that means that we can reproduce the problem by sending any large messages over the network, instead of relying on your specific protocol.

bobaaaaa commented 2 years ago

I guess we ran into the same problem. We downgraded to the (deprecated) grpc package. That helps, but it's unsatisfying. I can also provide trace logs if that would help.

edvardchen commented 2 years ago

grpc-js uses the built-in http2 module, while the native grpc package uses its own transport. That means grpc-js can leverage multiple CPU cores in Node.js cluster mode, but node-grpc can't.

Farenheith commented 10 months ago

I also had performance-related problems when I tried to upgrade from grpc to grpc-js (server-side), but what I saw was the application almost doubling the number of tasks when using the new library. I just downgraded to grpc and the number of tasks went back to the previous level; both tests were on Node 14. The autoscaling here is configured by CPU usage, so the conclusion is that grpc-js demands much more CPU than grpc. Could it be that this would not be the case if I also upgraded to the latest Node version?

kirillgroshkov commented 10 months ago

"almost doubling the number of tasks when using the new library"

Just to add to that: the speed of streaming the query from Datastore to Node halved with the update from grpc to grpc-js (e.g. from 2000 rows per second down to 1000 rows per second). Something to keep in mind.