grpc / grpc-node

gRPC for Node.js
https://grpc.io
Apache License 2.0

When I perform performance testing, the grpc client receives Call canceled. Why is this? #2641

Open lwmxiaobei opened 9 months ago

lwmxiaobei commented 9 months ago

Problem description

When I perform performance testing, the gRPC client receives "Call cancelled" errors. I also found that the memory of my online client service never comes back down. This service makes requests to many gRPC services. I grabbed a heap memory snapshot and found many Timeout objects, which seems to be caused by connections not being released. Looking at grpc-js, I found the client channel option grpc.client_idle_timeout_ms, so I set it to 30s and ran a stress test; the memory then slowly decreased after it grew, maybe because the connections were released. However, during the stress test, 98 out of 154,000 total requests failed with "Call cancelled". Is this because the client channel changed to the IDLE state and the clients threw "Call cancelled"? If not, please explain. Thank you.
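
For reference, a minimal sketch of how a channel option like grpc.client_idle_timeout_ms can be passed when constructing a grpc-js client. The proto file, package path, and endpoint are placeholders based on the code shared later in this thread:

const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Hypothetical proto path; adjust to the real service definition.
const packageDefinition = protoLoader.loadSync('order.proto', { keepCase: true });
const orderPackage = grpc.loadPackageDefinition(packageDefinition).gd.pts.order;

const client = new orderPackage.Order('localhost:50051', grpc.credentials.createInsecure(), {
    // Let the channel go IDLE and release its connection after 30s without active calls.
    'grpc.client_idle_timeout_ms': 30000,
});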

Before pressure test:

image

After pressure test:

image

Reproduction steps

Environment

Additional context

lwmxiaobei commented 9 months ago

Timeout object:

image
murgatroid99 commented 9 months ago

First, the cancelled calls are almost certainly unrelated to the idle timeout setting, or to the large number of timers you are seeing. The cancelled calls may be caused by the server getting overloaded and shedding load by cancelling requests. Based on the numbers in your report, this is causing errors at a rate of less than 0.1%, so I am not concerned.

Second, there are a couple of possible causes for the accumulation of timers that you are seeing. Your "Timeout object" image shows a keepalive timer. There is only one keepalive timer per connection, so if that object is representative of all of them, that likely means that you are creating a lot of connections. If you are creating a separate client object for each request, that could cause that behavior. It would be better to create a single client and use it to make every request.

On the other hand, if that keepalive timer is not representative, and most of the timers are idle timers, then you may instead be encountering a known inefficiency in how idle timers are handled. Specifically, if your test always makes a single request, waits for it to complete, and then makes another request, then that will cause a lot of timers to be cancelled.
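
To illustrate the single-client pattern, a rough sketch (the routeguide package here is the standard gRPC example service and is assumed to already be loaded via proto-loader, which is not shown):

const grpc = require('@grpc/grpc-js');

// Assumed: routeguide was obtained once from loadPackageDefinition.
const client = new routeguide.RouteGuide('localhost:50051', grpc.credentials.createInsecure());

// Reuse the same client object (and therefore the same channel) for every request.
function getFeature(point) {
    return new Promise((resolve, reject) => {
        client.getFeature(point, (err, response) => {
            if (err) reject(err);
            else resolve(response);
        });
    });
}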

lwmxiaobei commented 9 months ago

I also found a lot of objects that are all getInfo closures, as shown in the picture. These objects occupy nearly 50% of my heap memory. I don't know why they still exist in the heap after the requests have finished. Can you help me? Thanks.

image
murgatroid99 commented 9 months ago

This could also be caused by creating many clients. Again, if you are doing that, you should instead create a single client and use it to make all requests.

lwmxiaobei commented 9 months ago

Thank you. I would like to confirm whether creating a client with new essentially means establishing a connection:

var client = new routeguide.RouteGuide('localhost:50051', grpc.credentials.createInsecure());

For instance, in my scenario, the client needs to connect to a dozen or even several tens of services. Could this cause any issues? Are there best-practice solutions, or perhaps code examples, for creating clients for multiple services? These findings do suggest that a large number of clients might be getting created, so your response is greatly appreciated.

murgatroid99 commented 9 months ago

There is nothing wrong with creating a separate client object for each service you want to connect to. The problem is creating a separate client object for each request to the same service. I was only guessing that you were doing that, based on the information you provided.
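
For example, a rough sketch of keeping one long-lived client per downstream service (the cache and its keys are hypothetical, not a grpc-js API):

const grpc = require('@grpc/grpc-js');

// One cached client per target service, created on first use and reused afterwards.
const clients = new Map();

function getClient(name, ServiceConstructor, endpoint) {
    if (!clients.has(name)) {
        clients.set(name, new ServiceConstructor(endpoint, grpc.credentials.createInsecure()));
    }
    return clients.get(name);
}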

If you share your test code, we will have a better chance of finding the causes of the things you are observing.

lwmxiaobei commented 9 months ago

Thanks! I'm very willing to share my test code. This is the code that builds the client: it obtains the configuration, loads and parses the proto files, and finally builds the client:

const protoLoader = require('@grpc/proto-loader');
const grpcCore = require('@grpc/grpc-js');
const path = require('path');
const fs = require('fs');
module.exports = function loadGrpcClient(app) {
    app.addSingleton('grpcClient', async (config, app) => {
        config.default = app.config.grpcClient;
        return await createClient(config, app);
    });
}

async function createClient(config, app) {
    const service = {};
    await getAllServices(path.join(app.baseDir, config.proto), config, service);
    return service;
}

async function getAllServices(protoDirPath, config, service) {
    if (!fs.existsSync(protoDirPath)) {
        throw new Error('no proto file');
    }
    const protoFileList = fs.readdirSync(protoDirPath);
    for (const protoName of protoFileList) {
        const protoFilePath = path.join(protoDirPath, protoName);
        const stats = fs.statSync(protoFilePath);
        if (stats.isFile() && path.extname(protoName) === '.proto') {
            const protoObj = await protoLoader.load(protoFilePath, {
                keepCase: true,
                longs: String,
                enums: String,
                defaults: true,
                oneofs: true
            });
            const packageDefinition = grpcCore.loadPackageDefinition(protoObj);
            loadClients(packageDefinition, service, config);
        }
    }
}

function loadClients(protoDescriptor, service, config) {
    if (!protoDescriptor)
        return;

    if (protoDescriptor.fileDescriptorProtos) {
        return;
    }

    for (const key in protoDescriptor) {
        if (protoDescriptor[key].name === 'ServiceClientImpl') {
            buildClient(protoDescriptor[key].service, protoDescriptor[key], service, config);
        } else {
            loadClients(protoDescriptor[key], service, config);
        }
    }
}

function buildClient(serviceDefinition, ServiceClient, service, config) {
    // One client per service; each RPC method is exposed as a promise-returning function.
    const client = new ServiceClient(config.endpoint, grpcCore.credentials.createInsecure());
    for (const method in serviceDefinition) {
        service[method] = async (params) => {
            return await new Promise((resolve, reject) => {
                client.waitForReady(Date.now() + config.deadline, (error) => {
                    if (error) {
                        reject(error);
                    } else {
                        client[method](params, (e, res) => {
                            if (e) {
                                reject(e);
                            } else {
                                resolve(res);
                            }
                        });
                    }
                });
            });
        };
    }
}

proto file:

syntax = "proto3";

option java_multiple_files = true;
option java_package = "io.grpc.service.order";
option objc_class_prefix ="RTG";

package gd.pts.order;

import "google/protobuf/timestamp.proto";

service Order {
  rpc getDetailByCode(GetDetailByCodeRequest) returns (OrderDetail) {};
}
message GetDetailByCodeRequest {
    string code = 1;
}

message OrderDetail {
    string order_no = 2;
    uint64 user_id = 3;
    uint32 business_id = 6;
}

And then the method is called like this:

const res = await this.app.grpcClient.getDetailByCode({code: 'xxxxx'});

The stress test used a concurrency of 50, for a total of about 117,000 requests in 5 minutes.

Some time after the stress test, I performed a heap memory analysis and found that these objects occupied most of the heap memory. At that point my service was receiving no requests, but these objects were still not garbage collected.

image image
lwmxiaobei commented 9 months ago

In addition, there are no grpc options set here.

murgatroid99 commented 9 months ago

Those objects you see are socket channelz reference objects. There is one for each active connection. And with the way you construct your clients, they should be able to share connections, so there should only be one object in there. But the sizes seem to indicate that there are 2000 of them, and I don't see why that would happen from the code you shared.

Can you share trace logs from running your test? You can get them by running the client with the environment variables GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG.
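
For example, assuming the client entry point is client.js, something like the following captures the trace output (grpc-js writes these logs to stderr):

GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node client.js 2> grpc-trace.log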

lwmxiaobei commented 9 months ago

This is the log when the service starts: logs.txt

This is the log from the stress test request: logs (2).txt

I hope this is useful to you

murgatroid99 commented 9 months ago

First, those logs contain interspersed logs from two different processes, specifically with PIDs 62 and 69, which makes it a little harder to understand.

The two logs are snapshots separated by an hour, so I can only guess what happened in between, but the object ID numbers in the logs seem to indicate that both processes reconnected to the server several hundred times in that hour. This is very unusual behavior, so I would recommend double checking your server configuration to see why it might be causing that.

I found a code path that could occasionally leak those handles, but I don't think it accounts for 2000 of them.

lwmxiaobei commented 9 months ago

Because my service runs 2 processes, you will see logs from both processes. This log is from a period of time during the stress test, not all of the logs. How can you tell that they reconnected hundreds of times? Does it mean the server disconnected, causing the client to reconnect? And if the client reconnects, would that not leak the connection handles? My understanding is that creating a lot of connections leads to leaks. I don't know if that is right; if I am wrong, please correct me. Thank you very much.

My server config options:

 'grpc.keepalive_time_ms': 2000,
 'grpc.keepalive_timeout_ms': 1000,
 'grpc.keepalive_permit_without_calls': 1,
 'grpc.use_local_subchannel_pool': 1,

Also, which code path were you talking about that may leak?

murgatroid99 commented 9 months ago

Because my service runs 2 processes, you will see logs from both processes.

You could enable logging on only one of them, or put each log into a different file.

How can you tell that they reconnected hundreds of times?

Each channel, subchannel, and transport has an ID incrementing from 0 from the beginning of the process. Reconnecting involves creating new transports, and sometimes new subchannels. The logs from after an hour show transports with an ID of 530 in one process and 569 in the other process, indicating that new IDs have been generated hundreds of times, corresponding to hundreds of connections, or at least hundreds of connection attempts.

Does it mean the server disconnected, causing the client to reconnect?

The logs you provided included lines that say "connection closed by GOAWAY with code 0 and data undefined", which indicates that the server deliberately closed the connection. In fact, the two processes together saw that happen 4 times in the 1 minute that log spans.

And if the client reconnects, would that not leak the connection handles? My understanding is that creating a lot of connections leads to leaks.

The client does all of the relevant connection management. It cleans up connections that are no longer in use, so in general this behavior should not cause any leaks.

My server config options:

'grpc.keepalive_time_ms': 2000,
'grpc.keepalive_timeout_ms': 1000,
'grpc.keepalive_permit_without_calls': 1,
'grpc.use_local_subchannel_pool': 1,

These options have the server send a ping to each connected client every 2 seconds, and it drops the connection if it doesn't get a response within 1 second. That is very aggressive, and it may result in the server frequently closing connections to clients that are just a little slow to respond. The recommended value of the keepalive time is 1 minute (60000 ms), and the default value of the keepalive timeout is 20 seconds (20000 ms). Also, the third and fourth options there have no effect on the server.
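
For example, a sketch of server options along those recommended lines (the values are the ones suggested above):

const grpc = require('@grpc/grpc-js');

const server = new grpc.Server({
    // Ping each connected client once a minute instead of every 2 seconds.
    'grpc.keepalive_time_ms': 60000,
    // Allow 20 seconds for the ping response before dropping the connection.
    'grpc.keepalive_timeout_ms': 20000,
});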

Also, which code path were you talking about that may leak?

There is a race: if a subchannel's state changes while it is in the process of establishing a connection, the handle for that connection will be leaked. But this should be a rare event.

murgatroid99 commented 9 months ago

I fixed the bug I mentioned in version 1.9.14, so I recommend using that version for future testing.

lwmxiaobei commented 9 months ago

Thank you very much for your answer!

eegiazarov commented 8 months ago

Hi, did the fix help with the memory leak? I'm experiencing a similar problem.

murgatroid99 commented 8 months ago

@eegiazarov There are a lot of details in this issue that may not be relevant to your situation. I recommend opening a new issue with the details of the problem you are experiencing.

eegiazarov commented 8 months ago

Thanks for your reply. The only thing I can say for sure at the moment is that we're making thousands of connections in rapid succession on different channels, destroying each channel (and all the structures that come with it) immediately after the exchange is made, and the library still seems to be leaking memory. There is a chance, though, that we're experiencing a protocol mismatch between different platforms. On top of that, we have several unrelated cases of weird behavior... so I'm not quite ready to file an issue. There are too many unknowns yet.

dmitriadn commented 8 months ago

Not sure if this is directly related to the issue, but we are also seeing a percentage of cancelled gRPC calls (connection closed by GOAWAY). We are using grpc.max_connection_age_ms set to 60000 to trigger a DNS resolver refresh on clients. The lower the max_connection_age_ms value, the higher the percentage of cancelled errors. We tried version 1.10.1, and the issue is still there.

Is this expected behavior? It would not be a problem, except that the server actually processes the gRPC call while returning a "call cancelled" error.

murgatroid99 commented 8 months ago

The grpc.max_connection_age_ms option is generally paired with the grpc.max_connection_age_grace_ms option. If you set that second option, it limits the maximum amount of time that a preexisting stream can continue after the max connection age timer triggers, and after that streams will be cancelled. If you commonly have streams that run longer than the grpc.max_connection_age_grace_ms setting, then you should expect some cancellations. If you do not commonly have streams run that long, or you do not set that option at all, there may be a bug in the grpc.max_connection_age_ms code.
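
For example, a hypothetical pairing of the two server options (the values here are only illustrative):

const grpc = require('@grpc/grpc-js');

const server = new grpc.Server({
    // Ask clients to stop using each connection one minute after it is established.
    'grpc.max_connection_age_ms': 60000,
    // Give streams that are still open up to 30 more seconds before cancelling them.
    'grpc.max_connection_age_grace_ms': 30000,
});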

dmitriadn commented 8 months ago

We don't have the max_connection_age_grace_ms option set and only use unary calls that take < 250ms.

Will try max_connection_age_grace_ms to see if it helps.

murgatroid99 commented 6 months ago

I recommend trying again with grpc-js version 1.10.4, in case your issue is related to #2625.

wang-xiaowu commented 5 months ago

Same situation here: too many "CANCELLED: Call cancelled" errors (in 1.10.8), and this is my configuration

murgatroid99 commented 5 months ago

The timeout: '1s' configuration is probably not helping. That's forcing every call to end after 1 second, which will make the retry policy almost never apply.

wang-xiaowu commented 5 months ago

Actually, I just want it to retry on ['UNAVAILABLE', 'CANCELLED'] when the server sends a GOAWAY to the client because of max_connection_age_ms.

wang-xiaowu commented 5 months ago

The timeout configuration is just to make sure calls can be forcibly ended.

wang-xiaowu commented 5 months ago

This is the client's debug log. I tried decreasing the max connection age params:

  // 'grpc.max_connection_age_ms': 1_000,
  // 'grpc.max_connection_age_grace_ms': 1_000,

At 08:28:48.220 the connection was closed by GOAWAY, and at 08:28:48.223 the client reconnected successfully. If the concurrency is high, I guess some connections must get lost in that window, right?

D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | transport | (186) 127.0.0.1:34049 connection closed by GOAWAY with code 0 and data max_age
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | subchannel | (1) 127.0.0.1:34049 READY -> IDLE
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | subchannel_refcount | (1) 127.0.0.1:34049 refcount 2 -> 1
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | pick_first | READY -> IDLE
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | round_robin | READY -> TRANSIENT_FAILURE
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | resolving_load_balancer | dns:localhost:34049 READY -> TRANSIENT_FAILURE
D 2024-05-22T08:28:48.220Z | v1.10.7 17820 | connectivity_state | (1) dns:localhost:34049 READY -> TRANSIENT_FAILURE
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | subchannel_refcount | (1) 127.0.0.1:34049 refcount 1 -> 2
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | pick_first | Start connecting to subchannel with address 127.0.0.1:34049
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | pick_first | IDLE -> CONNECTING
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | round_robin | TRANSIENT_FAILURE -> CONNECTING
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | resolving_load_balancer | dns:localhost:34049 TRANSIENT_FAILURE -> CONNECTING
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | connectivity_state | (1) dns:localhost:34049 TRANSIENT_FAILURE -> CONNECTING
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | dns_resolver | resolution update delayed by "min time between resolutions" rate limit
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | subchannel | (1) 127.0.0.1:34049 IDLE -> CONNECTING
D 2024-05-22T08:28:48.221Z | v1.10.7 17820 | transport | dns:localhost:34049 creating HTTP/2 session to 127.0.0.1:34049
D 2024-05-22T08:28:48.222Z | v1.10.7 17820 | transport | (186) 127.0.0.1:34049 session closed
D 2024-05-22T08:28:48.222Z | v1.10.7 17820 | subchannel | (1) 127.0.0.1:34049 CONNECTING -> READY
D 2024-05-22T08:28:48.222Z | v1.10.7 17820 | pick_first | Pick subchannel with address 127.0.0.1:34049
D 2024-05-22T08:28:48.222Z | v1.10.7 17820 | subchannel_refcount | (1) 127.0.0.1:34049 refcount 2 -> 3
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | subchannel_refcount | (1) 127.0.0.1:34049 refcount 3 -> 2
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | pick_first | CONNECTING -> READY
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | round_robin | CONNECTING -> READY
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | resolving_load_balancer | dns:localhost:34049 CONNECTING -> READY
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | connectivity_state | (1) dns:localhost:34049 CONNECTING -> READY
D 2024-05-22T08:28:48.223Z | v1.10.7 17820 | transport | (187) 127.0.0.1:34049 local settings acknowledged by remote: {"headerTableSize":4096,"enablePush":true,"initialWindowSize":65535,"maxFrameSize":16384,"maxConcurrentStreams":4294967295,"maxHeaderListSize":4294967295,"maxHeaderSize":4294967295,"enableConnectProtocol":false}
D 2024-05-22T08:28:49.324Z | v1.10.7 17820 | transport | (187) 127.0.0.1:34049 connection closed by GOAWAY with code 0 and data max_age
murgatroid99 commented 5 months ago

That setting makes the server drop the connection after a second. The client log shows that the connection is dropped after a second. The behavior is as expected.

To be more specific about the timeout and retry configuration: the "1s" timeout means that the overall request will end after 1 second. Your retry configuration sets a minimum time between attempts of "0.5s". This means that if you start a request and it fails immediately, the next attempt will start after 0.5 seconds. Then if that attempt fails, the next attempt will start more than 0.5 seconds after that. However, by that time 1 second has passed total, so the entire call ends with a deadline exceeded error. With that configuration, it can never make more than 2 attempts before timing out, making the maxAttempts: 5 setting meaningless. If you want to be able to make 5 attempts, you need to set a timeout long enough to allow for 5 attempts to occur.
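
As an illustration of that arithmetic, a service config sized to leave room for several attempts might look roughly like this (the service name and numbers are placeholders, not the configuration from this report):

// Hypothetical service config; the name and values are for illustration only.
const serviceConfig = {
    methodConfig: [{
        name: [{ service: 'gd.pts.order.Order' }],
        // An overall deadline long enough to fit up to 5 attempts plus backoff between them.
        timeout: '10s',
        retryPolicy: {
            maxAttempts: 5,
            initialBackoff: '0.5s',
            maxBackoff: '2s',
            backoffMultiplier: 2,
            retryableStatusCodes: ['UNAVAILABLE', 'CANCELLED'],
        },
    }],
};

// Typically passed to the client as a JSON string via the 'grpc.service_config' channel option:
// new ServiceClient(endpoint, credentials, { 'grpc.service_config': JSON.stringify(serviceConfig) });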

wang-xiaowu commented 5 months ago

Thanks for your explanation of the retry policy configuration; I will figure it out. As for the connection-drop behavior, it actually isn't what I expected. I'm using this in k8s, and I want to make sure the client becomes aware of new server replicas; that's why I set the max_connection_age_* options. So is my configuration wrong, or is there a better option?

wang-xiaowu commented 5 months ago

I misunderstood you. So you mean the interval between the connection dropping and the client reconnecting is as expected, and all I can do is make the retry configuration more reasonable?

murgatroid99 commented 5 months ago

OK, I'm going to go over some things to make sure we're on the same page: gRPC establishes connections, and then generally makes multiple requests using that single connection. The grpc.max_connection_age_ms option instructs the server to tell the client to stop using each connection that much time (in milliseconds) after it is established. Setting this to a small number causes a lot of connection churn. The grpc.max_connection_age_grace_ms option instructs the server to cancel every request that is still open that much time (in milliseconds) after the connection is dropped. Setting this to a smaller number than requests commonly take can result in a lot of cancellations.

The retryPolicy and timeout configurations in the service config have nothing to do with connection management. The retryPolicy controls how a request is retried, and the timeout policy makes every request fail after that much time.

The client does reconnect immediately after a connection is dropped. I don't see why that would be a problem.

wang-xiaowu commented 5 months ago

The stress test report shows a high error rate, and the client reports too many cancel errors. As you said, I configured max_connection_age_ms: 1_000 and max_connection_age_grace_ms: 2_000 on the server, with no timeout, and the server method only runs a procedure that takes 10-15ms (less than max_connection_age_grace_ms).

From the debug log I showed, there are indeed about 3 milliseconds (sometimes more) between the connection dropping and reconnecting, so I thought maybe some connections are lost in that window. With the retry policy, I hoped it would retry up to maxAttempts times after encountering the CANCELLED error; from the debug log it did, but not entirely correctly.

Here is the server and client code; I would appreciate it if you could take a look. The client test URL is curl --location --request POST 'http://localhost:33013/api/test', which will make the client call the server.

client.zip server.zip