Open manuraj17 opened 2 months ago
@manuraj17 I was also able to repro locally with localhost. It looks like when using localhost with grpc.NewClient, address resolution takes more than 5s, which is why we get Deadline exceeded in the example. If you try with a 10s timeout, the RPC will succeed.
Will discuss with team and update on what is the best course of action here. Thanks once again for catching this.
@purnesh42H Thanks for checking on this. Will wait for the update 👍🏽
Client logs with 1 second timeout when using localhost with grpc.NewClient:
2024/07/23 21:08:08 INFO: [core] original dial target is: "localhost:50051"
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel created
2024/07/23 21:08:08 INFO: [core] [Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/localhost:50051", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel authority set to "localhost:50051"
2024/07/23 21:08:08 INFO: [core] [Channel #1]Channel exiting idle mode
2024/07/23 21:08:13 could not greet: rpc error: code = DeadlineExceeded desc = context deadline exceeded
exit status 1
@purnesh42H Could you share what changed between being able to reproduce it earlier and now? Just curious about what happened earlier and what is happening now.
@manuraj17 the problem seems to be when using grpc.NewClient. The address resolution is taking longer when using localhost, and that's why you experience the timeout. If you try the same example with a 10s timeout, the RPC will succeed. We are still debugging the root cause.
this call is blocked for longer when using localhost
https://github.com/grpc/grpc-go/blob/2bcbcab9fbeec8a475631e8f06bd0cb45eb92dc8/stream.go#L212
@manuraj17 I am not able to repro this anymore, and localhost seems to work with the existing example. For now, we will keep this open but won't investigate further since it's not a widespread issue. Feel free to add any new information here and we can re-evaluate the priority.
Are you still able to reproduce this? Can you run with full debug logging enabled and include the logs?
@dfawley Will check and update; Though any idea on what is happening in this scenario? @purnesh42H was able to identify and figure it's a resolution issue. Curious as to what information you folks have regarding this.
@dfawley Will check and update; Though any idea on what is happening in this scenario?
No idea. If your machine can't resolve localhost, or properly connect to an address it returns, then that sounds like a configuration issue to me.
No idea. If your machine can't resolve localhost, or properly connect to an address it returns, then that sounds like a configuration issue to me.
If it's an issue with my machine, wouldn't it also fail for the Dial method? Not sure what you are getting at here, because it still doesn't clarify how @purnesh42H was able to reproduce it for a while; did he have some configuration issue?
@manuraj17 can you pull the latest and then try running with grpc.NewClient and localhost with logging turned on (https://github.com/grpc/grpc-go#how-to-turn-on-logging)? Since this is not always reproducible, we need more information to verify whether it's a library or configuration issue.
To clarify, even for me the resolution was just taking a bit longer with localhost, but that's not happening anymore.
This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.
@purnesh42H
can you pull the latest
Are you suggesting the latest from the grpc main branch?
Are you suggesting the latest from the grpc main branch?
Yes
We ran into this issue running code using the latest grpc-go release as of 8 August 2024. While I unfortunately cannot provide any logs or code, I can say we only started seeing the issue when switching from using Dial to NewClient, as previously commented, and only when all nameservers listed in /etc/resolv.conf were unreachable. (Figuring this part out took a while!)
Similar to the reporter, changing from localhost to 127.0.0.1 for the server and client also avoided the error, even with all invalid/unreachable nameservers in /etc/resolv.conf.
(Note: while the nameservers in /etc/resolv.conf were invalid, we could still e.g. ping localhost and get a response; it was only grpc-go that seemed to suffer any issues (beyond what you'd naturally expect when your DNS config is awry, anyway).)
(Also, localhost was listed in /etc/hosts similar to the reporter.)
Apologies for the double post, forgot the @purnesh42H and I'm not sure editing mentions into comments actually works lol
Update: curiously, the problem seems to stop manifesting immediately when the machine is physically disconnected from all networks. Perhaps somewhere there's code that's causing this DNS record lookup to get skipped if all network interfaces are down, allowing the localhost entry in /etc/hosts to take over and the health check to proceed?
In the new grpc.NewClient, the default name resolution scheme is dns, as opposed to passthrough in the deprecated grpc.Dial. Try NewClient with passthrough as the resolver and it should behave the same.
This type of issue with dns resolution is more likely related to system configuration rather than a problem with the gRPC library itself as it relies on the underlying system for DNS resolution. If the system takes time to resolve localhost, this delay will be reflected in gRPC client behavior as well.
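As a rough illustration of the suggestion above, the resolver is selected by prefixing the target string with a scheme. The sketch below uses only the standard library to build such targets; `withScheme` is a hypothetical helper for this example, and the commented call shows where the result would go, assuming the usual grpc-go import and dial options.

```go
package main

import (
	"fmt"
	"strings"
)

// withScheme (hypothetical helper) prefixes addr with a gRPC resolver
// scheme using the "scheme:///endpoint" target form. A target that
// already carries a scheme is returned unchanged.
func withScheme(scheme, addr string) string {
	if strings.Contains(addr, "://") {
		return addr
	}
	return scheme + ":///" + addr
}

func main() {
	// Target that bypasses the dns resolver and hands the address
	// straight to the dialer, as grpc.Dial did by default.
	fmt.Println(withScheme("passthrough", "localhost:50051"))

	// Explicit form of what NewClient uses when no scheme is given.
	fmt.Println(withScheme("dns", "localhost:50051"))

	// With grpc-go imported, the target would be used like this
	// (opts stands for whatever dial options the client already passes):
	//   conn, err := grpc.NewClient(withScheme("passthrough", "localhost:50051"), opts...)
}
```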
We are seeing the same issue, and indeed using passthrough works. But even with passthrough, net.Dial will eventually perform the DNS resolution, and this seemingly works just fine, so I think there is an open question about what the grpc-go DNS resolver does differently to provoke this issue (both use net.Resolver eventually).
I'm not sure it should just be written off to system misconfiguration. Even if that was a factor, the people encountering this seem to have no other issues with DNS resolution (we don't) and are slowed down while trying to switch to NewClient with the recommended default resolution scheme. It'd be great to at least find a more satisfying explanation.
I faced this issue when I was testing out a simple application; I am not able to put a finger on what exactly I am missing.
My client code is returning with this error
I have my server running as
Client connecting to localhost as
This code fails.
But when I connect with either of
It works. I am trying to understand what I am missing here. Can anyone help me understand? TIA!
Full application code available here
UPDATE (More Info):
Some more information Version
I am able to connect if I am using 127.0.0.0 and [::1]
The issue specifically seems to occur when localhost is used.
/etc/hosts is below
I am running the server at add = :50051 and initialising like
It works with Dial
When I use Dial it works
I have tried all variations to connect without using Dial, and everything except localhost works. The resolution of localhost also seems to be working correctly. I get the following logs
NOTE: IPv6 gets resolved first.
I am not able to figure out what failed here, though. I will end up not using localhost for now, but I am curious about what is happening here.