grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

Getting feedback on errors while using waitForReady or enableRetry #6569

Open emaayan opened 4 years ago

emaayan commented 4 years ago

Is your feature request related to a problem?

Yes. Using connection retry and/or RPC retry blocks all feedback from reaching the user.

Describe the solution you'd like

I'd like an onError callback on the context/stream/channel that I could use to report back to the user. I'm aware that by the time the error is reported a race condition may make it no longer valid, but the user should still have the option to decide whether to respond to it or do something with it.

Since gRPC's logging is based on java.util.logging, it isn't integrated with the application's standard logging framework out of the box, so even the errors themselves aren't always visible at all.
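
For what it's worth, gRPC's java.util.logging output can usually be redirected into the application's logging framework with a JUL bridge; a minimal sketch, assuming the application logs through SLF4J and has the org.slf4j:jul-to-slf4j artifact on the classpath:

    import org.slf4j.bridge.SLF4JBridgeHandler;

    public final class GrpcLoggingBridge {
        public static void install() {
            // Route java.util.logging records (grpc-java logs through JUL) to SLF4J,
            // so gRPC's own warnings and errors land in the application's log output.
            SLF4JBridgeHandler.removeHandlersForRootLogger();
            SLF4JBridgeHandler.install();
        }
    }

This only makes gRPC's existing log records visible; it doesn't add any new error signal.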

This is problematic in situations where service failures are reported to a main dashboard or an SLA team: the retry logic should keep executing, while a reporting mechanism should be in place to send real-time feedback on whether the failures have stopped happening.

Describe alternatives you've considered

Attempting to use getState(false) or notifyStateChanged proved unreliable: notifyStateChanged doesn't tell you why the state change happened, and getState is too transient to be polled periodically.
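
For reference, a minimal sketch of that approach against the public ManagedChannel API (the actual method name is notifyWhenStateChanged); the notification is one-shot and carries no cause, which is the limitation described above:

    import io.grpc.ConnectivityState;
    import io.grpc.ManagedChannel;

    final class ChannelStateWatcher {
        static void watch(ManagedChannel channel) {
            ConnectivityState current = channel.getState(false); // false = don't request a connection
            channel.notifyWhenStateChanged(current, () -> {
                // Only the new state is visible here; there is no Status or Throwable
                // explaining why the channel left the previous state.
                System.out.println("channel state is now " + channel.getState(false));
                watch(channel); // the callback fires at most once, so re-register
            });
        }
    }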

Attempting to subclass Netty proved too fragile and will break in future versions:

NettyChannelBuilder.forAddress("", 0)
        .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, 50)
        .channelFactory(new ChannelFactory<io.netty.channel.Channel>() {
            @Override
            public io.netty.channel.Channel newChannel() {
                final NioSocketChannel nioSocketChannel = new NioSocketChannel() {
                    @Override
                    public ChannelFuture connect(SocketAddress remoteAddress) {
                        final ChannelFuture connect = super.connect(remoteAddress);
                        // Report the underlying cause of every failed connection attempt.
                        connect.addListener((GenericFutureListener<ChannelFuture>) channelFuture -> {
                            final Throwable cause = channelFuture.cause();
                            if (cause != null) {
                                cause.printStackTrace();
                            }
                        });
                        return connect;
                    }
                };
                return nioSocketChannel;
            }
        })
ericgribkoff commented 4 years ago

Can you elaborate on why getState and notifyStateChanged are unsuited for your use case, and what makes them unreliable versus the proposed error callback? Your proposed error callback also has the race condition, as you describe, in that the error given may no longer be valid when you actually receive it, so I don't see why this would be useful beyond our existing API.

I can believe that we don't have an explicit signal from wait-for-ready RPCs that lets you distinguish network errors, but I'm also not clear on why you need that or what decision you would make based on this - you can retrieve that information out-of-band through the existing getState and notifyStateChanged methods.

emaayan commented 4 years ago

The onError callback on the StreamObserver gives me the full underlying error with the specific exception and its stack trace, so, for example, a ConnectException or ConnectTimeoutException would be very clear; it's also consistent and a lot simpler to use. notifyStateChanged doesn't really tell me what exactly happened, and it's also one-off, which means I have to re-trigger it. Not only that: the docs say it's triggered only when the state diverges from the given state, yet the logs here don't show me any TRANSIENT_FAILURE.

[image: image.png]
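
For context, a minimal sketch of the per-RPC detail referred to above, assuming a hypothetical response type SomeReply:

    import io.grpc.Status;
    import io.grpc.stub.StreamObserver;

    StreamObserver<SomeReply> observer = new StreamObserver<SomeReply>() {
        @Override
        public void onNext(SomeReply value) {
            // handle a streamed message
        }

        @Override
        public void onError(Throwable t) {
            // Status.fromThrowable exposes the gRPC status; getCause() often carries the
            // underlying transport exception (e.g. ConnectException, ConnectTimeoutException).
            Status status = Status.fromThrowable(t);
            System.err.println("RPC failed: " + status.getCode() + ", cause: " + status.getCause());
        }

        @Override
        public void onCompleted() {
        }
    };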

In order to achieve the functionality I wanted, I had to dig deep inside Netty and subclass the socket channel factory, as in the snippet in the issue description above. That won't work starting from grpc 1.22+ and seems rather risky. It shouldn't be that hard.



ejona86 commented 4 years ago

I've read this issue and the related mailing list thread (https://groups.google.com/d/topic/grpc-io/-d72u547sTY/discussion). But I don't think I understand.

An RPC can only fail once; we can't call onError twice. You have to decide whether you want the notification via the RPC or not. It sounds like you are happy with the call to onError, except that the RPC then fails. We would normally suggest retrying with exponential backoff. Why does that not work?
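
A minimal sketch of that suggestion, assuming a hypothetical blocking stub stub, request request, response type SomeReply, and reporting hook reportToDashboard: each failed attempt's Status is surfaced to the user before the next retry.

    import java.util.concurrent.TimeUnit;

    import io.grpc.Status;
    import io.grpc.StatusRuntimeException;

    SomeReply callWithBackoff() throws InterruptedException {
        long backoffMillis = 100;
        while (true) {
            try {
                return stub.someCall(request);                       // one attempt
            } catch (StatusRuntimeException e) {
                Status status = e.getStatus();
                reportToDashboard(status);                           // real-time feedback on the failure
                TimeUnit.MILLISECONDS.sleep(backoffMillis);          // wait before retrying
                backoffMillis = Math.min(backoffMillis * 2, 30_000); // exponential backoff, capped
            }
        }
    }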

Wait-for-ready RPCs try to ignore connection failures, so their entire point is to not provide information about connection failures. I don't see any reasonable way to export the failure information from a wait-for-ready RPC while the RPC is still alive.

Not only that: the docs say it's triggered only when the state diverges from the given state, yet the logs here don't show me any TRANSIENT_FAILURE

I think this is because the channel immediately transitions to idle or connecting. The moment it enters TRANSIENT_FAILURE the exponential backoff time is up, because the backoff time is measured from the start of the connection attempt. This is working as designed, although it may not be ideal for you. Two of us have considered tweaking how the states work, but it would be a lot of work and we don't have the time nor the consensus yet.

Even though the state has transitioned to CONNECTING or IDLE, conceivably we could keep the "last failure" and provide an API to get at it. So it could be "okay" that the connection state API doesn't give the state you hope for. But the bigger issue is that "last failure" doesn't actually mean much from the channel's perspective. The channel uses a load balancing policy that commonly is trying to connect to multiple backends. The LB policy is the only thing that knows error details, but it only exposes that on specific RPC failures. So exposing more information here would mean changing LB policies and APIs.

I'll note that grpc-java v1.23.0 added additional debug information to the status when DEADLINE_EXCEEDED is triggered. I think in your case it would just include "waiting_for_connection," but it could probably be enhanced reasonably easily to include the last connection failure it saw. Since this is at the RPC-level, the failure information is already returned by the LB policy, so it is mainly a matter of saving it. But that doesn't seem like a solution for you since 1) the RPC would still fail, 2) you wouldn't know about the error when it happened, instead waiting for the RPC to timeout.
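
A small sketch of reading that debug detail on the client side, again with a hypothetical blocking stub stub and request request:

    import java.util.concurrent.TimeUnit;

    import io.grpc.Status;
    import io.grpc.StatusRuntimeException;

    try {
        stub.withDeadlineAfter(5, TimeUnit.SECONDS).someCall(request);
    } catch (StatusRuntimeException e) {
        if (e.getStatus().getCode() == Status.Code.DEADLINE_EXCEEDED) {
            // Since grpc-java 1.23 the status description carries extra debug detail,
            // e.g. whether the RPC was still waiting for a connection.
            System.err.println("deadline exceeded: " + e.getStatus().getDescription());
        }
    }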

emaayan commented 4 years ago

This is the same problem with enableRetry that I asked about in the mailing list thread, and that's why I said that from my perspective it looks like the same thing.

I want to re-use the logic behind enableRetry, or the one behind the connection attempts. I wouldn't want to implement an exponential backoff myself, or pull in something like the Failsafe framework, just to get back the exception. There's no reason for that if all I want is to display to the user any problems that are currently happening during the connection.

Actually, it would be even better to have enableRetry take a listener (for example enableRetry(StatusListener)), because then I would get all the errors and not just the connection ones. This would help, for example, with SLA statistics or dashboard features while still re-using gRPC's own retry.
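
To make the request concrete, a purely hypothetical sketch of the kind of listener being asked for; neither RetryStatusListener nor retryStatusListener(...) exists in grpc-java today:

    // Hypothetical: a per-attempt callback invoked while gRPC's own retry policy keeps running.
    interface RetryStatusListener {
        void onFailedAttempt(io.grpc.Status status, int attempt);
    }

    // Imagined usage on the channel builder (enableRetry() is real, the listener method is not):
    // ManagedChannelBuilder.forAddress("example.com", 443)
    //     .enableRetry()
    //     .retryStatusListener((status, attempt) ->
    //         dashboard.report("attempt " + attempt + " failed: " + status))
    //     .build();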

The DEADLINE_EXCEEDED part is less of an issue because I misinterpreted it. I thought it was the same thing as a SocketTimeoutException on read, where you finally get a socket but can't read anything from it. We are using a never-ending server-streaming RPC, and we established that upon connection we should get a protobuf signal message from the server; I thought that using DEADLINE_EXCEEDED would act like that read timeout in case we don't get that signal, but I was wrong.

So for now I'm doing it as a one-off blocking sync call to check that we're getting anything, rather than in the real async call.
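
A minimal sketch of such a one-off blocking check, assuming a hypothetical generated blocking stub whose server-streaming method subscribe is expected to send an initial signal message:

    import java.util.concurrent.TimeUnit;

    import io.grpc.StatusRuntimeException;

    boolean initialSignalReceived(HypotheticalServiceGrpc.HypotheticalServiceBlockingStub blockingStub,
                                  SubscribeRequest request) {
        try {
            return blockingStub
                    .withDeadlineAfter(10, TimeUnit.SECONDS) // bound the whole probe RPC
                    .subscribe(request)
                    .hasNext();                              // blocks until the first message or a failure
        } catch (StatusRuntimeException e) {
            return false; // e.g. DEADLINE_EXCEEDED or a connection error
        }
    }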


ejona86 commented 4 years ago

Referenced enableRetry discussion.