grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0
11.29k stars 3.8k forks source link

`PickFirstLeafLoadBalancer` does not emit `TRANSIENT_FAILURE` states #11082

Open raboof opened 3 months ago

raboof commented 3 months ago

When using channel.notifyWhenStateChanged and trying to connect to addresses that don't accept connections, the PickFirstLoadBalancer emits TRANSIENT_FAILURE states, while the PickFirstLeafLoadBalancer just stays in CONNECTING.

What version of gRPC-Java are you using?

I noticed this after updating from 1.62.2 to 1.63.0. I can also reproduce it on 1.62.2 when I set the GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST environment variable to true (which has become the default in 1.63.0).

What is your environment?

Linux (NixOS unstable), Oracle Java 1.8.0_362

What did you expect to see?

Alternating between CONNECTING and TRANSIENT_FAILURE states

What did you see instead?

Silence after entering the CONNECTING state

Steps to reproduce the bug

I don't have a particularly minimal reproducer, but can reliably show the problem with the "NonBalancingIntegrationSpecNetty" test in Akka gRPC (https://github.com/apache/pekko-grpc/pull/271#issuecomment-2041152794)

temawi commented 3 months ago

@larry-safran You might be best to investigate this one.

YifeiZhuang commented 3 months ago

I can not reproduce it yet in the unit test. The existing UT covers some basic situations for the channel to report TRANSIENT_FAILURE after CONNECTING, e.g. with multiple and one address, initial iteration complete for all the addresses https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/PickFirstLeafLoadBalancerTest.java#L333 https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/PickFirstLeafLoadBalancerTest.java#L531

raboof commented 3 months ago

Thank you for looking into this, much appreciated - I'll see if I can find the time to dig into what's different between those grpc-java UT's and the behavior I'm seeing in the pekko-grpc integration test. Might have to wait a couple of weeks, though - busy period here.

larry-safran commented 3 months ago

Are you providing multiple addresses or only a single address? Is it only in CONNECTING for a limited time or remains that way permanently?

raboof commented 3 months ago

I was providing 2 addresses (that IIRC both would respond with 'connection refused'). It seemed to remain in CONNECTING.

On 15 April 2024 23:12:13 WEST, Larry Safran @.***> wrote:

Are you providing multiple addresses or only a single address? Is it only in CONNECTING for a limited time or remains that way permanently?

-- Reply to this email directly or view it on GitHub: https://github.com/grpc/grpc-java/issues/11082#issuecomment-2057899202 You are receiving this because you authored the thread.

Message ID: @.***>

raboof commented 2 months ago

I created a possible unit test reproducer in https://github.com/grpc/grpc-java/compare/master...raboof:grpc-java:test-for-PickFirstLeafLoadBalancer-11082?expand=1 - I'm not too familiar with the grpc-java codebase, so it is possible that I'm misunderstanding something and not accurately reproducing the issue, but it might be a good starting point for further analysis. The behaviour does look similar to what I'm seeing in the pekko-grpc failure, where isPassComplete keeps returning false (because addressIndex.isValid() remains true).

pjfanning commented 1 month ago

1.64.0 seems to work fin in the Apache Pekko gRPC tests. - https://github.com/apache/pekko-grpc/pull/311

raboof commented 1 month ago

I'm OK with closing this as the Pekko gRPC tests suggests the problem is gone on 1.64. My reproducer at https://github.com/grpc/grpc-java/compare/master...raboof:grpc-java:test-for-PickFirstLeafLoadBalancer-11082?expand=1 still fails when rebased, but I guess that's likely a problem with the test rather than with the implementation.

ejona86 commented 1 month ago

We'd expect 1.64 would be fixed, because we disabled PickFirstLeafLoadBalancer (by default). But we'll be turning it on again in the future, so this is appropriate to keep open and us address.

raboof commented 1 month ago

We'd expect 1.64 would be fixed, because we disabled PickFirstLeafLoadBalancer (by default).

Ah :smile:

But we'll be turning it on again in the future, so this is appropriate to keep open and us address.

Makes sense, LMK if you want me to test anything!

larry-safran commented 1 month ago

@raboof Looking at your test code for reproducing, you are sending the subchannel state twice to the same subchannel which is why isPassComplete keeps returning false. Changing the logic for the second onSubchannelState call makes the test case pass.

    inOrder.verify(mockSubchannel2).start(stateListenerCaptor.capture());
    SubchannelStateListener stateListener2 = stateListenerCaptor.getValue();
    stateListener2.onSubchannelState(ConnectivityStateInfo.forTransientFailure(CONNECTION_ERROR));

Is it possible that the pacheko test that was failing isn't actually providing a failure on the second subchannel?

raboof commented 1 month ago

ah, sorry for botching the reproducer ;)

Is it possible that the pekko test that was failing isn't actually providing a failure on the second subchannel?

hmm, that seems unlikely, as I'm not failing subchannels explicitly there (just passing a service with two invalid addresses).

larry-safran commented 1 month ago

@raboof There was another change that went into 1.64 related to load balancers that might have also had an impact on the problem. Could you try 1.64 with GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true and see if the problem is still there?

raboof commented 4 weeks ago

OK, I can confirm our test still fails on 1.64 with GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true, but there seems to be something else going on than I assumed earlier:

In the pekko test we're passing a list of two EquivalentAddressGroups to io.grpc.NameResolver.Listener#onAddresses that both contain the same (invalid) address, ending up with a single subchannel but an addressIndex of size 2. Indeed when we pass two different invalid addresses it works fine.

Is that invalid input or something you'd like to support/handle?