grpc / grpc-go

The Go language implementation of gRPC. HTTP/2 based RPC
https://grpc.io
Apache License 2.0
20.94k stars 4.34k forks source link

1.42.0 regression in XDS #4924

Closed howardjohn closed 2 years ago

howardjohn commented 2 years ago

What version of gRPC are you using?

v1.42.0

What version of Go are you using (go version)?

1.17

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Ran Istio tests against our proxyless gRPC implementation.

What did you expect to see?

Behavior is the same as 1.41 (or really v1.42.0-dev.0.20211015201449-4757d0249e2d which we were on previously)

What did you see instead?

Regression in functionality caused by https://github.com/grpc/grpc-go/pull/4909/files. We verified setting GRPC_XDS_EXPERIMENTAL_RBAC=false resolves the issue; however an environment variable is not feasible for us since we cannot configure it on the control plane side.

The root cause seems to be some issues in the route processing.

Our HCMs do not set a RouteSpecifier: https://github.com/istio/istio/blob/c017654753cd8f4254abbe0567ff9c728d2034d2/pilot/pkg/networking/grpcgen/lds.go#L156

I think routeNamesToWatch is empty because of this. So handleRouteUpdate is called, so rdsUpdates is never set and route configuration pointer is nil is triggered is triggered. The code comments indicate this should never happen.

It seems like we can fix this by adding:

                    RouteSpecifier: &hcm.HttpConnectionManager_RouteConfig{
                        RouteConfig: &route.RouteConfiguration{
                            Name: "inbound",
                            VirtualHosts: []*route.VirtualHost{{
                                Domains: []string{"*"},
                                Routes: []*route.Route{{
                                    Match: &route.RouteMatch{
                                        PathSpecifier: &route.RouteMatch_Prefix{Prefix: "/"},
                                    },
                                    Action: &route.Route_NonForwardingAction{},
                                }}}}},
                    },

but I don't think we should be required to add this when it was not previously.

cc @zasweq @stevenctl

zasweq commented 2 years ago

Hey, I messaged @ejona86 who is leading the RBAC effort. My code is faulty in its current state because a nil route specifier would lead to the error you see. Thus, I wanted to know if a nil route specifier was even allowed. According to the proto documentation, it is a required field: https://github.com/envoyproxy/envoy/blob/56e8c45b1b340c4a4f8f02ec2488354c31806d59/api/envoy/extensions/filters/network/http_connection_manager/v3/http_connection_manager.proto#L317. Thus, I will switch the behavior of a nil route specifier on server side to the same as it was on client side, and NACK any resource that comes in server side with a nil route specifier.

howardjohn commented 2 years ago

It seems to require Route_NonForwardingAction which wasn't even added to Envoy until 1.19 (3 mo ago). We shouldn't expect the control plane is always newer than clients and can keep up with all of the changing requirements

zasweq commented 2 years ago

In regards to the change for the configuration server side needed (aka empty RDS, unlike previously where empty route configuration would lead to RPC's proceeding as normal on server side): "If they don't have RouteConfiguration, then all RPCs fail, but the config is ACKed That mainly happens when using RDS and RDS fails."

ejona86 commented 2 years ago

I'm confused by this issue. It sounds like you are surprised. We spoke specifically about this issue on Friday with you in the internal email thread.

CC @lidizheng

howardjohn commented 2 years ago

That one was about empty HTTP filter list, which broke us + was fixed a few weeks ago (since we were tracking the master branch). Very similar issue though, certainly.

This one needs something on gRPC side I think - now looking into it seems like a more explicit NACK (https://github.com/grpc/grpc-go/pull/4925) may be appropriate, but currently we hit code that says "this should never happen". We can/will also update the route config we send

zasweq commented 2 years ago

Explicit NACK, and then some way for users to know that with this change RouteConfiguration needs to be valid and incoming RPCs need to match to type NFA on Server Side now that RBAC HTTP Filter is merged seems reasonable to me.

ejona86 commented 2 years ago

That one was about empty HTTP filter list

Ah. You're right. Yeah, both are the Router/L7 plumbing.

was fixed a few weeks ago (since we were tracking the master branch)

Huh. I wonder why this wasn't caught sooner then. I'd have expected them to be noticed around the same time.

now looking into it seems like a more explicit NACK (#4925) may be appropriate

Yeah, @zasweq mentioned above it should NACK but isn't:

Thus, I will switch the behavior of a nil route specifier on server side to the same as it was on client side, and NACK any resource that comes in server side with a nil route specifier.

dfawley commented 2 years ago

This should be addressed by #4925