linkerd / linkerd2

Ultralight, security-first service mesh for Kubernetes. Main repo for Linkerd 2.x.
https://linkerd.io
Apache License 2.0

linkerd-proxy panics when retrying wire-grpc requests #11529

Open Hexcles opened 11 months ago

Hexcles commented 11 months ago

What is the issue?

We saw elevated client errors after enabling retries for some gRPC routes in our ServiceProfile. Linkerd metrics show that inbound requests are much higher than outbound requests for these routes. After looking around, we found panics in the logs of the client-side linkerd-proxy.
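
The exact ServiceProfile isn't reproduced here, but a minimal sketch of the kind of retry configuration in question looks roughly like the following. The service and route names are illustrative, loosely based on the debug logs further down; only isRetryable: true on a route is needed to engage the proxy's retry middleware.

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  # ServiceProfiles are named after the service's FQDN; this name is illustrative.
  name: sessions-web.default.svc.cluster.local
  namespace: default
spec:
  routes:
    - name: POST /com.session.Sessions/WhoisByCookie
      condition:
        method: POST
        pathRegex: /com\.session\.Sessions/WhoisByCookie
      # Marking the route retryable enables the proxy's HTTP retry middleware,
      # which is where the panic below originates.
      isRetryable: true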

How can it be reproduced?

(We are trying to produce a minimal, open-source case. FWIW, we use https://square.github.io/wire/wire_grpc/ instead of standard gRPC.)

Logs, error output, etc

thread 'main' panicked at 'if our `state` was `None`, the shared state must be `Some`', /__w/linkerd2-proxy/linkerd2-proxy/linkerd/http-retry/src/replay.rs:152:22

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-10-25T10:38:18Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-10-25T10:38:27Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-25T10:38:38Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-25T18:50:02Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-viz
-----------
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2023-10-26T00:25:04Z
    see https://linkerd.io/2.14/checks/#l5d-tap-cert-not-expiring-soon for hints

Status check results are √

Environment

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

Hexcles commented 11 months ago

https://github.com/linkerd/linkerd2-proxy/blob/986d45895c0945152828e1286f7b5714520a86ba/linkerd/http-retry/src/replay.rs#L152

Hexcles commented 10 months ago

Some debug logging:

{"timestamp":"[  1269.373381s]","level":"DEBUG","fields":{"message":"client connection open"},"target":"linkerd_transport_metrics::client","spans":[{"name":"inbound"},{"port":80,"name":"server"},{"name":"backend-web.default.svc.cluster.local:80","name":"http"},{"name":"profile"},{"name":"http1"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.375560s]","level":"DEBUG","fields":{"state":"Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f1dd886c700, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 2, rx_fields: \"...\" } } } })"},"target":"linkerd_proxy_http::classify::channel","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"},{"name":"sessions-web","ns":"default","port":"80","name":"service"},{"addr":"172.17.80.145:8080","name":"endpoint"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.375597s]","level":"DEBUG","fields":{"method":"POST","uri":"http://sessions-web/com.session.Sessions/WhoisByCookie","version":"HTTP/2.0"},"target":"linkerd_proxy_http::client","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"},{"name":"sessions-web","ns":"default","port":"80","name":"service"},{"addr":"172.17.80.145:8080","name":"endpoint"},{"name":"h2"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.375605s]","level":"DEBUG","fields":{"headers":"{\"te\": \"trailers\", \"grpc-trace-bin\": \"\", \"grpc-accept-encoding\": \"gzip\", \"grpc-encoding\": \"gzip\", \"x-datadog-trace-id\": \"4009838577945735206\", \"x-datadog-parent-id\": \"6986014011649582376\", \"x-datadog-sampling-priority\": \"-1\", \"x-datadog-tags\": \"_dd.p.dm=-3\", \"traceparent\": \"00-000000000000000037a5cef10d1cf026-60f34ebae8466528-00\", \"tracestate\": \"dd=t.dm:-3\", \"rop\": \"803a8303e5668f0e058c2080c10c222d\", \"ropt\": \"http.handler\", \"pop\": \"803a8303e5668f0e058c2080c10c222d\", \"popt\": \"http.handler\", \"grpc-timeout\": \"29999m\", \"x-if-wsat\": \"<1KB of secrets>\", \"user-agent\": \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\", \"content-type\": \"application/grpc\", \"accept-encoding\": \"gzip\", \"l5d-dst-canonical\": \"sessions-web.default.svc.cluster.local:80\"}"},"target":"linkerd_proxy_http::client","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"},{"name":"sessions-web","ns":"default","port":"80","name":"service"},{"addr":"172.17.80.145:8080","name":"endpoint"},{"name":"h2"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.385088s]","level":"DEBUG","fields":{"message":"Remote proxy error"},"target":"linkerd_app_outbound::http::handle_proxy_error_headers","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"},{"name":"sessions-web","ns":"default","port":"80","name":"service"},{"addr":"172.17.80.145:8080","name":"endpoint"}],"threadId":"ThreadId(1)"}
thread 'main' panicked at 'if our `state` was `None`, the shared state must be `Some`', /__w/linkerd2-proxy/linkerd2-proxy/linkerd/http-retry/src/replay.rs:152:22
{"timestamp":"[  1269.385179s]","level":"DEBUG","fields":{"message":"dropping ResponseBody"},"target":"linkerd_proxy_http::classify::channel","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.385191s]","level":"DEBUG","fields":{"message":"sending EOS to classify"},"target":"linkerd_proxy_http::classify::channel","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.385631s]","level":"DEBUG","fields":{"message":"The client is shutting down the connection","res":"Ok(())"},"target":"linkerd_proxy_http::server","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"},{"addr":"10.100.169.20:80","name":"proxy"},{"name":"http"}],"threadId":"ThreadId(1)"}
{"timestamp":"[  1269.385671s]","level":"DEBUG","fields":{"message":"Connection closed"},"target":"linkerd_app_core::serve","spans":[{"name":"outbound"},{"client.addr":"172.17.75.208:58594","server.addr":"10.100.169.20:80","name":"accept"}],"threadId":"ThreadId(1)"}
Hexcles commented 10 months ago

@hawkw do you know how to enable RUST_BACKTRACE in linkerd-proxy?

wmorgan commented 9 months ago

@Hexcles were you able to create a repro for this?

stale[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Hexcles commented 2 months ago

Still happening. The panic site has moved, though:

https://github.com/linkerd/linkerd2-proxy/blob/837fbc9531844e5f10d7f4480555127236e6a09b/linkerd/http/retry/src/replay.rs#L152

Working on a repro

Hexcles commented 2 months ago

OK here's my complete repro:

https://github.com/Hexcles/wire/blob/grpc-sample/samples/wire-grpc-sample/k8s.yaml

  1. Create a new k8s cluster (I used kind)
  2. Install linkerd (I used linkerd CLI)
  3. kubectl apply -f k8s.yaml
  4. Wait for the pods to become ready and watch the linkerd-proxy logs in the client pod: you'll see a panic within a minute (the steps are sketched as commands below)
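
(A command-level sketch of these steps, assuming the standard kind and Linkerd CLI workflows; <client-deployment> is a placeholder for whatever the client Deployment in k8s.yaml is named:)

# 1. Create a local cluster.
kind create cluster

# 2. Install Linkerd: CRDs first, then the control plane, then verify.
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check

# 3. Deploy the wire-grpc sample (assumes the manifest already carries the
#    linkerd.io/inject annotation).
kubectl apply -f k8s.yaml

# 4. Follow the proxy logs on the client pod until the panic appears.
kubectl logs -f deploy/<client-deployment> -c linkerd-proxy
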
olix0r commented 2 months ago

I notice that your proto is:

service Whiteboard {
  rpc Whiteboard (stream WhiteboardCommand) returns (stream WhiteboardUpdate) {
  }

  rpc Echo (Point) returns (Point) {
  }
}

Are you exercising both RPCs in this scenario?

Hexcles commented 2 months ago

Nope, only the Echo. I didn't test the streaming version actually. I added the unary call for a simpler repro.

So here's the server-side code exercised:

https://github.com/Hexcles/wire/blob/fa9f1e2b7d16fc2364a62b45381d42dd9323a439/samples/wire-grpc-sample/server/src/main/java/com/squareup/wire/whiteboard/WhiteboardGrpcAction.kt#L39-L41

And client-side code:

https://github.com/Hexcles/wire/blob/fa9f1e2b7d16fc2364a62b45381d42dd9323a439/samples/wire-grpc-sample/client-simple/src/main/java/com/squareup/wire/whiteboard/SimpleGrpcClient.kt#L14

Hexcles commented 2 months ago

Note that both sides use wire-grpc, not upstream grpc-java from Google. They are supposedly compatible on the wire, but apparently there's something unique with the frames produced by wire-grpc (otherwise, you'd have a lot of bug reports from grpc users already).

olix0r commented 2 months ago

Thanks. This repro will be enough for us to track this down.

We're currently working on some other retry improvements (that will also address #12826). The good news is that I've tried your repro against the branch of new work. We're going to prioritize making the new functionality available on an edge release; but we'll follow up to ensure this underlying issue is eliminated.

Hexcles commented 2 months ago

The good news is that I've tried your repro against the branch of new work.

Do you mean you can reproduce the panic on stable, and the WIP feature in edge no longer exhibits the panic? That's great news!

olix0r commented 2 months ago

Ah, yeah. The WIP fixes the issue.

I believe it's caused by inconsistent framing emitted by wire-grpc...

A typical stream looks like:

[     http:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(3) }
[     http:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(3), flags: (0x1: END_STREAM) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(1) }

Importantly, there is a data frame with an END_STREAM flag.

On the second request, however, no such END_STREAM is set:

[     http:Connection{peer=Server}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
[     http:Connection{peer=Server}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(1), flags: (0x5: END_HEADERS | END_STREAM) }
[     http: linkerd_proxy_http::classify::channel: dropping ResponseBody
[     http:Connection{peer=Server}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(3), flags: (0x5: END_HEADERS | END_STREAM) }
[     http:Connection{peer=Server}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(5), flags: (0x4: END_HEADERS) }
[     http:Connection{peer=Server}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(5) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f4d96031e00, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 2, rx_fields: "..." } } } })
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint: linkerd_proxy_http::client: method=POST uri=http://server/com.squareup.wire.whiteboard.Whiteboard/Echo version=HTTP/2.0
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(3) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Headers { stream_id: StreamId(3), flags: (0x4: END_HEADERS) }
[     service{ns=default name=server port=80}:pool:endpoint{addr=10.42.0.80:8080}:http.endpoint:h2:Connection{peer=Client}: h2::codec::framed_read: received frame=Data { stream_id: StreamId(3) }

When the server responds before the request stream has completed, it appears to put the retry middleware into a bad state... But this is valid at the protocol level and in any case we should never crash here...

We'll update the issue when something is available to test on edge.
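
(For readers unfamiliar with the panic message itself: it comes from an assertion on the retry body's shared replay state, at the replay.rs line linked earlier in the thread. Purely as an illustration of the invariant it names, and not the proxy's actual code or the eventual fix, the difference between asserting and tolerating that invariant looks roughly like this in Rust:)

// Hypothetical sketch; the types and function are invented for illustration.
use std::sync::Mutex;

struct Shared<T>(Mutex<Option<T>>);

fn take_state<T>(local: &mut Option<T>, shared: &Shared<T>) -> Option<T> {
    // Asserting the invariant, as the panic message suggests the proxy did:
    //   local.take()
    //       .or_else(|| shared.0.lock().unwrap().take())
    //       .expect("if our `state` was `None`, the shared state must be `Some`")
    //
    // Tolerating a violated invariant instead: return None and let the caller
    // fail the single request gracefully rather than panicking the proxy.
    local.take().or_else(|| shared.0.lock().unwrap().take())
}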

olix0r commented 2 months ago

edge-24.7.5 includes support for GRPCRoute resource annotations that enable timeout and retry configurations. We'll be working on more official documentation, but I wanted to share a quick demo of how to use these new configs. I've updated the wire-grpc example manifests with a route configuration like:

---
kind: GRPCRoute
apiVersion: gateway.networking.k8s.io/v1alpha2
metadata:
  name: whiteboard-echo
  annotations:
    retry.linkerd.io/grpc: internal
    retry.linkerd.io/limit: "2"
    retry.linkerd.io/timeout: 150ms
    timeout.linkerd.io/request: 1s
spec:
  parentRefs:
    - name: whiteboard
      kind: Service
      group: core
  rules:
    - matches:
        - method:
            type: Exact
            service: com.squareup.wire.whiteboard.Whiteboard
            method: Echo
...

The retry.linkerd.io/grpc annotation can be used to configure a list of status codes:

metadata:
  annotations:
    retry.linkerd.io/grpc: cancelled,deadline-exceeded,internal,resource-exhausted,unavailable

While the demo app doesn't actually trigger timeouts or retries, we are able to observe gRPC-status-aware route metrics:

# HELP outbound_grpc_route_request_duration_seconds The time between request initialization and response completion.
# TYPE outbound_grpc_route_request_duration_seconds histogram
# UNIT outbound_grpc_route_request_duration_seconds seconds
outbound_grpc_route_request_duration_seconds_sum{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 2.269708098
outbound_grpc_route_request_duration_seconds_count{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
outbound_grpc_route_request_duration_seconds_bucket{le="0.05",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
outbound_grpc_route_request_duration_seconds_bucket{le="0.5",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
outbound_grpc_route_request_duration_seconds_bucket{le="1.0",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
outbound_grpc_route_request_duration_seconds_bucket{le="10.0",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
outbound_grpc_route_request_duration_seconds_bucket{le="+Inf",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 197
# HELP outbound_grpc_route_request_statuses Completed request-response streams.
# TYPE outbound_grpc_route_request_statuses counter
outbound_grpc_route_request_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",grpc_status="OK",error=""} 197
# HELP outbound_grpc_route_backend_requests The total number of requests dispatched.
# TYPE outbound_grpc_route_backend_requests counter
outbound_grpc_route_backend_requests_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
# HELP outbound_grpc_route_backend_response_duration_seconds The time between request completion and response completion.
# TYPE outbound_grpc_route_backend_response_duration_seconds histogram
# UNIT outbound_grpc_route_backend_response_duration_seconds seconds
outbound_grpc_route_backend_response_duration_seconds_sum{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 0.33726197
outbound_grpc_route_backend_response_duration_seconds_count{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="0.025",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="0.05",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="0.1",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="0.25",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="0.5",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="1.0",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="10.0",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
outbound_grpc_route_backend_response_duration_seconds_bucket{le="+Inf",parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name=""} 197
# HELP outbound_grpc_route_backend_response_statuses Completed responses.
# TYPE outbound_grpc_route_backend_response_statuses counter
outbound_grpc_route_backend_response_statuses_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo",backend_group="core",backend_kind="Service",backend_namespace="default",backend_name="whiteboard",backend_port="80",backend_section_name="",grpc_status="OK",error=""} 197
# HELP outbound_grpc_route_retry_limit_exceeded Retryable requests not sent due to retry limits.
# TYPE outbound_grpc_route_retry_limit_exceeded counter
outbound_grpc_route_retry_limit_exceeded_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 0
# HELP outbound_grpc_route_retry_overflow Retryable requests not sent due to circuit breakers.
# TYPE outbound_grpc_route_retry_overflow counter
outbound_grpc_route_retry_overflow_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 0
# HELP outbound_grpc_route_retry_requests Retry requests emitted.
# TYPE outbound_grpc_route_retry_requests counter
outbound_grpc_route_retry_requests_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 0
# HELP outbound_grpc_route_retry_successes Successful responses to retry requests.
# TYPE outbound_grpc_route_retry_successes counter
outbound_grpc_route_retry_successes_total{parent_group="core",parent_kind="Service",parent_namespace="default",parent_name="whiteboard",parent_port="80",parent_section_name="",route_group="gateway.networking.k8s.io",route_kind="GRPCRoute",route_namespace="default",route_name="whiteboard-echo"} 0
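
(The comment doesn't say how these metrics were collected; one way to pull them from the client's proxy yourself is the diagnostics subcommand, for example:)

# Dump the meshed client's proxy metrics and keep only the new per-route
# gRPC series; <client-deployment> is a placeholder.
linkerd diagnostics proxy-metrics deploy/<client-deployment> | grep outbound_grpc_route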

I'll leave this issue open until we ensure this is fixed in the ServiceProfile router as well.

Hexcles commented 2 months ago

IIUC HTTPRoute doesn't work together with ServiceProfile. Does GRPCRoute also not work with ServiceProfile?

olix0r commented 2 months ago

Correct, the two routing interfaces are mutually exclusive.

Hexcles commented 1 month ago

Apologies for the nudge, but any plan to fix this in ServiceProfile soon-ish? Thanks!

kflynn commented 1 month ago

@Hexcles Hey, nothing concrete yet – we're working out how to get this done.

cratelyn commented 1 week ago

Hello @Hexcles!

Thank you for your patience regarding a fix for this issue in ServiceProfile. #3216 recently fixed this bug, and I confirmed that the repro you provided above no longer panics with this patch applied.

That patch will be included in the upcoming weekly edge release. Thank you for filing this issue and for narrowing the problem down to a concise repro; it was very helpful!