hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Exposed GRPC health check does not work (H2C) #12569

Open seanamos opened 2 years ago

seanamos commented 2 years ago

Versions

Nomad v1.2.6
Consul v1.11.4

Operating system and Environment details

Ubuntu 20.04

Issue

I'm not sure whether this is on Nomad's side or Consul's side.

When attempting to expose a GRPC check (either with expose = true or a manual proxy -> expose -> path) without TLS (H2C/HTTP2 prior knowledge), the health check always fails.

The GRPC check only passes if a port is exposed with network -> port { to = xxx } and that port is used in the health check. This, however, leaves a port open that bypasses the service mesh.
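To rule out the application itself, the gRPC health endpoint can be queried directly over plaintext HTTP/2 from inside the allocation's network namespace, bypassing the Envoy proxy entirely. A minimal sketch, assuming grpc_health_probe is available in the task image and using a placeholder allocation ID:

# query the app's health service directly (H2C, no proxy in the path)
nomad alloc exec -task test-api <alloc-id> \
  grpc_health_probe -addr=127.0.0.1:80 -service=All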

Reproduction steps

consul agent -dev
nomad agent -dev-connect
nomad job run myjob.nomad # see below job file

Expected Result

H2C GRPC check is able to succeed.

Actual Result

H2C GRPC check always fails.

Job file (if appropriate)

job "test" {
  group "test" {
    network {
      mode = "bridge"
      # doesn't work
      # port "grpc_expose" { to = -1 }

      # does work but punches a hole in the service mesh
      # port "grpc" { to = 80 }
    }

    service {
      name = "test"
      tags = ["grpc"]
      port = "80"

      check {
        name         = "test-grpc"
        type         = "grpc"
        interval     = "15s"
        timeout      = "2s"
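        # gRPC health service name queried via grpc.health.v1.Health/Check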
        grpc_service = "All"
        expose       = true
        # doesn't work
        # port       = "grpc_expose"

        # does work but punches a hole in the service mesh
        # port       = "grpc"
      }

      connect {
        sidecar_service {
          # doesn't work
          # proxy {
          #   expose {
          #     path {
          #       path            = "/grpc.health.v1.Health/Check"
          #       protocol        = "http2"
          #       local_path_port = 80
          #       listener_port   = "grpc_expose"
          #     }
          #   }
          # }
        }
      }
    }

    task "test-api" {
      driver = "docker"
      config {
        image = "my-image-with-h2c-grpc:latest"
      }
    }
  }
}
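To see how the sidecar actually configures the local_app cluster and the exposed listener, the generated Envoy config can be dumped from the admin API. A sketch; this assumes the admin interface listens on 127.0.0.1:19000 inside the allocation (the actual bind address comes from the generated bootstrap config) and that curl is available in the task image:

# dump the sidecar's live Envoy configuration (clusters, listeners, routes)
nomad alloc exec -task test-api <alloc-id> \
  curl -s http://127.0.0.1:19000/config_dump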

While debugging this, I set the Envoy sidecar logging level to trace and I believe I found some logs relevant to this problem (trimmed to the relevant parts):

...
[2022-04-14 22:55:15.162][16][debug][http] [source/common/http/conn_manager_impl.cc:867] [C17][S17393229078070284426] request headers complete (end_stream=false):
':method', 'POST'
':scheme', 'http'
':path', '/grpc.health.v1.Health/Check'
':authority', '192.168.1.110:28739'
'content-type', 'application/grpc'
'user-agent', 'grpc-go/1.25.1'
'te', 'trailers'
'grpc-timeout', '1999097u'

[2022-04-14 22:55:15.162][16][debug][router] [source/common/router/router.cc:457] [C17][S17393229078070284426] cluster 'local_app' match for URL '/grpc.health.v1.Health/Check'
[2022-04-14 22:55:15.162][16][debug][router] [source/common/router/router.cc:673] [C17][S17393229078070284426] router decoding headers:
':method', 'POST'
':scheme', 'http'
':path', '/grpc.health.v1.Health/Check'
':authority', '192.168.1.110:28739'
'content-type', 'application/grpc'
'user-agent', 'grpc-go/1.25.1'
'te', 'trailers'
'grpc-timeout', '1999097u'
'x-forwarded-proto', 'http'
'x-request-id', '82283a81-9101-45cf-8f8f-af538d87d350'
'x-envoy-expected-rq-timeout-ms', '15000'

[2022-04-14 22:55:15.162][16][debug][pool] [source/common/http/conn_pool_base.cc:74] queueing stream due to no available connections
[2022-04-14 22:55:15.162][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:256] trying to create new connection
[2022-04-14 22:55:15.162][16][trace][pool] [source/common/conn_pool/conn_pool_base.cc:257] ConnPoolImplBase 0x4cd47f31e4b0, ready_clients_.size(): 0, busy_clients_.size(): 0, connecting_clients_.size(): 0, connecting_stream_capacity_: 0, num_active_streams_: 0, pending_streams_.size(): 1 per upstream preconnect ratio: 1
[2022-04-14 22:55:15.162][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:144] creating a new connection
[2022-04-14 22:55:15.163][16][debug][client] [source/common/http/codec_client.cc:60] [C18] connecting
[2022-04-14 22:55:15.163][16][debug][connection] [source/common/network/connection_impl.cc:890] [C18] connecting to 127.0.0.1:80
[2022-04-14 22:55:15.163][16][debug][connection] [source/common/network/connection_impl.cc:909] [C18] connection in progress
[2022-04-14 22:55:15.163][16][trace][pool] [source/common/conn_pool/conn_pool_base.cc:130] not creating a new connection, shouldCreateNewConnection returned false.
[2022-04-14 22:55:15.163][16][trace][http] [source/common/http/filter_manager.cc:553] [C17][S17393229078070284426] decode headers called: filter=0x4cd47f25cee0 status=1
[2022-04-14 22:55:15.163][16][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x4cd47f388300 for 300000ms, min is 300000ms
[2022-04-14 22:55:15.163][16][trace][http2] [source/common/http/http2/codec_impl.cc:814] [C17] about to recv frame type=0, flags=1, stream_id=1
[2022-04-14 22:55:15.163][16][trace][http2] [source/common/http/http2/codec_impl.cc:842] [C17] recv frame type=0
[2022-04-14 22:55:15.163][16][trace][http2] [source/common/http/http2/codec_impl.cc:1800] [C17] track inbound frame type=0 flags=1 length=10 padding_length=0
[2022-04-14 22:55:15.163][16][debug][http] [source/common/http/filter_manager.cc:841] [C17][S17393229078070284426] request end stream
[2022-04-14 22:55:15.163][16][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x4cd47f388300 for 300000ms, min is 300000ms
[2022-04-14 22:55:15.163][16][trace][router] [source/common/router/upstream_request.cc:246] [C17][S17393229078070284426] buffering 10 bytes
[2022-04-14 22:55:15.163][16][trace][http] [source/common/http/filter_manager.cc:691] [C17][S17393229078070284426] decode data called: filter=0x4cd47f25cee0 status=3
[2022-04-14 22:55:15.163][16][trace][http2] [source/common/http/http2/codec_impl.cc:745] [C17] dispatched 123 bytes
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:554] [C18] socket event: 2
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:663] [C18] write ready
[2022-04-14 22:55:15.163][16][debug][connection] [source/common/network/connection_impl.cc:672] [C18] connected
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:417] [C18] raising connection event 2
[2022-04-14 22:55:15.163][16][debug][client] [source/common/http/codec_client.cc:88] [C18] connected
[2022-04-14 22:55:15.163][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:294] [C18] attaching to next stream
[2022-04-14 22:55:15.163][16][debug][pool] [source/common/conn_pool/conn_pool_base.cc:177] [C18] creating stream
[2022-04-14 22:55:15.163][16][debug][router] [source/common/router/upstream_request.cc:416] [C17][S17393229078070284426] pool ready
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:477] [C18] writing 315 bytes, end_stream false
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:477] [C18] writing 20 bytes, end_stream false
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:663] [C18] write ready
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/raw_buffer_socket.cc:67] [C18] write returns: 335
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:554] [C18] socket event: 2
[2022-04-14 22:55:15.163][16][trace][connection] [source/common/network/connection_impl.cc:663] [C18] write ready
[2022-04-14 22:55:15.168][16][trace][connection] [source/common/network/connection_impl.cc:554] [C18] socket event: 3
[2022-04-14 22:55:15.168][16][trace][connection] [source/common/network/connection_impl.cc:663] [C18] write ready
[2022-04-14 22:55:15.168][16][trace][connection] [source/common/network/connection_impl.cc:592] [C18] read ready. dispatch_buffered_data=false
[2022-04-14 22:55:15.168][16][trace][connection] [source/common/network/raw_buffer_socket.cc:24] [C18] read returns: 17
[2022-04-14 22:55:15.169][16][trace][connection] [source/common/network/raw_buffer_socket.cc:24] [C18] read returns: 0
[2022-04-14 22:55:15.169][16][trace][http] [source/common/http/http1/codec_impl.cc:579] [C18] parsing 17 bytes
[2022-04-14 22:55:15.169][16][debug][client] [source/common/http/codec_client.cc:154] [C18] Error dispatching received data: http/1.1 protocol error: HPE_INVALID_CONSTANT
[2022-04-14 22:55:15.169][16][debug][connection] [source/common/network/connection_impl.cc:138] [C18] closing data_to_write=0 type=1
[2022-04-14 22:55:15.169][16][debug][connection] [source/common/network/connection_impl.cc:249] [C18] closing socket: 1
[2022-04-14 22:55:15.169][16][trace][connection] [source/common/network/connection_impl.cc:417] [C18] raising connection event 1
[2022-04-14 22:55:15.169][16][debug][client] [source/common/http/codec_client.cc:106] [C18] disconnect. resetting 1 pending requests
[2022-04-14 22:55:15.169][16][debug][client] [source/common/http/codec_client.cc:142] [C18] request reset
[2022-04-14 22:55:15.169][16][trace][main] [source/common/event/dispatcher_impl.cc:255] item added to deferred deletion list (size=1)
[2022-04-14 22:55:15.169][16][debug][router] [source/common/router/router.cc:1130] [C17][S17393229078070284426] upstream reset: reset reason: protocol error, transport failure reason: 
[2022-04-14 22:55:15.169][16][debug][http] [source/common/http/filter_manager.cc:953] [C17][S17393229078070284426] Sending local reply with details upstream_reset_before_response_started{protocol error}
[2022-04-14 22:55:15.169][16][trace][misc] [source/common/event/scaled_range_timer_manager_impl.cc:60] enableTimer called on 0x4cd47f388300 for 300000ms, min is 300000ms
[2022-04-14 22:55:15.169][16][debug][http] [source/common/http/conn_manager_impl.cc:1467] [C17][S17393229078070284426] encoding headers via codec (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: protocol error'
'date', 'Thu, 14 Apr 2022 22:55:14 GMT'
'server', 'envoy'
...
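Reading the trace, the failure point appears to be that Envoy opens the connection to local_app (C18) with its HTTP/1 codec (http1/codec_impl.cc) and then rejects the HTTP/2 response from the H2C upstream with HPE_INVALID_CONSTANT, so the proxy-to-app leg is never negotiated as HTTP/2. For anyone reproducing this, one way to raise the sidecar log level like this is through the Envoy admin API; a rough sketch, again assuming the admin interface is reachable at 127.0.0.1:19000 inside the allocation and that curl exists in the task image:

# bump the sidecar's Envoy log level to trace at runtime
nomad alloc exec -task test-api <alloc-id> \
  curl -s -X POST 'http://127.0.0.1:19000/logging?level=trace'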

EDIT: Updated attached logs, originally included logs were from a different test

Amier3 commented 2 years ago

Hey @seanamos

Thanks for bringing this to our attention. Give us some time to take a look at this and we'll get back to you with what we find 👍

ovelascoh commented 1 year ago

Hi, we are experiencing the exact same problem: we can't make the GRPC health checks work unless we expose the port, but that goes against the main objective of the mesh, which is to secure service-to-service communication. Has there been any progress on this issue?

Thanks!

CarelvanHeerden commented 1 year ago

Hello Folks

Exact same issue.

Nomad: v1.4.3
Consul: v1.14.4

Has anyone found a different workaround that we could attempt in the meantime? Exposing the port does feel counterproductive.

Thanks All

Zygimantass commented 5 months ago

Hi folks, are there any updates on this issue? We're running into the exact same problem.

jzhao20230918 commented 2 months ago

Hi, same issue. Any updates? Thanks.