vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Ensure `vector` sources/sink works with AWS ALBs #9901

Open jszwedko opened 2 years ago

jszwedko commented 2 years ago

We should set up a demo in https://github.com/timberio/vector-demos using an AWS ALB. If I remember correctly, the last time we tried this we ran into a gRPC protocol error that we haven't resolved yet.

spencergilbert commented 2 years ago

I think it was something like https://stackoverflow.com/questions/60044046/how-to-setup-grpc-ingress-on-gke-w-nginx-ingress which I'm also running into while testing against nginx-ingress

gbregar commented 2 years ago

The error I get on the agent Vector instance while trying to send data to an aggregator behind an AWS ALB is "http2 error: connection error detected: frame with invalid size". The only relevant result I found for this error message is: https://github.com/hyperium/hyper/issues/1574

tesharp commented 2 years ago

I was running into the same problem with nginx-ingress while trying to do TLS termination on the ingress.

As in the post @spencergilbert linked to, the problem is with the initial "PRI * HTTP/2.0" request (the HTTP/2 connection preface). Nginx just returns 400. Strangely enough, nginx also reports the scheme for this request as http and not https, even when TLS is enabled.

After lots of testing I did manage to get it to work based on the post linked by @gbregar (thanks!)

Setting the connection to use only h2 protocol in sinks\vector\v2\config.rs -> new_client:

https.set_callback(move |c, _uri| {
    if let Some(settings) = &settings {
        c.set_alpn_protos(b"\x02h2")?;
        settings.apply_connect_configuration(c);
    }

    Ok(())
});

It works! I don't know if this is the right fix, but at least it fixed the problem with nginx ingress.
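For context on the `b"\x02h2"` literal above: the openssl crate's `set_alpn_protos` expects the protocol list in ALPN wire format, where each protocol name is preceded by a single byte holding its length. A minimal sketch of that encoding follows; `encode_alpn` is a hypothetical helper written for illustration, not part of Vector or the openssl crate.

```rust
/// Encode protocol names into ALPN wire format: every name is preceded by
/// one byte holding its length in bytes.
/// Returns None if a name is empty or longer than 255 bytes.
/// NOTE: hypothetical helper for illustration only.
fn encode_alpn(protocols: &[&str]) -> Option<Vec<u8>> {
    let mut wire = Vec::new();
    for proto in protocols {
        let bytes = proto.as_bytes();
        if bytes.is_empty() || bytes.len() > 255 {
            return None;
        }
        wire.push(bytes.len() as u8); // length prefix
        wire.extend_from_slice(bytes); // protocol name
    }
    Some(wire)
}
```

With this helper, `encode_alpn(&["h2"])` yields the same bytes as the hard-coded `b"\x02h2"`, which advertises only HTTP/2 during the TLS handshake.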

jszwedko commented 2 years ago

Interesting, thank you for that note @tesharp ! We'll test this and integrate it.

mr-karan commented 1 year ago

Is there any more progress on this? I'd really appreciate some help in using ALBs for Vector sources. Since I am using version=2 (which is the gRPC version), I've set the protocol_version for target groups to be gRPC. However, I am unsure what the correct healthcheck config should be here:

Port 6000 is where vector source listens on.

This is what I have and it's failing:

image

gopiio commented 1 year ago

The healthcheck path is incorrect. If you set the backend-protocol to GRPC, then the path should be /grpc.health.v1.Health/Check
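As background on why the path has this shape: gRPC maps every call to an HTTP/2 request path of the form "/<fully qualified service name>/<method>", and the standard gRPC health checking protocol defines the service grpc.health.v1.Health with the method Check. A minimal sketch; `grpc_path` is a hypothetical helper for illustration, not something Vector or AWS provides.

```rust
/// Build the HTTP/2 request path used for a gRPC call.
/// gRPC sends each call to "/<fully.qualified.Service>/<Method>".
/// NOTE: hypothetical helper for illustration only.
fn grpc_path(service: &str, method: &str) -> String {
    format!("/{}/{}", service, method)
}
```

Calling `grpc_path("grpc.health.v1.Health", "Check")` produces the healthcheck path given above.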

mr-karan commented 1 year ago

@gopiio Thanks for the help! The following config worked for healthcheck.

image

However, I am still unable to connect with the vector sink. Is there anything I am missing here?

image

Verbose logging:

2022-10-15T06:38:36.072903Z DEBUG hyper::client::connect::http: connecting to 172.31.yyy.xxx:443
2022-10-15T06:38:36.112687Z DEBUG hyper::client::connect::http: connected to 172.31.yyy.xxx:443
2022-10-15T06:38:36.199539Z DEBUG h2::client: binding client connection
2022-10-15T06:38:36.199585Z DEBUG h2::client: client connection bound
2022-10-15T06:38:36.200495Z DEBUG h2::codec::framed_write: send frame=Settings { flags: (0x0), enable_push: 0, initial_window_size: 2097152, max_frame_size: 16384 }
2022-10-15T06:38:36.201429Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=WindowUpdate { stream_id: StreamId(0), size_increment: 5177345 }
2022-10-15T06:38:36.201483Z DEBUG hyper::client::pool: pooling idle connection for ("https", alb-address:443)
2022-10-15T06:38:36.202630Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Headers { stream_id: StreamId(1), flags: (0x4: END_HEADERS) }
2022-10-15T06:38:36.202683Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1) }
2022-10-15T06:38:36.202699Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=Data { stream_id: StreamId(1), flags: (0x1: END_STREAM) }
2022-10-15T06:38:36.240701Z DEBUG Connection{peer=Client}: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", FRAME_SIZE_ERROR, Library)
2022-10-15T06:38:36.240739Z DEBUG Connection{peer=Client}: h2::codec::framed_write: send frame=GoAway { error_code: FRAME_SIZE_ERROR, last_stream_id: StreamId(0) }
2022-10-15T06:38:36.240748Z DEBUG Connection{peer=Client}: h2::proto::connection: Connection::poll; connection error error=GoAway(b"", FRAME_SIZE_ERROR, Library)
2022-10-15T06:38:36.240785Z DEBUG hyper::client::client: client connection error: http2 error: connection error detected: frame with invalid size
2022-10-15T06:38:36.241060Z DEBUG hyper::proto::h2::client: connection error: connection error detected: frame with invalid size
2022-10-15T06:38:36.241367Z DEBUG hyper::proto::h2::client: client response error: connection error detected: frame with invalid size
2022-10-15T06:38:36.241426Z ERROR vector::topology::builder: msg="Healthcheck: Failed Reason." error=Vector source unhealthy component_kind="sink" component_type="vector" component_id=vector_out component_name=vector_out

Sink config:

[sinks.vector_out]
type = "vector"
inputs = ["tag_events"]
address = "alb-address:443"
version = "2"

Source config:

[sources.vector_agents]
type = "vector"
address = "0.0.0.0:6000"
version = "2"

When cURLing the ALB, I see this:

curl -X POST --verbose -I --http2 https://alb-address

*   Trying 172.31.xxx.yyy:443...
* Connected to alb-address (172.31.xxx.yyy) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS header, Finished (20):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS header, Finished (20):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.domain.tld
*  start date: Dec 24 00:00:00 2021 GMT
*  expire date: Jan  7 23:59:59 2023 GMT
*  subjectAltName: host "alb-address" matched cert's "*.domain.tld"
*  issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Domain Validation Secure Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* Using Stream ID: 1 (easy handle 0x5642781d2e80)
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
> POST / HTTP/2
> Host: alb-address
> user-agent: curl/7.81.0
> accept: */*
> 
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
< HTTP/2 502 
HTTP/2 502 
< server: awselb/2.0
server: awselb/2.0
< date: Sat, 15 Oct 2022 07:05:06 GMT
date: Sat, 15 Oct 2022 07:05:06 GMT
< content-type: text/html
content-type: text/html
< content-length: 122
content-length: 122

< 
* Excess found: excess = 122 url = / (zero-length body)
* Connection #0 to host alb-address left intact

jszwedko commented 1 year ago

Hi @gopiio !

I think you need to set: tls.apln_protocols = "h2" on the vector sink. Or, at least, one other user had success with that and AWS ALBs. Hopefully this works for you too! Assuming it does, we should add this mention to the documentation.

dizlv commented 1 year ago

@jszwedko I just tried to connect two Vector topologies with a vector sink and source in between, and it failed.

  1. Topology 1 -> AWS ALB -> Topology 2
  2. The AWS ALB has an HTTPS listener, and HTTP with gRPC in the Target Group

The component has the following configuration:

tls.apln_protocols = "h2"
tls.enabled = true

And it fails with

{"timestamp":"2022-11-22T15:50:50.975966Z","level":"WARN","message":"Retrying after error.","error":"Request failed: status: Unknown, message: \"http2 error: connection error detected: frame with invalid size\", details: [], metadata: MetadataMap { headers: {} }","internal_log_rate_limit":true,"target":"vector::sinks::util::retries","span":{"component_id":"sink_name","component_kind":"sink","component_name":"sink_name","component_type":"vector","name":"sink"},"spans":[{"component_id":"sink_name","component_kind":"sink","component_name":"sink_name","component_type":"vector","name":"sink"}]}

Do you have any working examples or tips on making this work?

Thank you!

jszwedko commented 1 year ago

Hmm, that's odd. Another user had reported it working for them with that configuration. Which version of Vector? You set it on the sink, correct?

dizlv commented 1 year ago

Yes, it is configured on the sink, Vector 0.25.1

jszwedko commented 1 year ago

Gotcha, I'm not sure why it isn't working for you then unfortunately. Someone will probably have to dig back into this issue. It is in our backlog.

dizlv commented 1 year ago

I had some more time to investigate this issue and found some rough edges, but in the end I successfully launched a working setup on the AWS stack.

Working configuration file:

[sources.test]
type = "demo_logs"
format = "shuffle"
lines = ["hello, world"]
interval = 2

[sinks.vector]
type = "vector"
tls.enabled = true
tls.alpn_protocols = ["h2"]  # Note: must be a list, not a string. Do not mistype the field name: Vector won't notice and will just run with the default value (None).
inputs = ["test"]
address = "https://fqdn:443"  # Note: do not forget https! It is important. Otherwise the code does some magic and provisions default values.

And source:

[sources.vector]
type = "vector"
address = "0.0.0.0:9000"
version = "2"

  1. Create an Application Load Balancer with an HTTPS listener (do not forget to attach a proper certificate).
  2. Create a Target Group with the HTTP protocol (since TLS is offloaded on the ALB) and the configured port (9000 in my example). The protocol version should be gRPC and the health-check path /grpc.health.v1.Health/Check.

Why it did not work before:

I think you need to set: tls.apln_protocols = "h2" on the vector sink

  1. The proper field name is alpn_protocols, not apln_protocols
  2. The proper value is a list containing "h2"
  3. Apparently the code quietly applies http as the protocol if none is specified in the address field

I think that in general there are 2 issues:

  1. Vector does not complain about wrong configuration values (at least for these TLS options)
  2. "h2" as a str is accepted as the value of the Vec field

I'm not sure whether the third one (some non-intuitive default-value logic) is a design problem, a lack of documentation, or a problem at all.
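To illustrate the first issue, a loader could reject keys it does not recognize instead of silently ignoring them. This is a minimal standard-library sketch of that idea; `unknown_keys` and the key lists are hypothetical, and this is not how Vector's configuration loading is implemented.

```rust
/// Return every key in `provided` that does not appear in `known`,
/// preserving the order in which the keys were provided.
/// NOTE: hypothetical helper for illustration only.
fn unknown_keys<'a>(provided: &[&'a str], known: &[&str]) -> Vec<&'a str> {
    provided
        .iter()
        .copied()
        .filter(|key| !known.iter().any(|k| *k == *key))
        .collect()
}
```

Given the known TLS keys `["enabled", "alpn_protocols"]`, the mistyped `apln_protocols` would be reported as unknown, and a loader could turn that into a startup error.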

tim-klarna commented 1 year ago

So vector doesn't fail with typos in configuration keys? (apln_protocols vs alpn_protocols). And also accepts garbage values and invalid types in configuration too.

I can see now why I couldn't get this working when I tried 😅

tim-klarna commented 1 year ago

Oh, and for security reasons, probably better to default to https and not http!

spencergilbert commented 1 year ago

For the keys, unfortunately that's on a per component basis right now, and we've not implemented that consistently - I have marked it as a place to improve the UX.

On the value side, things are generally string options rather than enums, which is another place we can improve if there is a known set of possible options.

hong823 commented 6 months ago

We also had trouble getting an ALB working with Vector. After some trial and error we got it working with our ALB + EKS setup using the AWS ALB Controller. In the version of Vector we're using, the health check endpoint seems to be /vector.Vector/HealthCheck.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: <CERT ARN>
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/backend-protocol-version: "GRPC"
    alb.ingress.kubernetes.io/backend-protocol: "HTTP"

    alb.ingress.kubernetes.io/healthcheck-path: /vector.Vector/HealthCheck
    alb.ingress.kubernetes.io/healthcheck-port: "9000"
    alb.ingress.kubernetes.io/success-codes: "0"
  name: ingress-test
spec:
  ingressClassName: alb
  rules:
    - host: vector-intake.example.com
      http:
        paths:
          - backend:
              service:
                name: vector
                port:
                  number: 9000
            pathType: ImplementationSpecific

Vector config:

sinks:
  vector:
    type: vector
    address: https://vector-intake.example.com
    version: "2"
    tls:
      enabled: true
      alpn_protocols: ["h2"]

sources:
  vector_in:
    type: vector
    address: 0.0.0.0:9000
    version: "2"

We tried to get it working with the nginx ingress controller but kept getting an HTTP 464 error.

Hope this is helpful for anyone.