vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.46k stars 1.53k forks source link

vector is sometimes not re-requesting credentials before they expire when writing to aws_s3 sink #11837

Closed breathe closed 2 years ago

breathe commented 2 years ago

A note for the community

Problem

I have been seeing occasional spurious authentication failures from vector across our fleet. The vector process suddenly starts producing output indicating failures to authenticate to s3:

Mar 11 12:20:06 some.host vector[21633]: 2022-03-11T12:20:06.049907Z ERROR sink{component_kind="sink" component_id=log-sink-s3 component_type=aws_s3 component_name=log-sink-s3}:request{request_id=1886}: vector_core::stream::driver: Service call failed. error=Unknown(BufferedHttpResponse {status: 403, body: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>W2FAZWNKW59F8G1S</RequestId><HostId>QDXb8IWETX2bx05KM1dl9Prbkg5ZTLg+Mob5I5gZWc2DKs29w+CSIaTasSA5j8sk7pzLZw9LqX0=</HostId></Error>", headers: {"x-amz-request-id": "W2FAZWNKW59F8G1S", "x-amz-id-2": "QDXb8IWETX2bx05KM1dl9Prbkg5ZTLg+Mob5I5gZWc2DKs29w+CSIaTasSA5j8sk7pzLZw9LqX0=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 11 Mar 2022 12:20:05 GMT", "server": "AmazonS3", "connection": "close"} }) request_id=1886
Mar 11 12:25:06 some.host vector[21633]: 2022-03-11T12:25:06.426825Z ERROR sink{component_kind="sink" component_id=log-sink-s3 component_type=aws_s3 component_name=log-sink-s3}:request{request_id=1887}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=Request ID: None Body: <?xml version="1.0" encoding="UTF-8"?>
Mar 11 12:25:06 some.host vector[21633]: <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>1JPQM4Z0H9JSF001</RequestId><HostId>9WNLSsFZtJQUB2sri5dEpf9itKz3RGw3izgssGUH+4kcUZz1N75pkArZAGVHuY7wKj55WbaIQOI=</HostId></Error>
Mar 11 12:25:06 some.host vector[21633]: 2022-03-11T12:25:06.426876Z ERROR sink{component_kind="sink" component_id=log-sink-s3 component_type=aws_s3 component_name=log-sink-s3}:request{request_id=1887}: vector_core::stream::driver: Service call failed. error=Unknown(BufferedHttpResponse {status: 403, body: "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>1JPQM4Z0H9JSF001</RequestId><HostId>9WNLSsFZtJQUB2sri5dEpf9itKz3RGw3izgssGUH+4kcUZz1N75pkArZAGVHuY7wKj55WbaIQOI=</HostId></Error>", headers: {"x-amz-request-id": "1JPQM4Z0H9JSF001", "x-amz-id-2": "9WNLSsFZtJQUB2sri5dEpf9itKz3RGw3izgssGUH+4kcUZz1N75pkArZAGVHuY7wKj55WbaIQOI=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Fri, 11 Mar 2022 12:25:05 GMT", "server": "AmazonS3", "connection": "close"} }) request_id=1887
...

This seems to occasionally affect a single host. Its been observed to self-correct after a few minutes -- but most recently the error persisted for several hours on a host without self-correcting -- so I restarted the vector process manually and that resolved the issue ...

Vector is configured to pull credentials for a role with write privileges to the relevant bucket using an aws credential_process -- and I suspect that the vector agent isn't always refreshing credentials prior to them expiring.

Configuration

$ vector --version
vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)
$ sudo cat /etc/vector/vector.yaml
api:
    enabled: true
data_dir: /var/lib/vector
sinks:
    log-sink-s3:
        bucket: ${VECTOR_SINK_BUCKET}
        compression: gzip
        encoding:
            codec: ndjson
        healthcheck:
            enabled: true
        inputs:
        - tag-logs
        key_prefix: date=%Y/%m/%d/%H/
        region: ${VECTOR_SINK_BUCKET_REGION}
        type: aws_s3
sources:
    internal-metric-vector:
        scrape_interval_secs: 5
        type: internal_metrics
    log-core:
        include:
        - /media/ephemeral/...
        - /media/ephemeral/...
        - /media/ephemeral/...
        - /media/ephemeral/...
        max_line_bytes: 1048576
        read_from: beginning
        type: file
transforms:
    tag-logs:
        inputs:
        - log-*
        source: '.env = "${VECTOR_ENV}"
            .provider = "${VECTOR_PROVIDER_NAME}"
            .region_group = "${VECTOR_REGION_GROUP}"
            .cluster = "${VECTOR_CLUSTER}"
            .replica = "${VECTOR_REPLICA}"
            .region = "${VECTOR_REGION}"
            .host = "${VECTOR_HOST}"
            .sovereignty = "${VECTOR_SOVEREIGNTY}"
            '
        type: remap
$ cat /etc/default/vector
VECTOR_ENV=...
VECTOR_PROVIDER_NAME=...
VECTOR_REGION_GROUP=...
VECTOR_CLUSTER=...
VECTOR_REPLICA=...
VECTOR_REGION=...
VECTOR_HOST=...
VECTOR_SOVEREIGNTY=...
VECTOR_SINK_BUCKET_REGION=...
VECTOR_SINK_BUCKET=...
AWS_CONFIG_FILE=/etc/vector/vector_aws_config
VECTOR_CONFIG_YAML=/etc/vector/vector.yaml
$ cat /etc/vector/vector_aws_config
[default]
credential_process=/usr/local/bin/get-aws-credential-command
$ sudo -u vector /usr/local/bin/get-aws-credential-command
{
  "Version": 1,
  "AccessKeyId": "XXX",
  "SecretAccessKey": "XXX",
  "SessionToken": "XXX",
  "Expiration": "2022-03-11T23:38:47+0000"
}

Version

vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

Debug Output

I enabled debug output to see if it provided any additional clues -- but it didn't offer anything sugestive

Example Data

No response

Additional Context

Vector is being run by systemd

systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

And we are using the prebuilt rpm

rpm -qa | grep vector
vector-0.20.0-1.x86_64

References

No response

jszwedko commented 2 years ago

Hi @breathe !

Thanks for reporting. We are in the process of replacing the deprecated rusoto library with the AWS maintained SDK. I'd be curious if you still see this behavior once that is complete. If you do, we can dig in then. The aws_s3 sink SDK replacement should happen in time for 0.21.0 next week.

breathe commented 2 years ago

I'd be curious if you still see this behavior once that is complete

Alright that sounds promising -- will plan to upgrade to 0.21.0 after it drops and observe

breathe commented 2 years ago

I saw that 0.20.1 is out -- https://vector.dev/releases/0.20.1/#changelog -- I believe this release does not include the migration of the aws_s3 sink off of rusoto but just wanted to double-check ...

jszwedko commented 2 years ago

Hi @breathe ! That's correct, but it will be in 0.21.0 which should go out this week.

breathe commented 2 years ago

I upgraded to 0.21.0 in one of our environments -- unfortunately it seems that 0.21.0 is not able to authenticate for usage of the aws_s3 sink at all ...

I had checked over the changelog and I don't see anything that needs adjusting on our side. I've a copy of our configuration above (still the same). We already used the AWS_CONFIG_FILE env var to supply a custom path to an aws credentials file -- and that credentials file uses the credentials_process sdk configuration parameter (which gets short-lived credentials via an external process).

We also had the region parameter for the target bucket already defined in the s3 sink -- but I tried defining that parameter in the aws config file as well with no change in behavior ...

Here are some example error logs if I disable the health-check in order to try to get more info ...

Apr 19 14:50:13 somehost vector[27408]: 2022-04-19T14:50:13.985185Z ERROR sink{component_kind="sink" component_id=log-sink-s3 component_type=aws_s3 component_name=log-sink-s3}:request{request_id=208}: vector::sinks::util::retries: Non-retriable error; dropping the request. error=Error { code: "AccessDenied", message: "Access Denied", request_id: "SWTXZ1G45FAPX245", s3_extended_request_id: "u38tMil6E44zlnHeAzwWTC563SjLYx/feAajo5CITzkAPseDauVoh9pg7pO+4pwLl2Rnd55u4k8=" }
Apr 19 14:50:13 somehost vector[27408]: 2022-04-19T14:50:13.985302Z ERROR sink{component_kind="sink" component_id=log-sink-s3 component_type=aws_s3 component_name=log-sink-s3}:request{request_id=208}: vector_core::stream::driver: Service call failed. error=ServiceError { err: PutObjectError { kind: Unhandled(Error { code: Some("AccessDenied"), message: Some("Access Denied"), request_id: Some("SWTXZ1G45FAPX245"), extras: {"s3_extended_request_id": "u38tMil6E44zlnHeAzwWTC563SjLYx/feAajo5CITzkAPseDauVoh9pg7pO+4pwLl2Rnd55u4k8="} }), meta: Error { code: Some("AccessDenied"), message: Some("Access Denied"), request_id: Some("SWTXZ1G45FAPX245"), extras: {"s3_extended_request_id": "u38tMil6E44zlnHeAzwWTC563SjLYx/feAajo5CITzkAPseDauVoh9pg7pO+4pwLl2Rnd55u4k8="} } }, raw: Response { inner: Response { status: 403, version: HTTP/1.1, headers: {"x-amz-request-id": "SWTXZ1G45FAPX245", "x-amz-id-2": "u38tMil6E44zlnHeAzwWTC563SjLYx/feAajo5CITzkAPseDauVoh9pg7pO+4pwLl2Rnd55u4k8=", "content-type": "application/xml", "transfer-encoding": "chunked", "date": "Tue, 19 Apr 2022 14:50:13 GMT", "server": "AmazonS3", "connection": "close"}, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>SWTXZ1G45FAPX245</RequestId><HostId>u38tMil6E44zlnHeAzwWTC563SjLYx/feAajo5CITzkAPseDauVoh9pg7pO+4pwLl2Rnd55u4k8=</HostId></Error>")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } } request_id=208

Is credentials_process not supported via the official rust aws-sdk?

spencergilbert commented 2 years ago

Is credentials_process not supported via the official rust aws-sdk?

https://github.com/vectordotdev/vector/issues/12271 :/

breathe commented 2 years ago

Ahh .. bummer -- thanks for the link. I'm upvoting that issue!

jszwedko commented 2 years ago

Apologies @breathe . We missed that credentials_process also isn't supported by the new AWS SDK. We'll be adding a note to the docs.

We do have an upcoming feature in Vector to load secrets from an external process that could be used as a workaround: #11985

jszwedko commented 2 years ago

Hi @breathe . Just a note that credentials_process support added to the new AWS SDK. It'll be available in the next Vector release: 0.23.0.