vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector Does Not Properly Reload Updated AWS Credentials #18591

Open hillmandj opened 1 year ago

hillmandj commented 1 year ago


Problem

We have a process in which secrets (files) for AWS are updated via a sidecar. The vector documentation states:

If your AWS credentials expire, Vector will automatically search for up-to-date credentials in the places (and order) described above.

We have attempted various configuration changes, such as setting auth.credentials_file to point to the file path that gets updated, as indicated here. Regardless, we still got errors that the token had expired for our aws_s3 sink.
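
For reference, the auth block we experimented with on the sink looked roughly like this (trimmed to the relevant keys; the path matches the secrets directory shown below):

sinks:
  security_logs:
    type: aws_s3
    # ... other options as in the Configuration section below ...
    auth:
      credentials_file: /secrets/aws/credentials
      profile: default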

Also, as a separate experiment, we went into a pod whose credential files had recently been refreshed and manually sent a HUP signal to see whether Vector would pick them up; it did not. The only way we could get Vector to pick up the new credentials was to issue a TERM signal. We're handling this now by issuing a TERM signal to Vector whenever a separate container/process detects a change to the credential files, which seems wrong.
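
The workaround itself is just a small watcher loop in that separate container, conceptually something like this (a simplified sketch; the real paths and restart mechanics are specific to our deployment):

#!/bin/sh
# Sketch: poll the credentials file and force Vector to restart when it changes,
# so a fresh process picks up the new credentials.
last=""
while true; do
  current=$(md5sum /secrets/aws/credentials | awk '{print $1}')
  if [ -n "$last" ] && [ "$current" != "$last" ]; then
    kill -TERM "$(pidof vector)"
  fi
  last="$current"
  sleep 30
done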

In a separate open issue, it is mentioned that Vector should be able to pick this up via SIGHUP and that AWS's credential_process configuration should help, but we have not seen this work. Everything with respect to the AWS secrets is pretty standard. The contents of the secret directory look like this:

/vector# ls -la /secrets/aws/
total 28
drwxr-xr-x 2 1000 root 180 Sep 11 15:24 .
drwxrwxrwt 3 root root 200 Sep 11 15:24 ..
-rw-r--r-- 1 1000 root  72 Sep 11 15:24 config
-rw-r--r-- 1 1000 root 546 Sep 11 15:24 credentials
-rw-r--r-- 1 1000 root 581 Sep 11 15:24 credentials.json
-rw-r--r-- 1 1000 root  20 Sep 11 15:24 IAM_AWS_ACCESS_KEY_ID
-rw-r--r-- 1 1000 root  40 Sep 11 15:24 IAM_AWS_SECRET_ACCESS_KEY
-rw-r--r-- 1 1000 root 416 Sep 11 15:24 IAM_AWS_SESSION_TOKEN
-rw-r--r-- 1 1000 root  78 Sep 11 15:24 metadata

The credentials file itself looks like this:

/vector# cat /secrets/aws/credentials
[default]
aws_access_key_id=[redacted]
aws_secret_access_key=[redacted]
aws_session_token=[redacted]

And the configuration file looks like this:

/vector# cat /secrets/aws/config
[profile default]
credential_process = cat /secrets/aws/credentials.json

The credentials.json file used by credential_process follows the AWS specification, so cat should yield the result we expect based on the AWS documentation:

{
  "Version": 1,
  "AccessKeyId": "an AWS access key",
  "SecretAccessKey": "your AWS secret access key",
  "SessionToken": "the AWS session token for temporary credentials", 
  "Expiration": "ISO8601 timestamp when the credentials expire"
}  

One thing to note is that this all began after we upgraded to Vector 0.31.0 (we are on a later version of Vector now, but the problem still persists). Is there something we're missing here? It seems like this is all standard, and Vector should be able to handle changes to credential files automatically. Thanks!

Configuration

sinks:
  security_logs:
    type: aws_s3
    inputs:
      - security_output_cleanup
    bucket: [redacted]
    key_prefix: [redacted]
    compression: gzip
    region: [redacted]
    encoding:
      codec: json
    healthcheck:
      enabled: false
    batch:
      max_bytes: 20971520
      max_events: 500
      timeout_secs: 300
    buffer:
      type: disk
      max_size: 5000000000 #5GB

Version

0.32.1

Debug Output

No response

Example Data

No response

Additional Context

The AWS credentials.json file contents; notice the expiration date:

/vector# cat /secrets/aws/credentials.json
{"AccessKeyId":"[REDACTED]","SecretAccessKey":"[REDACTED]","SessionToken":"[REDACTED]","Expiration":"2023-08-17T03:04:06Z","Version":1}

Execution of a manual SIGHUP on the same pod; notice that the timestamps are earlier than the expiration date above:

{"timestamp":"2023-08-16T14:23:47.304714Z","level":"INFO","message":"Signal received.","signal":"SIGHUP","target":"vector::signal"}
{"timestamp":"2023-08-16T14:23:47.381175Z","level":"INFO","message":"Datadog API key provided. Integration with Datadog Observability Pipelines is enabled.","target":"vector::config::enterprise"}
{"timestamp":"2023-08-16T14:23:47.381227Z","level":"INFO","message":"Reloading running topology with new configuration.","target":"vector::topology::running"}
{"timestamp":"2023-08-16T14:23:47.405028Z","level":"INFO","message":"Attempting to report configuration to Datadog Observability Pipelines.","target":"vector::config::enterprise"}
{"timestamp":"2023-08-16T14:23:47.406248Z","level":"INFO","message":"Running healthchecks.","target":"vector::topology::running"}
{"timestamp":"2023-08-16T14:23:47.406465Z","level":"INFO","message":"New configuration loaded successfully.","target":"vector::topology::running"}
{"timestamp":"2023-08-16T14:23:47.406689Z","level":"INFO","message":"Starting journalctl.","target":"vector::sources::journald","span":{"component_id":"systemd","component_kind":"source","component_name":"systemd","component_type":"journald","name":"source"},"spans":[{"component_id":"systemd","component_kind":"source","component_name":"systemd","component_type":"journald","name":"source"}]}
{"timestamp":"2023-08-16T14:23:47.445069Z","level":"INFO","message":"Vector has reloaded.","path":"[File(\"config.yaml\", Some(Yaml))]","target":"vector"}
{"timestamp":"2023-08-16T14:23:47.482756Z","level":"INFO","message":"Vector config 097fd97c001853214b2a44269051bdf7ef482e905c45ec678d3fb781d12cd064 successfully reported to Datadog Observability Pipelines.","target":"vector::config::enterprise"}
{"timestamp":"2023-08-16T14:27:45.023107Z","level":"ERROR","message":"Non-retriable error; dropping the request.","error":"service error","internal_log_rate_limit":true,"target":"vector::sinks::util::retries","span":{"request_id":110,"name":"request"},"spans":[{"component_id":"security_logs","component_kind":"sink","component_name":"security_logs","component_type":"aws_s3","name":"sink"},{"request_id":110,"name":"request"}]}
{"timestamp":"2023-08-16T14:27:45.023214Z","level":"ERROR","message":"Service call failed. No retries or retries exhausted.","error":"Some(ServiceError(ServiceError { source: PutObjectError { kind: Unhandled(Unhandled { source: Error { code: Some(\"ExpiredToken\"), message: Some(\"The provided token has expired.\"), request_id: Some(\"[redacted]\"), extras: {\"s3_extended_request_id\": \"[redacted]\"} } }), meta: Error { code: Some(\"ExpiredToken\"), message: Some(\"The provided token has expired.\"), request_id: Some(\"[redacted]\"), extras: {\"s3_extended_request_id\": \"[redacted]"} } }, raw: Response { inner: Response { status: 400, version: HTTP/1.1, headers: {\"x-amz-request-id\": \"[redacted]\", \"x-amz-id-2\": \"[redacted]\", \"content-type\": \"application/xml\", \"transfer-encoding\": \"chunked\", \"date\": \"Wed, 16 Aug 2023 14:27:44 GMT\", \"server\": \"AmazonS3\", \"connection\": \"close\"}, body: SdkBody { inner: Once(Some(b\"<?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<Error><Code>ExpiredToken</Code><Message>The provided token has expired.</Message><Token-0>[redacted]</Token-0><RequestId>[redacted]</RequestId><HostId>[redacted]</HostId></Error>\")), retryable: true } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }))","request_id":110,"error_type":"request_failed","stage":"sending","internal_log_rate_limit":true,"target":"vector_common::internal_event::service","span":{"request_id":110,"name":"request"},"spans":[{"component_id":"security_logs","component_kind":"sink","component_name":"security_logs","component_type":"aws_s3","name":"sink"},{"request_id":110,"name":"request"}]}
{"timestamp":"2023-08-16T14:27:45.023287Z","level":"ERROR","message":"Events dropped","intentional":false,"count":1,"reason":"Service call failed. No retries or retries exhausted.","internal_log_rate_limit":true,"target":"vector_common::internal_event::component_events_dropped","span":{"request_id":110,"name":"request"},"spans":[{"component_id":"security_logs","component_kind":"sink","component_name":"security_logs","component_type":"aws_s3","name":"sink"},{"request_id":110,"name":"request"}]}

References

#12585

hillmandj commented 1 year ago

@jszwedko I opened this ticket up as promised. Do you have any insights? Thanks!

jszwedko commented 1 year ago

Thanks @hillmandj !

So, it is expected that Vector doesn't reload the credentials via credential_process when a SIGHUP is issued: the credential_process mechanism is wholly owned by the AWS Rust SDK, which is responsible for reloading the credentials when they are close to expiration. It sounds like you are observing that the credentials are not being reloaded automatically, though?
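
One way to confirm whether the SDK is invoking credential_process at all would be to replace the plain cat with a small wrapper script that logs each invocation before emitting the JSON, for example (hypothetical path and log file):

#!/bin/sh
# /secrets/aws/credential_process.sh (hypothetical): log each call, then emit the credentials JSON
echo "credential_process invoked at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> /tmp/credential_process.log
exec cat /secrets/aws/credentials.json

and pointing the profile at it:

[profile default]
credential_process = /secrets/aws/credential_process.sh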

hillmandj commented 1 year ago

@jszwedko, yes. The credentials were not reloaded automatically, and we had "ExpiredToken" errors on every event until we issued a SIGTERM signal; after that, the new pod was able to pick up the credentials. To be clear, Vector may have "attempted" to reload credentials, but that did not stop the "ExpiredToken" errors from our aws_s3 sink.

jszwedko commented 1 year ago

Thanks for the details @hillmandj ! I think we'll need to try to reproduce and troubleshoot this one.

hillmandj commented 1 year ago

Another potential related issue: https://github.com/vectordotdev/vector/issues/7013

Radcriminal commented 8 months ago

@hillmandj have you figured this out? I am familiar with the AWS SDK for Go, and there is no function in it that can recognize credential expiration. I bet the same is true of the Rust SDK.

Kotzyk commented 3 months ago

Any success here? We're having the exact same issue and I see the duplicates closed in December with no updates since.

Anything that would refresh the file on a SIGHUP would be wonderful 🙏

avacaru commented 2 months ago

I'm wondering if this is related:

ERROR vector::topology::builder: msg="Healthcheck failed." error=Invalid credentials component_kind="sink" component_type="aws_s3" component_id=my-sink

I'm using a Web Identity Token in a Kubernetes pod via IAM Roles for Service Accounts.

The funny thing is that the events are created in the bucket, but the healthcheck fails??

Later edit: I found out why the healthcheck fails even though the logs are delivered to the S3 bucket: the healthcheck requires the s3:ListBucket permission in the AWS policy. https://github.com/vectordotdev/vector/discussions/19792
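
For anyone else hitting this, a minimal policy for the sink plus healthcheck looks roughly like this (the bucket name is a placeholder):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}

Note that s3:ListBucket must be granted on the bucket ARN itself, while s3:PutObject applies to the object paths.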