vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

aws_s3 Source Failing to assume_role After Update to 0.36.0 #19879

Closed: gswai123 closed this issue 7 months ago

gswai123 commented 7 months ago

Problem

We just updated our Vector deployment to version 0.36.0, and a pipeline of ours that uses the aws_s3 source with an assume_role (which grants it access to an SQS queue and an S3 bucket) suddenly stopped being able to access messages in the SQS queue. This pipeline had been running without issue for four months on previous Vector versions (the latest we used before this was 0.35.0), so it seems like a change introduced in 0.35.1 or later caused the problem.

Our other pipelines that don't rely on an assume_role are still working just fine with the newest version, which leads us to believe that something has changed with the way `auth.assume_role` works in the aws_s3 source. Does anyone have any ideas? We have a hunch it could be related to this change in 0.35.1, but aren't sure: https://github.com/vectordotdev/vector/commit/c2cc94a262ecf39798009d29751d59cc97baa0c5#diff-d6eef19144b594971c40fce6c9e73777c346036e50277c3874a665ce31bccfb8

Debug logs show an issue we've never seen before regarding environment variables, which seems like the culprit: `environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))`

Configuration

our_source_name:
    type: aws_s3
    compression: auto
    region: eu-central-1
    sqs:
      delete_message: true
      queue_url: source_sqs_queue_url
    auth:
      assume_role: assume_role_arn
      region: "us-east-1"

Version

vector 0.36.0

Debug Output

```text
2024-02-14T21:56:29.202231Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: loaded credentials provider=WebIdentityToken
2024-02-14T21:56:29.202336Z DEBUG assume_role: aws_config::sts::assume_role: retrieving assumed credentials
2024-02-14T21:56:29.202384Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Environment context=the credential provider was not enabled: environment variable not set (CredentialsNotLoaded(CredentialsNotLoaded { source: "environment variable not set" }))
2024-02-14T21:56:29.202401Z DEBUG assume_role:provide_credentials{provider=default_chain}: aws_config::meta::credentials::chain: provider in chain did not provide credentials provider=Profile context=the credential provider was not enabled: No profiles were defined (CredentialsNotLoaded(CredentialsNotLoaded { source: NoProfilesDefined }))
```

Example Data

No response

Additional Context

Here are the error logs we're seeing related to this issue: `vector::internal_events::aws_sqs: Failed to fetch SQS events. error=dispatch failure error_code="failed_fetching_sqs_events" error_type="request_failed" stage="receiving" internal_log_rate_limit=true`

References

No response

gswai123 commented 7 months ago

For the time being, I've rolled back to Vector version 0.35.0 for this pipeline, and that has resolved the issue. I would like to be able to use some of the features in 0.36.0, though, so it would be nice to get this problem sorted out. Thanks!

jszwedko commented 7 months ago

I asked this in Discord, but asking here too in case others stumble upon this issue and can fill in some of the blanks:

Thanks!

gswai123 commented 7 months ago

Hey @jszwedko! So our Vector pipelines run in Kubernetes, and each one is deployed with a service account that has been configured to assume a main Vector role in AWS that has access to all of the various AWS resources. However, in the case of this specific pipeline, we must assume a role in another account to access the specific S3 bucket that the logs are in. The main Vector AWS role has been granted access to assume this cross-account role and access the logs. This has worked as expected up until this newest update.
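
For anyone trying to follow along, the setup looks roughly like the sketch below. All account IDs, role names, and the queue URL are made up for illustration; the pod's service account is annotated for IRSA so the default credential chain picks up the main role from the web identity token, and the source then assumes the cross-account role on top of that:

```yaml
# Sketch only: account IDs, role names, and the queue URL are hypothetical.
# The pod's ServiceAccount carries an IRSA annotation along the lines of:
#   eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/vector-main-role
# which the default credential chain resolves via the web identity token.
# The aws_s3 source then assumes the cross-account role for the logs:
our_source_name:
  type: aws_s3
  compression: auto
  region: eu-central-1
  sqs:
    delete_message: true
    queue_url: https://sqs.eu-central-1.amazonaws.com/222222222222/log-queue
  auth:
    assume_role: arn:aws:iam::222222222222:role/vector-cross-account-reader
    region: "us-east-1"
```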

Also re: the debug logs: the full set of logs was just that snippet I pasted repeated over and over again. I can try to update the deployment version again to replicate the log set and post more of them if that's useful though. Thanks!

jszwedko commented 7 months ago

Gotcha, thanks! Just to confirm are you deploying in EKS and using https://docs.aws.amazon.com/eks/latest/userguide/pod-configuration.html to configure a service role for each pod?

gswai123 commented 7 months ago

> Gotcha, thanks! Just to confirm are you deploying in EKS and using https://docs.aws.amazon.com/eks/latest/userguide/pod-configuration.html to configure a service role for each pod?

Exactly!

Stazis555 commented 7 months ago

+1, this happens to me as well. I can't verify the debug log, but downgrading to 0.35 fixed the issue. The setup is basically the same as gswai123's.

andibraeu commented 7 months ago

We have the same issue using aws_s3 as a sink, running a Vector pod in EKS where the container assumes a role.
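
The config is roughly this shape (bucket, role ARN, and input names below are placeholders, not our real values):

```yaml
sinks:
  s3_archive:                    # placeholder sink name
    type: aws_s3
    inputs: ["some_source"]      # placeholder input
    bucket: example-log-archive  # placeholder bucket
    region: eu-central-1
    encoding:
      codec: json
    auth:
      assume_role: arn:aws:iam::333333333333:role/vector-s3-writer  # placeholder ARN
```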

StephenWakely commented 7 months ago

@gswai123 @andibraeu @Stazis555 I believe this issue should be fixed in our nightly builds now.

I wasn't able to reproduce the exact issue you were having (environment variables, etc.), but there was a definite bug in that area which has now been fixed.

Would it be possible to try the nightly builds to see if the issue is actually fixed for you?

Let me know if you have any questions.

jszwedko commented 7 months ago

To make the images easier to find, these would be the latest nightly images: https://hub.docker.com/r/timberio/vector/tags?page=1&name=nightly-2024-02-29 . It would be great to have validation that they fix the issue for you before we cut a v0.36.1 release πŸ™
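
For example, if you deploy with the official Helm chart, you could point at a nightly by overriding the image values roughly like this (the tag variant is an assumption; swap the suffix for whichever base image you normally run):

```yaml
# Assumes the official Vector Helm chart's image.repository / image.tag values.
# Replace -debian with -alpine, -distroless-libc, etc. to match your variant.
image:
  repository: timberio/vector
  tag: nightly-2024-02-29-debian
```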

gswai123 commented 7 months ago

@jszwedko @StephenWakely thanks for being so quick to release a fix. I'm testing out the latest nightly build now. I'll let you know if it fixes the issue in a bit!

gswai123 commented 7 months ago

The fix is working for me! We're no longer seeing any errors with the latest nightly build. Thank you!