vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector does not refresh workload identity token on time. #20231

Open infor7 opened 5 months ago

infor7 commented 5 months ago


Problem

Hi!

I am returning with an issue similar to https://github.com/vectordotdev/vector/issues/18432

In the patch for that issue you added retries on 401. The data does eventually get sent to the bucket, but every hour Vector keeps using the old, expired token for a random period of time, sometimes as long as 18 minutes. For logs that is not a problem, but we also calculate near-real-time metrics from this data, and when the data is held back for 18 minutes we see huge drops in those metrics.

The Workload Identity token lifecycle looks like this: Token 1 is created (expiry time 3599 s) -> when Token 1's remaining lifetime drops to 300 s, Token 2 is issued (expiry time 3599 s) -> Token 1 expires when it reaches 0 s while Token 2 is still usable (expiry time 3299 s) -> when Token 2's remaining lifetime drops to 300 s, Token 3 is issued (expiry time 3599 s) -> and so on.
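To see this rotation window for yourself, a minimal sketch like the one below can be run from inside the pod (it assumes curl and jq are available there; the 60-second interval is arbitrary). It simply prints the expires_in value reported by the metadata server, which counts down toward 300 s and then jumps back to ~3599 s when the next token is issued.

# Print the remaining lifetime of the currently served token once a minute.
while true
do
  curl -s http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token \
    --header "metadata-flavor: Google" | jq ".expires_in"
  sleep 60
done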

To ensure that this is not a GCP Workload Identity issue, I ran a few tests.

I logged into a pod experiencing the 401 problem and ran an artificial test that sends data to the bucket. In the first test I generated a token, recorded its expiry time, and kept sending data using that token. After every file creation I also made a request to GCP to refresh the token, while the files continued to be created with the old one. The purpose of this test was to check whether the old token breaks when a new token is issued: when the old token has 300 seconds left to expiry, GCP creates a new token and both of them should be valid. Using the script below I created files in the bucket without any problem, which confirms that Workload Identity works as expected. The old token allows saving files even up to 120 seconds after its expiry; after that, only the new token works.

BUCKET=<bucket>

# Fetch the current access token from the GCE metadata server
RESPONSE=$(curl -s http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity")

TOKEN=$(echo "$RESPONSE" | jq -r ".access_token")
EXPIRY=$(echo "$RESPONSE" | jq -r ".expires_in")
EPOCH_EXPIRY=$(( EXPIRY + $(date +%s) ))

echo "$TOKEN"
echo "$EXPIRY"

while true
do
  # REGENERATE TOKEN: hit the metadata server again; once the current token has
  # less than 300 s left, this returns a brand-new token
  curl -s -o /dev/null http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity"

  # SEND DATA: upload an object with the ORIGINAL token; the object body is the
  # number of seconds left until that token expires
  curl -X PUT --header "x-goog-storage-class: STANDARD" --header "content-type: application/json" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity" --header "authorization: Bearer $TOKEN" -d "$(( EPOCH_EXPIRY - $(date +%s) ))" "https://storage.googleapis.com/$BUCKET/expiry/$(date +%s).log"

  sleep 1
done

For the second test I ran a nearly identical setup. This time I wanted to confirm that a new token works from the very beginning, i.e. during the window when both the old and the new token are valid. I logged into a problematic pod again and ran the script below. In the bucket I see files containing the initial expiry time of the new token (3599), which proves that new tokens are usable right away.

BUCKET=<bucket>

while true
do
  # Fetch a token on every iteration; once the previous token has less than
  # 300 s left, the metadata server returns a brand-new token
  RESPONSE=$(curl -s http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token --header "metadata-flavor: Google" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity")

  TOKEN=$(echo "$RESPONSE" | jq -r ".access_token")
  EXPIRY=$(echo "$RESPONSE" | jq -r ".expires_in")
  EPOCH_EXPIRY=$(( EXPIRY + $(date +%s) ))

  echo "$TOKEN"
  echo "$EXPIRY"
  echo "$EPOCH_EXPIRY"

  # SEND DATA: upload with the freshly fetched token; the object body is the
  # remaining lifetime of that token at upload time
  curl -X PUT --header "x-goog-storage-class: STANDARD" --header "content-type: application/json" --header "user-agent: Vector/0.24.0 (x86_64-unknown-linux-gnu 43267b9 2022-08-30)" --header "accept-encoding: identity" --header "authorization: Bearer $TOKEN" -d "$(( EPOCH_EXPIRY - $(date +%s) ))" "https://storage.googleapis.com/$BUCKET/expiry/$(date +%s).log"

  sleep 1
done

I think this is a problem with Vector; Workload Identity itself works like a charm. A workaround of restarting Vector every hour mitigates the issue completely, but it creates another problem: an excessive amount of data points in our monitoring systems.

Could you decrease the time between Workload Identity token refreshes in Vector?
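For reference, the hourly-restart workaround mentioned above is roughly the following (a minimal sketch; the namespace, the deployment name, and running kubectl from a cron host are assumptions about our setup, not something Vector provides):

# Hypothetical crontab entry that restarts the Vector workload once an hour so it
# always starts with a fresh Workload Identity token; names are placeholders.
0 * * * * kubectl -n observability rollout restart deployment/vector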

Configuration

"sinks":
  "analytics_bucket":
    "batch":
      "max_bytes": 134217728
      "timeout_secs": 60
    "bucket": "<bucket_name>"
    "compression": "none"
    "encoding":
      "codec": "text"
    "inputs":
      - "analytics_enhance"
    "key_prefix": "download-proxy/"
    "type": "gcp_cloud_storage"
"sources":
  "analytics_logs":
    "data_dir": "/var/analytics"
    "exclude":
      - "*.gz"
      - "*.zip"
      - "*.swp"
    "include":
      - "/var/analytics/**/*.log"
      - "/var/analytics/**/*.jsonl"
      - "/var/log/application/v2_hdfs*/**/*.log"
      - "/var/log/application/v2_hdfs*/**/*.jsonl"
    "type": "file"
"transforms":
  "analytics_enhance":
    "inputs":
      - "analytics_logs"
    "source": |
      new_log = {}

      # Apply standard tags
      .az_name = "${AZ_NAME:-unknown_az}"
      .region = "${REGION:-unknown_region}"

      event = {}
      # special handling for v2_hdfs_logs
      if contains(string!(.file), "v2_hdfs_logs") {
          matches = parse_regex!(.file, r'.*/v2_hdfs_logs_(?P<topic>[a-zA-Z0-9_-]+).log')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          # These events are expected to be url-encoded with prefix
          # Prefix is unneeded, and is discarded
          this = split!(del(.message), " ")
          event, err = parse_query_string(this[-1])
          if err != null {
              # Invalid event
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      } else {
          matches = parse_regex!(.file, r'.*/(?P<topic>[a-zA-Z0-9_-]+).(log|jsonl)')
          new_log._autolog_topic = get(matches, ["topic"]) ?? "unknown-topic"

          this = del(.message)
          # Try to parse as json then url-encoded query string
          event, err = parse_json(this) ?? parse_query_string(this)
          if err != null {
              log(err, level: "warn", rate_limit_secs: 60)
              abort
          }
      }

      # Add each field of the message to the new event using percent encoding
      for_each(object!(event)) -> |key, value| {
          encoded_value = encode_percent(to_string!(value))
          new_log = set!(new_log, path: [key], data: encoded_value)
      }

      # Overwrite the event with our new log event, in url-encoded format
      . = encode_key_value(new_log, field_delimiter:"&")
    "type": "remap"

Version

vector 0.34.1 (x86_64-unknown-linux-gnu 86f1c22 2023-11-16 14:59:10.486846964)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

infor7 commented 5 months ago

Below you can find some logs showing ~18 minutes of retrying

2024-04-03T08:59:18.819110Z  INFO source{component_kind="source" component_id=analytics_logs component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/analytics/download-proxy-84d5cb9b9-87676/download_proxy_service.log
2024-04-03T08:59:18.819188Z  INFO source{component_kind="source" component_id=analytics_logs component_type=file}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/analytics/download-proxy-84d5cb9b9-87676/download_proxy_service.log
2024-04-03T09:02:28.669163Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:02:29.963661Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:02:42.182268Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 4 times.
2024-04-03T09:02:42.182289Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:02:50.502963Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:03:03.796070Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 1 times.
2024-04-03T09:03:03.796091Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:03:25.083008Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:03:28.650273Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:03:36.869362Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 4 times.
2024-04-03T09:03:36.869388Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:03:42.183341Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:03:50.506986Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 1 times.
2024-04-03T09:03:50.507011Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:03:59.381079Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:04:03.913691Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 1 times.
2024-04-03T09:04:03.913711Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:04:25.201811Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:04:54.714999Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:04:59.490240Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] is being suppressed to avoid flooding.
2024-04-03T09:05:54.724402Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Internal log [Retrying after response.] has been suppressed 1 times.
2024-04-03T09:05:54.724426Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:06:24.015422Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:07:24.020481Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:08:48.357790Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:09:48.313974Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:12:41.748196Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:13:41.689694Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:18:59.120244Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7550}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:19:58.923110Z  WARN sink{component_kind="sink" component_id=analytics_bucket component_type=gcp_cloud_storage}:request{request_id=7551}: vector::sinks::util::retries: Retrying after response. reason=unauthorized internal_log_rate_limit=true
2024-04-03T09:30:09.686386Z  INFO source{component_kind="source" component_id=analytics_logs component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/analytics/download-proxy-84d5cb9b9-87676/download_proxy.log
2024-04-03T09:30:09.686495Z  INFO source{component_kind="source" component_id=analytics_logs component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/analytics/download-proxy-84d5cb9b9-87676/download_proxy_service.log