vectordotdev / vector


File to GCS event dropped #19988

Open braden00 opened 8 months ago

braden00 commented 8 months ago

Problem

I had been using Vector to send log files to GCP Cloud Storage for a month without any issues. Recently, however, the errors below started appearing and some logs were lost.

Expected behavior: Vector retries sending to GCS when a request fails, and no logs are dropped.

error1

{
  "span": {
    "name": "request",
    "request_id": 607
  },
  "error": "Failed to make HTTP(S) request: error writing a body to connection: Connection reset by peer (os error 104)",
  "message": "Unexpected error type; dropping the request.",
  "target": "vector::sinks::util::retries",
  "internal_log_rate_limit": true,
  "spans": [
    {
      "component_kind": "sink",
      "name": "sink",
      "component_id": "start_event",
      "component_type": "gcp_cloud_storage"
    },
    {
      "name": "request",
      "request_id": 607
    }
  ],
  "level": "ERROR",
  "timestamp": "2024-02-27T13:22:46.637798Z"
}

error2

{
  "span": {
    "request_id": 607,
    "name": "request"
  },
  "stage": "sending",
  "message": "Service call failed. No retries or retries exhausted.",
  "request_id": 607,
  "timestamp": "2024-02-27T13:22:46.643160Z",
  "spans": [
    {
      "name": "sink",
      "component_id": "start_event",
      "component_type": "gcp_cloud_storage",
      "component_kind": "sink"
    },
    {
      "name": "request",
      "request_id": 607
    }
  ],
  "internal_log_rate_limit": true,
  "error": "Some(CallRequest { source: hyper::Error(BodyWrite, Os { code: 104, kind: ConnectionReset, message: \"Connection reset by peer\" }) })",
  "target": "vector_common::internal_event::service",
  "level": "ERROR",
  "error_type": "request_failed"
}

error3

{
  "message": "Events dropped",
  "timestamp": "2024-02-27T13:22:46.643221Z",
  "count": 491,
  "target": "vector_common::internal_event::component_events_dropped",
  "internal_log_rate_limit": true,
  "reason": "Service call failed. No retries or retries exhausted.",
  "intentional": false,
  "span": {
    "name": "request",
    "request_id": 607
  },
  "spans": [
    {
      "component_id": "start_event",
      "name": "sink",
      "component_kind": "sink",
      "component_type": "gcp_cloud_storage"
    },
    {
      "request_id": 607,
      "name": "request"
    }
  ],
  "level": "ERROR"
}
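
To gauge how many events were lost, records like error3 can be tallied from Vector's JSON log output. The following is a minimal sketch, assuming the logs are available as one JSON object per line. The function name `count_dropped` is hypothetical and not part of Vector; it relies only on the fields visible in the records above (`message`, `count`, `intentional`, `spans`).

```python
import json
from collections import Counter


def count_dropped(lines):
    """Sum unintentionally dropped events per sink component_id.

    `lines` is any iterable of strings, each expected to hold one JSON
    log record (an assumption about how the logs were collected).
    """
    dropped = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        # Keep only "Events dropped" records that were not intentional.
        if record.get("message") != "Events dropped" or record.get("intentional"):
            continue
        # The component_id lives in the span that describes the sink.
        component = next(
            (s["component_id"] for s in record.get("spans", []) if "component_id" in s),
            "unknown",
        )
        dropped[component] += record.get("count", 0)
    return dropped
```

Feeding it the error3 record above would attribute 491 dropped events to the `start_event` sink.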

Configuration

data_dir: "/usr/share/vector/data"
sources:
  logs:
    type: "file"
    include:
      - "/log/*"
    ignore_older_secs: 7200
    offset_key: offset
  vector_metrics:
    type: "internal_metrics"
transforms:
  jsonParse:
    type: remap
    inputs:
      - logs
    source: |-
      .message = parse_json!(.message)
  eventFilter:
    type: filter
    inputs:
      - jsonParse
    condition: '.message.s_event=="START"'
sinks:
  prometheus:
    type: prometheus_exporter
    inputs:
      - vector_metrics
  start_event:
    type: gcp_cloud_storage
    inputs:
      - eventFilter
    bucket: <GCS BUCKET>
    encoding:
      codec: json
    batch:
      max_bytes: 268435488
      max_events: 40000
      timeout_secs: 30
    buffer:
      type: disk
      max_size: 2684354880
      when_full: block
    key_prefix: "start/date=%F/"
    filename_extension: ndjson
    framing:
      method: newline_delimited

Version

v0.34.0

Additional Context

Vector is running in GKE.

ghub-rn-1000 commented 7 months ago

Bumping this issue, as I'm also experiencing sporadic connection reset errors:

vector[2643]: 2024-04-01T14:07:20.823904Z ERROR sink{component_kind="sink" component_id=gcs1-kafka-out component_type=gcp_cloud_storage component_name=gcs1-kafka-out}:request{request_id=3192}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=Some(CallRequest { source: hyper::Error(Io, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) }) request_id=3192 error_type="request_failed" stage="sending" internal_log_rate_limit=true

I've tried multiple Vector versions, from 0.33 through 0.36.

It would be helpful for Vector to retry "connection reset by peer" errors. For instance, the Cloud Storage Java client library appears to include them in its retry policy:

https://cloud.google.com/storage/docs/retry-strategy#java
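
To illustrate the behavior being requested, here is a rough sketch of a retry loop that classifies connection resets as retriable and backs off exponentially. It is not Vector's retry logic and not a copy of the Java client's policy; `is_retriable`, `send_with_retry`, the set of errno values, and the backoff parameters are all assumptions chosen for illustration.

```python
import errno
import time

# Assumption: these OS-level errors are treated as transient.
# ECONNRESET is "Connection reset by peer" (os error 104 on Linux).
RETRIABLE_ERRNOS = {errno.ECONNRESET, errno.EPIPE, errno.ETIMEDOUT}


def is_retriable(exc):
    """Return True if the exception looks like a transient network error."""
    if isinstance(exc, (ConnectionResetError, BrokenPipeError, TimeoutError)):
        return True
    return isinstance(exc, OSError) and exc.errno in RETRIABLE_ERRNOS


def send_with_retry(send, max_attempts=5, initial_backoff=1.0,
                    max_backoff=30.0, sleep=time.sleep):
    """Call `send()` and retry transient failures with exponential backoff.

    Non-retriable errors, and the final failed attempt, are re-raised so
    the caller can decide what to do with the batch instead of losing it.
    """
    backoff = initial_backoff
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as exc:
            if attempt == max_attempts or not is_retriable(exc):
                raise
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

With a classification like this, the upload that failed with os error 104 would be attempted again rather than dropped on the first reset. Retrying an upload is only safe when the request is idempotent, which is something the sink would need to guarantee.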