vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Splunk_Hec_Logs Sink Error: "Service call failed. No retries or retries exhausted. / Events Dropped" error messages #20338

Open csongpaxos opened 7 months ago

csongpaxos commented 7 months ago


Problem

We are trying to send our kubernetes_logs into Splunk via the splunk_hec_logs sink. Some of these logs are sent correctly and arrive in Splunk. However, the Vector pod logs show errors about a service call failing with no retries or retries exhausted, and events being dropped when sending to the Splunk HEC endpoint.

error message:

2024-04-19T00:01:34.531938Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=1521}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=1521 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-04-19T00:01:34.531962Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=1521}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=567 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

There are no other error messages in the pod logs prior to this one that could give us a clue as to why the service call keeps failing. From the Splunk side, there are no errors for this HEC token either, so no clues there. We have hit a wall with debugging, since there are no additional error details in any of the pod logs even with the log level set to DEBUG / TRACE and RUST_BACKTRACE set to full.

We would like to know how else to troubleshoot this and whether it's a known problem with the Splunk sink. Other things we've tried are increasing the batch / buffer size, retry timeout, ack timeout, etc., but none of these config settings resolve the problem.
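
For reference, the kinds of settings we experimented with look roughly like the sketch below. The values are illustrative rather than our exact ones, and the option names follow the splunk_hec_logs sink documentation (batch, buffer, request retry/timeout, and indexer acknowledgement options):

splunk:
  type: splunk_hec_logs
  # ... endpoint / token / encoding as in the full config below ...
  batch:
    max_events: 1000              # illustrative batch size
    timeout_secs: 10              # illustrative batch timeout
  buffer:
    max_events: 20000             # illustrative buffer size
  request:
    timeout_secs: 60              # illustrative per-request timeout
    retry_max_duration_secs: 60   # illustrative cap on retry backoff
  acknowledgements:
    query_interval: 10            # seconds between indexer ack status queries
    retry_limit: 30               # max ack status queries per ack id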

Configuration

vector:
  enabled: true
  role: "Agent"
  image:
    repository: 111111111111.dkr.ecr.us-east-1.amazonaws.com/mirror/docker.io/timberio/vector
    tag: 0.37.1-debian
    pullPolicy: IfNotPresent
  service:
    ports:
    - name: api
      port: 8686
      protocol: TCP
      targetPort: 8686
  logLevel: "debug"
  env:
    - name: "RUST_BACKTRACE"
      value: "full"
  customConfig:
    data_dir: /var/lib/vector
    sources:
      all_kubernetes_logs:
        type: kubernetes_logs
        read_from: end
    transforms:
      trimMetadataForSplunk:
        type: remap
        inputs: ["all_kubernetes_logs"]
        source: |
          .container_id = .kubernetes.container_id
          .container_name = .kubernetes.container_name
          .pod_namespace = .kubernetes.pod_namespace
          del(.kubernetes)
          del(.timestamp_end)
          del(.stream)
          del(.source_type)
          .ts = .timestamp
    sinks:
      s3:
        type: aws_s3
        inputs:
          - trimMetadataForSplunk
        bucket: test-bucket
        key_prefix: |-
          {{ printf "%s{{ pod_namespace }}%s" "AWSLogs/111111111111/staging_kubernetes_logs/" "/%Y/%m/%d/" }}
        acl: bucket-owner-full-control
        encoding:
          codec: json
        compression: gzip
        region: us-east-1
      splunk:
        type: splunk_hec_logs
        inputs:
          - trimMetadataForSplunk
        compression: none
        endpoint: https://http-inputs-redacted.splunkcloud.com
        endpoint_target: event
        index: test
        default_token: XXX
        sourcetype: |-
          {{ print "{{ container_name }}" }}
        encoding:
          codec: json
        batch:
          timeout_secs: 5
        buffer:
          max_events: 10000

Version

0.37.1-debian

Debug Output

2024-04-18T23:53:41.682623Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
2024-04-18T23:53:41.682640Z DEBUG hyper::proto::h1::io: flushed 16194 bytes
2024-04-18T23:53:41.690330Z DEBUG hyper::proto::h1::io: parsed 8 headers
2024-04-18T23:53:41.690348Z DEBUG hyper::proto::h1::conn: incoming body is content-length (39 bytes)
2024-04-18T23:53:41.690374Z DEBUG hyper::proto::h1::conn: incoming body completed
2024-04-18T23:53:41.690403Z DEBUG sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=729}:http: hyper::client::pool: pooling idle connection for ("https", http-inputs-redacted.splunkcloud.com)
2024-04-18T23:53:41.690432Z DEBUG sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=729}:http: vector::internal_events::http_client: HTTP response. status=200 OK version=HTTP/1.1 headers={"content-type": "application/json; charset=UTF-8", "date": "Thu, 18 Apr 2024 23:53:41 GMT", "server": "Splunkd", "vary": "Authorization", "x-content-type-options": "nosniff", "x-frame-options": "SAMEORIGIN", "content-length": "39", "connection": "keep-alive"} body=[39 bytes]
2024-04-18T23:53:41.690493Z DEBUG vector::sinks::splunk_hec::common::acknowledgements: Stored ack id. ack_id=130
2024-04-18T23:53:41.690523Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=728}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] is being suppressed to avoid flooding.
2024-04-18T23:53:41.690551Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=728}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.
2024-04-18T23:53:42.259531Z DEBUG sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=730}:http: vector::internal_events::http_client: Sending HTTP request. uri=https://http-inputs-redacted.splunkcloud.com/services/collector/event method=POST version=HTTP/1.1 headers={"content-type": "application/json", "authorization": Sensitive, "x-splunk-request-channel": "4e6335f5-7839-4c4d-8043-9a1cab20ea44", "accept-encoding": "gzip", "user-agent": "Vector/0.37.1 (x86_64-unknown-linux-gnu cb6635a 2024-04-09 13:45:06.561412437)"} body=[471783 bytes]
2024-04-18T23:53:42.259588Z DEBUG sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=730}:http: hyper::client::pool: reuse idle connection for ("https", http-inputs-redacted.splunkcloud.com)
2024-04-18T23:53:42.259766Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
2024-04-18T23:53:42.259831Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
2024-04-18T23:53:42.259857Z DEBUG hyper::proto::h1::io: flushed 16384 bytes
2024-04-18T23:53:42.259879Z DEBUG hyper::proto::h1::io: flushed 16384 bytes

Example Data

No response

Additional Context

Vector is running in our EKS cluster on AWS. The s3 sink works fine with no errors, but the splunk sink shows periodic errors with no additional detailed messages. I've reached out in the Vector Discord channel and a developer mentioned "I would have expected to see another error before the retries exhausted error"; however, we are not seeing anything to work with for debugging.

References

No response

jszwedko commented 7 months ago

Thanks for filing this, @csongpaxos. Normally there is another ERROR level log before the "service call failed" logs that indicates the error that is not being retried. It sounds like you aren't seeing that, though?

csongpaxos commented 7 months ago

Hi @jszwedko, no. We have tried the "debug" and "trace" log levels, enabling internal metrics / internal logs, and tweaking the buffering/batching settings, but haven't been able to surface any additional errors indicating the "real" failure. We're stuck now on how to proceed / debug further without anything to work with.

Have you seen this before with the Splunk sink specifically when log volume is high?
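
For completeness, the internal logs / metrics wiring we enabled is roughly the minimal sketch below (component IDs here are illustrative; internal_logs and internal_metrics are Vector's built-in sources):

sources:
  vector_internal_logs:        # Vector's own log events
    type: internal_logs
  vector_internal_metrics:     # Vector's own metrics
    type: internal_metrics
sinks:
  internal_logs_console:       # print internal logs to stdout for inspection
    type: console
    inputs: ["vector_internal_logs"]
    encoding:
      codec: json
  internal_metrics_prom:       # expose internal metrics for scraping
    type: prometheus_exporter
    inputs: ["vector_internal_metrics"]
    address: "0.0.0.0:9598"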

metalbreeze commented 7 months ago

I have the same error with nginx log -> vector -> vector -> clickhouse, but the other scenarios work fine:

  1. nginx log -> vector -> central vector -> output.file1
  2. nginx log -> vector -> output.file2
  3. nginx log -> vector -> clickhouse

PS: output.file1 and output.file2 match.

Here is the error message:

2024-04-21T03:33:37.216528Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 404 Not Found" internal_log_rate_limit=true
2024-04-21T03:33:37.216581Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=1 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-04-21T03:33:37.216609Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

Here are the vector1 and vector2 configs (the vector2 config starts at the second sources: block):

data_dir: "/var/lib/vector"

api:
  enabled: true
  address: "0.0.0.0:8686"

sources:
  nginx_logs:
    type: "file"
    include:
      - "/var/log/nginx/*.log" # supports globbing
    ignore_older_secs: 86400     # 1 day

transforms:
  nginx_parser:
    inputs:
      - "nginx_logs"
    type: "remap"
    source: |
      .message=parse_nginx_log!(.message,"combined")
      .body=.message
  vc_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src="vc"
  ck_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src="ck"
  file_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src="file"

sinks:
  my_vector:
    type: vector
    inputs:
      - vc_parser
    address: 192.168.111.25:6000
  clickhouse:
    type: "clickhouse"
    database : "signoz_logs"
    table : "access_logs_2"
    inputs:
      - ck_parser
    skip_unknown_fields : true
    endpoint:
      "http://testvector2:8123"
  my_file:
    type: file
    inputs:
      - file_parser
    path: /opt/vector/logs/vector-%Y-%m-%d.log
    encoding:
      codec: logfmt
sources:
  my_source_id:
    type: "vector"
    address: "0.0.0.0:6000"
    version: "2"
sinks:
  clickhouse_2:
    type: "clickhouse"
    database : "signoz_logs"
    table : "access_log_2"
    inputs:
      - my_source_id
       #format: json_each_row
    skip_unknown_fields : true
    endpoint:
      "http://192.168.111.25:8123"
  my_sink_id:
    type: file
    inputs:
       - my_source_id
    path: /opt/vector/logs/vector-%Y-%m-%d.log
    encoding:
      codec: logfmt

metalbreeze commented 7 months ago

Fixed by adding the following batch setting, timeout_secs: 30, to the sink:

  clickhouse_2:
    type: "clickhouse"
    database : "signoz_logs"
    table : "access_logs_2"
    inputs:
      - my_source_id
       #format: json_each_row
    skip_unknown_fields : true
    batch:
        timeout_secs: 30

andywatts commented 7 months ago

@jszwedko The error message has the field "error=None"..?

error=None request_id=1 error_type="request_failed" stage="sending"

Here's a link to the error in trace logs

csongpaxos commented 7 months ago

Adding the

    batch:
        timeout_secs: 30

to my Splunk sink does not appear to make a difference; I'm still seeing the same service call failed / retries exhausted error with no additional errors.

2024-04-22T17:55:32.314290Z DEBUG hyper::client::pool: pooling idle connection for ("https", http-inputs-XXX.splunkcloud.com)
2024-04-22T17:55:32.314359Z DEBUG vector::sinks::splunk_hec::common::acknowledgements: Stored ack id. ack_id=118
2024-04-22T17:55:32.314400Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 12 times.
2024-04-22T17:55:32.314413Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=555 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-04-22T17:55:32.314437Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] has been suppressed 12 times.
2024-04-22T17:55:32.314443Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=626 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true

Cbeck527 commented 6 months ago

Adding more info in case it's helpful: I was seeing this error with our setup, which is almost identical to the OP's: k8s pods → DataDog agent → Vector → Splunk HEC. Some events were flowing into Splunk, but I wasn't able to figure out any kind of pattern for the errors.

While playing around with the settings, the error disappeared when I disabled acknowledgements on the sink:

...
splunk_eks:
  type: splunk_hec_logs
  endpoint: "${SPLUNK_CLOUD_HTTP_ENDPOINT}"
  default_token: "${SPLUNK_CLOUD_TOKEN}"
  acknowledgements:
    enabled: false
    indexer_acknowledgements_enabled: false
...

Perhaps the service call failure is related to the acknowledgement piece? As best I can tell, our volume of events is the same whether ACKs are on or off... so either the events were always getting there (and the error is on the ACK), or they were never getting there.
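
If it really is the ack polling that's failing rather than the event delivery itself, the acknowledgement behaviour can also be tuned instead of turned off entirely. A sketch, assuming the documented splunk_hec_logs acknowledgement options; the values here are illustrative, not something we've validated:

splunk_eks:
  type: splunk_hec_logs
  endpoint: "${SPLUNK_CLOUD_HTTP_ENDPOINT}"
  default_token: "${SPLUNK_CLOUD_TOKEN}"
  acknowledgements:
    indexer_acknowledgements_enabled: true
    query_interval: 30   # poll the Splunk ack endpoint less often (illustrative)
    retry_limit: 60      # allow more status checks before dropping an ack id (illustrative)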

To add another data point, we're only seeing this error on 1 of our 7 clusters. Vector is set up identically everywhere, with the only difference being the default_token used.