csongpaxos opened 7 months ago
Thanks for filing this @csongpaxos. Normally there is another ERROR-level log before the "service call failed" logs that indicates the error that is not being retried. It sounds like you aren't seeing that, though?
Hi @jszwedko, no, we have tried the "debug" and "trace" log levels, enabled internal metrics / internal logs, and tweaked the buffering/batching settings, but haven't been able to get any additional errors indicating the "real" error. We're just stuck now on how to proceed / debug further without anything to work with.
Not sure if you have seen this before with the Splunk sink specifically when log volume is high?
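For reference, the internal observability wiring we enabled looked roughly like the sketch below (component names here are illustrative, not our exact config, using Vector's internal_logs / internal_metrics sources with console / prometheus_exporter sinks):

sources:
  vector_internal_logs:
    type: internal_logs        # Vector's own log events
  vector_internal_metrics:
    type: internal_metrics     # Vector's own metrics, e.g. component_errors_total
sinks:
  internal_logs_console:
    type: console              # print internal logs to stdout for inspection
    inputs:
      - vector_internal_logs
    encoding:
      codec: json
  internal_metrics_prom:
    type: prometheus_exporter  # expose internal metrics for scraping
    inputs:
      - vector_internal_metrics
    address: "0.0.0.0:9598"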
I have the same error with an nginx log -> vector -> vector -> clickhouse pipeline, but other scenarios work fine.
Here is the error message:
2024-04-21T03:33:37.216528Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector::sinks::util::retries: Not retriable; dropping the request. reason="response status: 404 Not Found" internal_log_rate_limit=true
2024-04-21T03:33:37.216581Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=1 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-04-21T03:33:37.216609Z ERROR sink{component_kind="sink" component_id=clickhouse_2 component_type=clickhouse}:request{request_id=1}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
Here are the vector1 and vector2 configs:
# vector1
data_dir: "/var/lib/vector"
api:
  enabled: true
  address: "0.0.0.0:8686"
sources:
  nginx_logs:
    type: "file"
    include:
      - "/var/log/nginx/*.log" # supports globbing
    ignore_older_secs: 86400 # 1 day
transforms:
  nginx_parser:
    inputs:
      - "nginx_logs"
    type: "remap"
    source: |
      .message = parse_nginx_log!(.message, "combined")
      .body = .message
  vc_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src = "vc"
  ck_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src = "ck"
  file_parser:
    inputs:
      - nginx_parser
    type: "remap"
    source: |
      .body.src = "file"
sinks:
  my_vector:
    type: vector
    inputs:
      - vc_parser
    address: 192.168.111.25:6000
  clickhouse:
    type: "clickhouse"
    database: "signoz_logs"
    table: "access_logs_2"
    inputs:
      - ck_parser
    skip_unknown_fields: true
    endpoint: "http://testvector2:8123"
  my_file:
    type: file
    inputs:
      - file_parser
    path: /opt/vector/logs/vector-%Y-%m-%d.log
    encoding:
      codec: logfmt
# vector2
sources:
  my_source_id:
    type: "vector"
    address: "0.0.0.0:6000"
    version: "2"
sinks:
  clickhouse_2:
    type: "clickhouse"
    database: "signoz_logs"
    table: "access_log_2"
    inputs:
      - my_source_id
    # format: json_each_row
    skip_unknown_fields: true
    endpoint: "http://192.168.111.25:8123"
  my_sink_id:
    type: file
    inputs:
      - my_source_id
    path: /opt/vector/logs/vector-%Y-%m-%d.log
    encoding:
      codec: logfmt
Fixed by adding the following batch setting (timeout_secs: 30):
clickhouse_2:
  type: "clickhouse"
  database: "signoz_logs"
  table: "access_logs_2"
  inputs:
    - my_source_id
  # format: json_each_row
  skip_unknown_fields: true
  batch:
    timeout_secs: 30
@jszwedko, the error message has the field error=None..?
error=None request_id=1 error_type="request_failed" stage="sending"
Here's a link to the error in trace logs
Adding the

batch:
  timeout_secs: 30

to my Splunk sink doesn't appear to make a difference; I'm still seeing the same "service call failed / retries exhausted" error with no additional errors.
2024-04-22T17:55:32.314290Z DEBUG hyper::client::pool: pooling idle connection for ("https", http-inputs-XXX.splunkcloud.com)
2024-04-22T17:55:32.314359Z DEBUG vector::sinks::splunk_hec::common::acknowledgements: Stored ack id. ack_id=118
2024-04-22T17:55:32.314400Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::service: Internal log [Service call failed. No retries or retries exhausted.] has been suppressed 12 times.
2024-04-22T17:55:32.314413Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::service: Service call failed. No retries or retries exhausted. error=None request_id=555 error_type="request_failed" stage="sending" internal_log_rate_limit=true
2024-04-22T17:55:32.314437Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] has been suppressed 12 times.
2024-04-22T17:55:32.314443Z ERROR sink{component_kind="sink" component_id=splunk component_type=splunk_hec_logs}:request{request_id=555}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=626 reason="Service call failed. No retries or retries exhausted." internal_log_rate_limit=true
Adding more info in case it's helpful: I was seeing this error with our setup, which is almost identical to the OP's: k8s pods → DataDog agent → Vector → Splunk HEC. I was seeing some events flow into Splunk, but wasn't able to figure out any kind of pattern for the errors.
While playing around with the settings, the error disappeared when I disabled acknowledgements on the sink:
...
splunk_eks:
  type: splunk_hec_logs
  endpoint: "${SPLUNK_CLOUD_HTTP_ENDPOINT}"
  default_token: "${SPLUNK_CLOUD_TOKEN}"
  acknowledgements:
    enabled: false
    indexer_acknowledgements_enabled: false
...
Perhaps the service call failure is related to the acknowledgement piece? Best I can tell, our volume of events is the same whether ACKs are on or off... so either the events were always getting there (and the error is on the ACK), or they were never getting there.
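If disabling acks turns out to be undesirable, another variation that could be tried is keeping indexer acknowledgements on but loosening the polling, roughly as sketched below (option names follow the splunk_hec_logs acknowledgements settings; the values are illustrative and I have not confirmed this avoids the error):

splunk_eks:
  type: splunk_hec_logs
  endpoint: "${SPLUNK_CLOUD_HTTP_ENDPOINT}"
  default_token: "${SPLUNK_CLOUD_TOKEN}"
  acknowledgements:
    indexer_acknowledgements_enabled: true
    query_interval: 30       # seconds between ack status queries (illustrative)
    retry_limit: 60          # ack status polls before giving up (illustrative)
    max_pending_acks: 1000000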
To add another data point, we're only seeing this error on 1 of our 7 clusters. Vector is set up identically everywhere, with the only difference being the default_token used.
Problem
We are trying to send our kubernetes_logs into Splunk via the splunk_hec_logs sink. Some of these logs are sending correctly and arriving in Splunk. However, we are seeing pod logs with error messages about a service call failing / no retries or retries exhausted, and events being dropped when sending to the Splunk HEC endpoint.
Error message:
There are no other error messages in the pod logs prior to this one that could give us a clue as to why the service call keeps failing. From the Splunk side of things, there are no errors for this HEC token, so there are no clues there either. We have hit a wall with debugging since there are no further error logs in any of the pods, even with the log level set to DEBUG / TRACE and RUST_BACKTRACE set to full.
We would like to know how else to troubleshoot this and whether it's a known problem with the Splunk sink. Other things we've tried are increasing the batch / buffer size, retry timeout, ack timeout, etc., but none of these config settings appear to resolve the problem.
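For reference, the kinds of settings we experimented with looked roughly like the sketch below (option names follow Vector's generic sink batch / buffer / request options; the values are illustrative, not our exact config):

splunk:
  type: splunk_hec_logs
  # endpoint / default_token omitted
  batch:
    max_events: 1000             # flush after this many events...
    timeout_secs: 30             # ...or after this many seconds
  buffer:
    type: memory
    max_events: 10000            # events buffered before applying backpressure
  request:
    timeout_secs: 60             # per-request timeout
    retry_attempts: 10           # cap retries instead of retrying indefinitely
    retry_max_duration_secs: 300 # cap the wait between retries
  acknowledgements:
    indexer_acknowledgements_enabled: true
    query_interval: 30           # "ack timeout"-style tuning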
Configuration
Version
0.37.1-debian
Debug Output
Example Data
No response
Additional Context
Vector is running in our EKS cluster on AWS. The S3 sink works fine with no errors, but the Splunk sink is showing periodic errors with no additional detailed messages. I've reached out in the Vector Discord channel and a developer mentioned "I would have expected to see another error before the retries exhausted error"; however, we are not seeing anything to work with in debugging.
References
No response