raghu999 opened this issue 1 year ago
Thanks for this report @raghu999! Did you share the right logs, though? The error logs there seem to be showing an error with the disk buffers. Do you have logs showing the panic in the http_server source?
@jszwedko Updated the issue. We are only seeing the buffer error now, but it is the same dataset that produced an error when using the Splunk HEC source; after switching to the HTTP server source we are seeing this new, buffer-related error. We confirmed that the disk buffer is not full when this happens.
Gotcha, thanks for clarifying @raghu999! It sounds more like a bug in the disk buffer implementation than the http source, then.
This error comes from our limit on the maximum record size allowed in a disk buffer, which is 128MB. A record is essentially a chunk of events written together, which, based on the sources used in the given configuration, would be all events decoded from a single request.
This implies that if the error above is being hit, a single request sent to either the splunk_hec or http_server source is very large, something like at least 90-100MB: each decoded event adds some internal overhead, so the request doesn't need to be 128MB by itself.
Based on your knowledge of the clients sending to your Vector instance, is it possible for requests to be that large (90-100MB)?
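For anyone who wants to check this on their side, one option is to tag events with a rough serialized size in a remap transform placed right after the source. This is only a diagnostic sketch: the component ids below are made up, and a per-event estimate does not capture the whole-request record that the disk buffer actually encodes, so treat it as a lower bound.
transforms:
  measure_event_size:            # hypothetical transform id
    type: remap
    inputs:
      - http_in                  # hypothetical http_server source id
    source: |-
      # Approximate size of the event serialized as JSON, in bytes. The disk
      # buffer writes all events from one request as a single record, so the
      # real record can be much larger than any individual event.
      .approx_size_bytes = length(encode_json(.))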
Hi everyone! Facing the same issue here:
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.412419Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.414728Z ERROR transform{component_kind="transform" component_id=parsing_transform_new component_type=remap component_name=parsing_transform_new}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
Sep 17 19:40:46 vector-2 vector[633402]: 2023-09-17T19:40:46.415193Z INFO vector: Vector has stopped.
This is the transform:
parsing_transform_new:
  type: remap
  inputs:
    - profile_route.parsing
  source: |-
    if (springboot_match, err = parse_regex(.message, r'^(?P<timestamp>\d+(?:-\d+){2}\s+\d+(?::\d+){2}\.\d+)\s+(?P<log_level>\S+)\s+(?P<pid>\d+)\s+---\s+\[(?P<thread_name>[\s+\S]*?)\]\s+(?P<logger_name>\S+)\s+:\s+(?s)(?P<message>.*)'); err == null) { # java_springboot
      .timestamp = parse_timestamp!(springboot_match.timestamp, "%F %T.%3f")
      .log_level = downcase!(springboot_match.log_level)
      .pid = to_int!(springboot_match.pid)
      .thread_name = strip_whitespace!(springboot_match.thread_name)
      .logger_name = springboot_match.logger_name
      .message = to_string(springboot_match.message)
    } else if (python_match, err = parse_regex(.message, r'^\[(?P<timestamp>\d+-\d+-\d+\s\d+:\d+:\d+,\d+)\s(?P<log_level>\w+)\s(?P<logger_name>\S+)\s(?P<thread_name>\w+\S+\s\S+)]\s(?s)(?P<message>.*)'); err == null) { # python
      .timestamp = parse_timestamp!(python_match.timestamp, "%F %T,%3f")
      .log_level = downcase!(python_match.log_level)
      .logger_name = python_match.logger_name
      .thread_name = strip_whitespace!(python_match.thread_name)
      .message = to_string(python_match.message)
    } else if (vault_match, err = parse_regex(.message, r'^(?P<timestamp>\d+-\d+-\d+T\d+:\d+:\d+.\d+Z)\s\[(?P<log_level>\w+)]\s+(?s)(?P<message>.*)$'); err == null) { # vault
      .timestamp = vault_match.timestamp
      .log_level = downcase!(vault_match.log_level)
      .message = to_string(vault_match.message)
    } else {
      .malformed = true
    }
In my case it is unlikely for requests to be even close to 90-100MB. Vector version: v0.32.0.
We're seeing this error again with a different application. Some relevant log lines:
vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2023-11-15T23:45:07.499790Z DEBUG transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2023-11-15T23:45:07.499819Z ERROR transform{component_kind="transform" component_id=remap component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2023-11-15T23:45:07.499874Z INFO vector: Vector has stopped.
2023-11-15T23:45:07.499934Z DEBUG vector_buffers::variants::disk_v2::writer: Writer marked as closed.
2023-11-15T23:45:07.500314Z DEBUG source{component_kind="source" component_id=internal_logs component_type=internal_logs}: vector::topology::builder: Source finished normally.
The source is http_server. The symptom is that memory utilization in the pod grows to the point of being OOM-killed. But given the 'Source finished normally' line above, Vector might just have exited?
Version we used: v0.34.0
A few questions: is the remap transform also accessing the disk buffer while doing the transform? And what does "Synchronous transform finished with an error." mean (what function call might trigger this error message)? Or maybe: are all transforms synchronous, and what characterizes a "synchronous" transform?
I am seeing similar issues:
2024-02-20T09:21:25.663269Z ERROR transform{component_kind="transform" component_id=kubernetes_application component_type=remap component_name=kubernetes_application}: vector::topology: An error occurred that Vector couldn't handle: failed to encode record: BufferTooSmall.
2024-02-20T09:21:25.663344Z INFO vector: Vector has stopped.
2024-02-20T09:21:25.663365Z ERROR transform{component_kind="transform" component_id=route_logs component_type=route component_name=route_logs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.676755Z ERROR transform{component_kind="transform" component_id=containerlogs component_type=remap component_name=containerlogs}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-02-20T09:21:25.680198Z INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="splunk_kube_audit, fluentbit, kubernetes_application, prometheus_exporter, splunk_container_logs, kubernetes_audit, httpd, vector_metrics, kubernetes_infrastructure, route_logs, journal_logs, containerlogs" time_remaining="59 seconds left"
The containers exit under a certain load, around 10k events/sec.
@awangc have you managed to find the root cause or a workaround? I've seen similar issues while sending logs from upstream Vector pods that are part of OpenShift, where the default batch size is 10MB, I believe. I've been testing by sending just generic nginx logs, so really small events. Under a certain load, Vector would exit with the same buffer-related error. I have observed that the issue does not happen when the batch size on the upstream Vector is set to 1MB (10 times less).
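For reference, the workaround described above roughly corresponds to lowering batch.max_bytes on the upstream Vector's sink, something like the sketch below (the sink id, input, address, and exact values are illustrative, not taken from the OpenShift defaults):
sinks:
  downstream_vector:            # hypothetical sink id on the upstream instance
    type: vector
    inputs:
      - kubernetes_logs         # hypothetical input
    address: vector-2.example.com:6000   # illustrative address
    batch:
      max_bytes: 1000000        # ~1MB instead of the ~10MB default mentioned above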
@jszwedko is there any doc on how transforms relate to disk buffers?
Transforms emit data to an output channel which is then consumed by buffers. Is that the sort of detail you were looking for? Or do you have a different sort of question?
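To make that concrete: the disk buffer itself is configured on the sink that consumes the transform's output, which is presumably why the writer error shows up inside the transform's log span. A minimal sketch of that wiring (the sink id, endpoint, token, and sizes are illustrative):
sinks:
  splunk_out:                       # hypothetical sink id
    type: splunk_hec_logs
    inputs:
      - parsing_transform_new       # the remap transform shown earlier in this thread
    endpoint: https://splunk.example.com:8088   # illustrative
    default_token: "${SPLUNK_HEC_TOKEN}"        # illustrative
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 536870912           # bytes (512 MiB), illustrative
      when_full: block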
For this issue, I know it is a lot to ask, but if anyone is able to create a standalone reproduction, that would aid with debugging.
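As a possible starting point for such a reproduction, and following the explanation above that a single oversized request can exceed the 128MB record limit, one untested sketch would be an http_server source feeding a disk-buffered blackhole sink, then POSTing a single very large JSON body to it (the ids, port, and sizes here are made up, and I have not verified that this triggers BufferTooSmall):
sources:
  http_in:                      # hypothetical source id
    type: http_server
    address: 0.0.0.0:8080
    decoding:
      codec: json
sinks:
  drop_everything:              # hypothetical sink id
    type: blackhole
    inputs:
      - http_in
    buffer:
      type: disk
      max_size: 536870912       # bytes, illustrative
      when_full: block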
Problem
We had a similar issue when using the Splunk HEC source and raised bug report #17670. We then started using the HTTP source and are now seeing a buffer error with it that causes Vector to stop ingesting any new data; the containers enter a restart loop with OOM errors on Kubernetes, and in Vector we see the error below.
K8s container:
Vector error:
Configuration
Version
0.31.x
Debug Output
No response
Example Data
No response
Additional Context
Vector is running in Kubernetes, and this specific client has large payloads with close to 6000-8000 fields across their entire dataset.
References
#17670: Faced a similar issue with the Splunk HEC source; we moved to the HTTP source and are seeing a new error.