danthegoodman1 opened 2 years ago
I will add that restarting Vector seems to resolve the issue, but that's not a sustainable solution.
It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days, when we go to look at logs and there are none.
Hi @danthegoodman1!
It seems like some of the incoming events are not JSON. I'd suggest augmenting your remap
program to log the events that fail to parse as JSON, like this:
.temp, err = parse_json(.message)
if err != null {
log("failed to parse .message as JSON: " + .message)
} else {
// message successfully parsed as JSON
}
As an aside, you are currently parsing .message
as JSON twice. I'd recommend changing:
if !exists(parse_json!(.message).level) {
to:
if !exists(.temp.level) {
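Roughly, a minimal sketch combining both suggestions might look like the following (the string() fallback is just one way to keep the log() argument infallible, not something from your existing config):
.temp, err = parse_json(.message)
if err != null {
  # coerce to a string first so the "+" concatenation below cannot fail
  msg_str = string(.message) ?? "non-string message"
  log("failed to parse .message as JSON: " + msg_str)
} else {
  # reuse the already-parsed object instead of calling parse_json! again
  if !exists(.temp.level) {
    .level = "other"
  } else {
    .level = .temp.level
  }
  del(.temp)
}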
Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup messages), but the vast majority are JSON logs, and those are the ones not coming through.
Will add the suggestions in the meantime
@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?
No, that is what is so strange @jszwedko; it makes it really hard to debug 🙃
Also getting this, trying to work through why this is happening:
error[E630]: fallible argument
┌─ :3:7
│
3 │ log("failed to parse .message as JSON: " + .message)
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
│ │
│ this expression can fail
│ handle the error before passing it in as an argument
│
= see documentation about error handling at https://errors.vrl.dev/#handling
= see language documentation at https://vrl.dev
Will update with the solution when I find it
Updated to
log("failed to parse .message as JSON: " + .message ?? "no mesage!")
Here are some updated logs; it seems like most of the non-JSON logs are from the pods we don't control in EKS :P
2022-03-29T18:27:59.052628Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:27:49.577Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:27:49 [DEBUG] memberlist: Initiating push/pull sync with: 01FYC8DZXMP6ENGHGKM4PBK61Q 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:27:59.052677Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:12.427827Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:12.427844Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:02 http: TLS handshake error from 10.0.167.22:11247: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:12.427861Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:17.165804Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:17.165830Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:17 http: TLS handshake error from 10.0.167.22:60721: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164409Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:15.586Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:15 [DEBUG] memberlist: Stream connection from=10.0.167.205:34340\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164463Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:27.218994Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:27.219014Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:22 http: TLS handshake error from 10.0.167.22:27699: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171126Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:27 http: TLS handshake error from 10.0.167.22:34561: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171168Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:38.940324Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/log/pods/eth-scanner_eth-scanner-27476308-hbml9_05116106-4a27-49a2-ba6e-c811190e304a/eth-scanner/0.log
2022-03-29T18:28:39.221000Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/eth-scanner_eth-scanner-27476306-gkzmk_0a02fc81-f83a-4d3a-925b-37febe1dab62/eth-scanner/0.log
2022-03-29T18:28:47.167084Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:47.167106Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:47 http: TLS handshake error from 10.0.167.22:48077: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:47.234267Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:52.426194Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:52.426218Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:52 http: TLS handshake error from 10.0.167.22:62639: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187233Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:49.579Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:49 [DEBUG] memberlist: Initiating push/pull sync with: xxx 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187428Z INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.
@jszwedko one thing I am noticing is that we don't get the logs about Vector finding new files. This is easier to reproduce on things that come and go relatively quickly, like cron jobs (we leave the completed pod around so logs can be collected).
One thing to add: we checked the mounted /var/logs and the files with the missing logs are there; it seems like the event for the new files just doesn't get triggered. We're running on AWS EKS, and /proc/sys/fs/inotify/max_user_watches is 524288.
@danthegoodman1 If new log files are not being "found" in time, maybe try reducing glob_minimum_cooldown_ms? It defaults to 60000 ms in the kubernetes_logs source, but 1000 ms in the file source.
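For example, a minimal sketch of that setting on the source (the 5000 ms value is only an illustration, not a recommendation):
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # default for this source is 60000 ms; lower it so new and rotated
    # log files are discovered sooner
    glob_minimum_cooldown_ms: 5000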
@wtchangdm will try, but it's the case for long-running pods too, where it just stops recording logs for those pods. Those files definitely stay around longer than 60s.
@danthegoodman1 BTW, there is an issue I am also tracking: https://github.com/vectordotdev/vector/issues/8616. Not sure if it's related, though.
Interesting, but we don't get any warnings or errors like that issue does.
@danthegoodman1 could you try with the latest release, 0.21.1? The kubernetes_logs source was rewritten there to use the community-supported kube-rs crate to interact with Kubernetes, which we expect may have fixed a few issues.
Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.
@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to 8000 ms, but admittedly we also have not been in our logs as much.
Will update now and see how it works
@jszwedko This did not go well; it was printing this extremely quickly:
2022-04-25T18:40:35.870916Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.871396Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.872182Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.873367Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.874306Z WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
~Ah ok I needed to update the cluster role tags too :)~ still getting it...
The ClusterRole needs to allow for the list verb as well now; see here.
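For reference, a minimal sketch of the RBAC rules implied by the errors above (the resource set here is an assumption based on those 403s; check the official chart/manifests for the authoritative list):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vector
rules:
  - apiGroups: [""]
    # pods and namespaces are what the 403s above complain about;
    # nodes may also be needed depending on your setup
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]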
yep, just made that change :)
After sending ~50 logs/s I can see the following errors show up, then logs stop shipping. It prints a ton of these error logs.
It is able to ship them for about 30 seconds before it stops sending logs.
2022-04-26T13:21:29.462788Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Writing slot 131412285 for consumer solanascanner-1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.450Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.45062706Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.462766592Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.497575Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Wrote offset in 35.128858ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.485Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.485822529Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.497555692Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505901Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got slots in 167.388115ms got slots [131412285], len 1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502718488Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505885264Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505972Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got same top slot 131412285, sleeping for 250ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502731568Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505966965Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.576577Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log
It does not affect watching other pods, and CPU never exceeds ~20 mCPU and memory never exceeds 33 MB for the Vector pods.
Seems related to #10122 (comment)
Possibly - the other explanation would be that solanascanner was deleted from the cluster before all of its log lines were processed. When that happens, delay_deletion_ms comes into play, as it determines how long we hold onto old metadata before dumping it ourselves. Increasing that value will increase memory usage (however much it takes to hold onto the resource metadata), but allows more time to enrich logs from deleted pods.
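For example, a sketch of what that might look like (the value is arbitrary; it trades memory for how long logs from deleted pods can still be enriched):
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # hold on to metadata for deleted pods longer so their remaining
    # log lines can still be annotated
    delay_deletion_ms: 300000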
It seems reducing the glob cooldown to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation.
Still getting the errors, and the pods are not being removed, logs are coming in much slower now as well
Can you share the setup you're using so I can reproduce?
Dropped the glob cooldown to 2s and it seems to be holding steady so far. Here is the setup; otherwise it is the default k8s deployment:
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.21.1-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 2000
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp, err = parse_json(.message)
          if err != null {
            .json_parse_err = err
          } else {
            if !exists(.temp.level) {
              .level = "other"
            } else {
              .level = .temp.level
            }
            if exists(.temp.uri) {
              .uri = .temp.uri
            }
            if exists(.temp.msg) {
              .msg = .temp.msg
            }
            if exists(.temp.lat) {
              .lat = .temp.lat
              .lon = .temp.lon
            }
            del(.temp)
          }
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}
@danthegoodman1 are you running something like flog to generate load/logs, or is this with your actual applications?
actual app
I am currently experiencing the same issue. We had previously downgraded Vector due to the Desync errors described in https://github.com/vectordotdev/vector/issues/8616.
After upgrading to 0.21.0 and then 0.21.1, it seems the Desync errors have completely gone away. However, after a period of time, logs for certain workloads silently stop getting picked up and we see a large increase in vector_component_errors_total for the kubernetes_logs source, with the majority of the errors being reader_failed.
Performing a rolling restart of the Vector daemonset always seems to restore things to a better state (indicated by the gaps in the plot below); however, after a while (about once a day or so) the errors return with no real indication aside from the internal metrics.
We have a pretty simple config for the source, and have incorporated the suggestions about glob_minimum_cooldown_ms and delay_deletion_ms described in various issues.
kubernetes_logs:
  auto_partial_merge: true
  fingerprint_lines: 1
  # set to lower value than default 60s
  # to better handle high log volume / rotation
  # see https://github.com/vectordotdev/vector/issues/10122#issuecomment-977815309
  glob_minimum_cooldown_ms: 5000 # 5 seconds
  # increase delay deletion to allow vector
  # to continue processing logs for deleted/scaled down pods
  delay_deletion_ms: 3600000
  self_node_name: ${VECTOR_SELF_NODE_NAME}
  type: kubernetes_logs
@treethought would you be able to share the logs associated with those errors? They could provide additional clues.
We're experiencing the same problem. The time when the logs stop shipping correlates with annotation_failed errors:
2022-05-30T04:31:09.092406Z ERROR source{component_kind="source" component_id=kubernetes_logs
component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs:
Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file":
Bytes(b"/var/log/pods/*redacted*/0.log"), "message": Bytes(b"2022-05-30T11:31:09.04624439+07:00 stderr F
*redacted*"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-05-30T04:31:09.092358889Z)}),
metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]),
schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } })
error_code="annotation_failed" error_type="reader_failed" stage="processing"
What is your delay_deletion_ms currently configured to?
The default value.
Has the pod in the error log been deleted/terminated from the cluster? If so, you'll want to increase delay_deletion_ms to account for processing of logs that may remain after the pod has been terminated.
It hasn't. The pod was alive for at least an hour after these errors, and no logs were received after that. Everything caught up after a Vector restart.
The same Vector agent continued to work fine with other pods.
Is there any sort of additional telemetry we could enable to help debug this? We're encountering this problem every day.
In the meantime we could probably introduce a liveness probe triggered by the error count metric.
Definitely seems like we need additional telemetry to help debug this; I opened https://github.com/vectordotdev/vector/issues/12909
@jszwedko sorry for the delayed response. I don't have any logs associated with the errors at this time, but would like to add that my experience aligns with @aikoven's description. This typically occurs only for a handful of actively running pods, while others continue to have logs picked up fine.
Hi, I faced the same issue with the kubernetes_logs source. My guess is that the Vector agent never releases the file descriptors when some log files are deleted.
I worked around it by restarting the agents and aggregators.
Does anyone know the answer to this problem? Or is there any option for the kubernetes_logs source like remove_after_secs in the file source?
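For comparison, this is roughly what that option looks like on the file source (the source name and path here are hypothetical; the kubernetes_logs source doesn't expose an equivalent as far as I can tell):
sources:
  app_logs:
    type: file
    include:
      - /var/log/app/*.log
    # file source only: delete a file this many seconds after reaching EOF
    remove_after_secs: 86400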
I think this is related to #8101.
Hi, I've encountered a similar issue (vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata.), and the root cause seemed to be the pod name, which was the same as that of a previous pod that had been recreated.
It looks like Vector deletes the old pod name from the metadata cache when the timeout hits, but since it is the same as the name of the newly created pod, the new pod no longer has its name in the metadata cache and fails with that error.
I think some retry of pod metadata retrieval is necessary when it's missing from the cache.
Similar sentiment after reading @imcheck's and @Genry1's comments: in my case, it happens with a bunch of StatefulSets every time there's a rolling upgrade, and of course the pod name is reused. It might be related to the lock on the file descriptor opened by the vector-agent container at the host level. I cannot think of any workaround other than deleting the agent on the node whenever I spot this occurring.
We also see this issue if a pod is recreated on the same node with the same name. It's pretty much a show stopper for us.
Thanks for the additional context @kitos9112 @Genry1 and @lpugoy ! I split this off to a separate issue since it seems to be different from the original report here. See: https://github.com/vectordotdev/vector/issues/13467
@jszwedko We also faced this issue after we started rolling Vector out to production. We identified that the safest version to use with the kubernetes_logs source currently is 0.19.3. If we upgrade it to 0.2x.x we immediately start facing two issues:
With the newer versions (0.20.x), the only working solution right now is to restart the Vector pods after one of these issues starts happening; this helps in both cases. Both issues make Vector completely unusable in production environments with kubernetes_logs, since there is a big risk of losing logs. I really hope the root cause of these issues will be identified and fixed in upcoming releases.
@jszwedko are there any updates regarding this issue? Silently stopping to receive logs from k8s is a fairly critical issue for us.
Problem
After a while, our Vector daemonset will randomly stop shipping logs for a select service (other pods keep shipping logs).
Configuration
Example Data
Additional Context
Only logs from vector:
References
Similar to this one I've had in the past: https://github.com/vectordotdev/vector/discussions/8634