Open · DC-NunoAl opened this issue 3 years ago
I think we are hitting the same issue; I noticed today that some logs are not shipped at all. They come from a pod that produces a great deal of logs, and the files get rotated by the container engine about every 30 minutes. Restarting vector makes the logs get shipped again, but that is obviously not a solution.
I'll be seeing what I can do to replicate this today/over the weekend. Feel free to share any configs that may be relevant (and not defaults) - thanks!
This is on OpenShift, so CRI-O is the container runtime. The log directory on the node looks like this:
sh-4.4# ls /var/log/pods/ci_tide-757f9cc7d6-7457t_aedf9209-8e5f-4b1f-8015-63408a6d4622/tide
0.log 0.log.20210618-141411.gz 0.log.20210618-142945.gz 0.log.20210618-144638.gz 0.log.20210618-150251
Config looks like this:
# file: vector.toml
# Configuration for vector-agent
# Docs: https://vector.dev/docs/
# Set global options
data_dir = "/var/lib/vector"
# Ingest logs from Kubernetes
[sources.kubernetes]
type = "kubernetes_logs"
# captures services in the ci namespace
[transforms.ci-services]
type = "filter"
inputs = ["kubernetes"]
condition.type = "check_fields"
condition."kubernetes.pod_namespace.eq" = "ci"
condition."kubernetes.container_name.neq" = "test"
condition."kubernetes.container_name.neq" = "sidecar"
condition."kubernetes.container_name.neq" = "namespace-ttl-controller" # generates enormous amounts of logs
# parse the json so we can filter
[transforms.json_parsed]
type = "json_parser"
inputs = ["ci-services"]
drop_invalid = true
[sinks.aws_cloudwatch_logs]
type = "aws_cloudwatch_logs"
inputs = ["json_parsed"]
group_name = "app-ci-pod-logs"
region = "us-east-1"
stream_name = "{{ kubernetes.pod_node_name }}"
encoding.codec = "json"
# Release-controller logs
[transforms.release_controllers]
type = "filter"
inputs = ["kubernetes"]
condition.type = "check_fields"
condition."kubernetes.pod_namespace.eq" = "ci"
condition."kubernetes.pod_name.starts_with" = "release-controller-"
condition."kubernetes.container_name.eq" = ["controller"]
[sinks.rc_aws_cloudwatch_logs]
type = "aws_cloudwatch_logs"
inputs = ["release_controllers"]
group_name = "app-ci-logs/{{ kubernetes.pod_labels.app }}"
region = "us-east-1"
stream_name = "{{ kubernetes.pod_name }}"
encoding.codec = "text"
Vector version: vector:0.11.1-alpine (I'll see about bumping that today).
Thanks for looking into this!
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue.
@DC-NunoAl are these pods very short living? The cooldown there was originally added to improve performance by spending more time processing events compared to looking for files. We've made some changes to how that works internally and lowering that may be reasonable without impacting performance at this point.
Based on my initial testing, it appears as though vector is following the files through the rotation process here. Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl ?
Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl
At the time I noticed this, we didn't have any shipped logs for the past 12 hours and deleting the vector pod to get it re-created made the shipping happen so I doubt that's it. I just bumped vector in https://github.com/openshift/release/pull/19445 and right now there are logs but the "repro" is "wait a couple of hours" so that might be expected. I will check again later and update here.
In case that helps, here are the last 25k log lines of the app where I noticed this (probably noteworthy: it didn't seem to happen for others, but I believe this one emits more logs than everything else): https://gist.github.com/alvaroaleman/e8e57e62435fcf62b04ba5498e5e1857
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod?
I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
Do you have an estimated line/s that controller is generating?
The 25k line sample started at 17:13:54Z and ended at 17:19:01Z, so that makes it about 69 messages/second. It is very bursty though, so the peak is likely much higher. There are logs from other components on the node at the same time.
I'm generating events with flog, so I should be able to crank up the throughput to something similar to what you're seeing.
File-based scraping is always going to be susceptible to these kinds of imbalances. Typically, you need a buffer that can help mitigate bursty logs and/or slow sinks.
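For the aws_cloudwatch_logs sink in the config above, such a buffer would be configured on the sink itself. The following is only a hedged sketch: the size is an arbitrary example value, and whether a buffer helps here depends on whether the bottleneck really is the sink rather than file discovery.
# Hedged sketch: absorb bursts on disk instead of slowing the source.
[sinks.aws_cloudwatch_logs.buffer]
type = "disk"
max_size = 1073741824   # ~1 GiB, an arbitrary example; stored under data_dir, which must be persistent
when_full = "block"     # apply backpressure rather than dropping events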
However, if the log scraper is not given enough resources to keep up, or if it is slowed down by other behaviors (like unnecessary statx/openat/read/close sequences, or unneeded synchronization, futex calls, etc.), then it just won't keep up with the log generation rate. Log file rotation will then cause logs to be "dropped" in very large chunks.
Remember that with Kubernetes and CRI-O, each container has 2 pipes being read sequentially and written to disk (see https://github.com/containers/conmon/issues/262). So the combined bandwidth of all the conmon processes can be a gating factor. Hopefully, folks move their logging disk to a separate individual disk so as not to destabilize the node as well.
See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue.
@DC-NunoAl are these pods very short living? The cooldown there was originally added to improve performance by spending more time processing events compared to looking for files. We've made some changes to how that works internally and lowering that may be reasonable without impacting performance at this point.
Based on my initial testing, it appears as though vector is following the files through the rotation process here. Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl ?
Our current testing strategy is having a pod send 200k messages in 5 minutes - the pod is considered "complete" after about 5 minutes. We've tested this with 10 minutes too (400k messages) and the result was the same.
Our pod generates unique logs with different timestamps (and vector is shipping them to Elasticsearch with a very, very simple pipeline).
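For reference, a minimal pipeline of that shape might look like the sketch below. The endpoint URL and index name are hypothetical placeholders, not values from this report, and option names can differ between Vector versions.
# Hedged sketch of a kubernetes_logs -> elasticsearch pipeline; endpoint and index are placeholders.
data_dir = "/var/lib/vector"

[sources.k8s]
type = "kubernetes_logs"

[sinks.es]
type = "elasticsearch"
inputs = ["k8s"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "stress-test-%Y.%m.%d"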
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod? I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
In our case, our average logs per second in this test is 667 messages/s (we want to achieve at least 2500/s). This vector instance is only looking at this one pod. On Monday we'll try to have more pods and see if the entire source stops or if vector still catches logs from some pods.
Regarding glob_minimum_cooldown_ms = 500: it helped us go from 50k messages in 5 minutes with 0 logs lost to 500k messages in 5 minutes.
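In TOML form that tuning is a single line on the source. A sketch based on the config shared earlier in this thread; the option was undocumented at the time, so check it against your Vector version:
[sources.kubernetes]
type = "kubernetes_logs"
# Glob for new/rotated log files every 500 ms instead of the 4000 ms default discussed here,
# so a freshly created 0.log is less likely to be rotated away before it is discovered.
glob_minimum_cooldown_ms = 500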
One of our typical log messages looks something like this:
{
"@timestamp": "2021-06-18T16:36:12.6533278+01:00",
"level": "Info",
"message": "2021-06-18 14:44:13.0166|INFO",
"root": {
"appdomain": "",
"hostname": "",
"machinename": "",
"logid": "1476",
"loguuid": "4c84422e8c724944ab354e94128bc19b",
"environment": "pd01",
"environmentalias": "wld",
"servicename": "",
"target": "adhoc"
},
"data": {
"Description": "BulkLogResend",
"Details": {
"ActionType": "processed",
"MessageType": "GreenPipes.DynamicInternal.Contracts.Events",
"Message": {
"a": "1f75e0f1-61a6-4fea-b7f9-75b031b27a6a",
"b": "23f710d7-6d03-4d1f-809f-d1a0fccfe927",
"c": "268d392e-34c2-4fde-b1a9-2be8d1be1bfd",
"d": "61c82519-3bde-3a17-a911-8b2983d750a2",
"e": "b33fdafe-e35d-4781-985b-49674f664b43",
"f": "f0302809-97f8-4931-9a73-ecec871d0e3b",
"g": "",
"h": "127.0.0.1",
"MessageTimestamp": "2021-06-18T14:44:12.8201996Z",
"SubmittedResult": {
"metadata": {
"isMatch": "Y"
}
}
},
"ApplicationName": "Service",
"MessageId": "87c35bd5-c7b8-44ce-8f3f-1b6482e7013d",
"ConversationId": "31200000-784e-2818-71ef-08d932678a04",
"SourceAddress": "/production/w3wp_bus_groyyydajawbomdubdcurwptyc",
"DestinationAddress": "/production/Contracts.Events",
"TimeFrames": {
"SentAt": "2021-06-18T14:44:12.820333Z",
"ReceivedAt": "2021-06-18T14:44:12.8439199Z",
"FinishedAt": "2021-06-18T14:44:13.0163676Z",
"TimeToReceiveMs": 23.5869,
"TimeOfProcessingMs": 172.4477,
"MessageLifetimeMs": 196.0346
}
}
}
}
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod? I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
@spencergilbert We've tested whether this behavior would stall the entire kubernetes_logs source as follows:
The outcome was as follows:
This behavior seems to indicate that the source does not get stalled.
The spikes on the graph seem to be vector catching some logs from the pod that logs 200k messages.
We tested this with the default glob_minimum_cooldown_ms. We also ran the same test with glob_minimum_cooldown_ms: 500, which mitigated the problem.
File-based scraping is always going to be susceptible to these kinds of imbalances. Typically, you need a buffer that can help mitigate bursty logs and/or slow sinks.
However, if the log scraper is not given enough resources to keep up, or if it is slowed down by other behaviors (like unnecessary statx/openat/read/close sequences, or unneeded synchronization, futex calls, etc.), then it just won't keep up with the log generation rate. Log file rotation will then cause logs to be "dropped" in very large chunks.
Remember that with Kubernetes and CRI-O, each container has 2 pipes being read sequentially and written to disk (see containers/conmon#262). So the combined bandwidth of all the conmon processes can be a gating factor. Hopefully, folks move their logging disk to a separate individual disk so as not to destabilize the node as well. See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
Thanks for the info @portante - that lucidchart looks to be private though
See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
Thanks for the info @portante - that lucidchart looks to be private though
Sorry, this is the public sharing link: https://lucid.app/lucidchart/invitations/accept/inv_2c2874c7-a202-4fa9-872d-e54705de517a
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source.
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source
So since I upgraded the vector version to 0.14.X last Friday this didn't happen anymore. Since there is no clear repro, it is hard to say if whatever caused this was fixed or if it is just coincidence. If it happens again, I'll try glob_minimum_cooldown_ms (and keep watching this issue). Thanks for all your help!
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source
So since I upgraded the vector version to 0.14.X last Friday this didn't happen anymore. Since there is no clear repro, it is hard to say if whatever caused this was fixed or if it is just coincidence. If it happens again, I'll try glob_minimum_cooldown_ms (and keep watching this issue). Thanks for all your help!
That's great to hear - for what it's worth, we think the throughput volume was causing the files to be rotated from 0.log to 0.log.DATE before the source globbed for more files to watch, and right now the source is explicitly only looking for *.log in that directory.
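To put that race in rough terms (an illustration drawn from the values discussed in this thread, not a measurement):
# Timing sketch of the rotation race described above.
# New files are only discovered on a glob pass, and consecutive passes are
# separated by at least glob_minimum_cooldown_ms (plus whatever time is spent
# reading already-discovered files), so:
#   exposure window >= 4000 ms with the default discussed here
#   exposure window >=  500 ms with the tuned value
# Any 0.log that is created and rotated to 0.log.DATE inside that window is
# never read, because only *.log is globbed, not the rotated names. Lowering
# the cooldown shrinks the window but does not remove the race entirely.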
Vector Version
Vector Configuration File
Expected Behavior
Vector should see and log all data that the multiple pods are logging to stdout.
Actual Behavior
Vector misses some logs that are created by the pods.
Additional Context
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue. We've also found out that our AKS containerd cluster structures the logs as follows:
What we think is happening is the following:
We think there might be two problems: