Open · DC-NunoAl opened this issue 3 years ago
I think we are hitting the same issue; I noticed today that some logs are not shipped at all. They come from a pod that produces a great deal of logs, and the files get rotated by the container engine about every 30 minutes. Restarting vector makes the logs get shipped again, but that is obviously not a solution.
I'll be seeing what I can do to replicate this today/over the weekend. Feel free to share any configs that may be relevant (and not defaults) - thanks!
This is on OpenShift, so CRI-O is the container runtime. The log directory on the node looks like this:
sh-4.4# ls /var/log/pods/ci_tide-757f9cc7d6-7457t_aedf9209-8e5f-4b1f-8015-63408a6d4622/tide
0.log 0.log.20210618-141411.gz 0.log.20210618-142945.gz 0.log.20210618-144638.gz 0.log.20210618-150251
Config looks like this:
# file: vector.toml
# Configuration for vector-agent
# Docs: https://vector.dev/docs/
# Set global options
data_dir = "/var/lib/vector"
# Ingest logs from Kubernetes
[sources.kubernetes]
type = "kubernetes_logs"
# captures services in the ci namespace
[transforms.ci-services]
type = "filter"
inputs = ["kubernetes"]
condition.type = "check_fields"
condition."kubernetes.pod_namespace.eq" = "ci"
condition."kubernetes.container_name.neq" = "test"
condition."kubernetes.container_name.neq" = "sidecar"
condition."kubernetes.container_name.neq" = "namespace-ttl-controller" # generates enormous amounts of logs
# parse the json so we can filter
[transforms.json_parsed]
type = "json_parser"
inputs = ["ci-services"]
drop_invalid = true
[sinks.aws_cloudwatch_logs]
type = "aws_cloudwatch_logs"
inputs = ["json_parsed"]
group_name = "app-ci-pod-logs"
region = "us-east-1"
stream_name = "{{ kubernetes.pod_node_name }}"
encoding.codec = "json"
# Release-controller logs
[transforms.release_controllers]
type = "filter"
inputs = ["kubernetes"]
condition.type = "check_fields"
condition."kubernetes.pod_namespace.eq" = "ci"
condition."kubernetes.pod_name.starts_with" = "release-controller-"
condition."kubernetes.container_name.eq" = ["controller"]
[sinks.rc_aws_cloudwatch_logs]
type = "aws_cloudwatch_logs"
inputs = ["release_controllers"]
group_name = "app-ci-logs/{{ kubernetes.pod_labels.app }}"
region = "us-east-1"
stream_name = "{{ kubernetes.pod_name }}"
encoding.codec = "text"
Vector version: vector:0.11.1-alpine (I'll see about bumping that today).
Thanks for looking into this!
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue.
@DC-NunoAl are these pods very short living? The cooldown there was originally added to improve performance by spending more time processing events compared to looking for files. We've made some changes to how that works internally and lowering that may be reasonable without impacting performance at this point.
Based on my initial testing, it appears as though vector is following the files through the rotation process here. Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl ?
Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl
At the time I noticed this, we didn't have any shipped logs for the past 12 hours and deleting the vector pod to get it re-created made the shipping happen so I doubt that's it. I just bumped vector in https://github.com/openshift/release/pull/19445 and right now there are logs but the "repro" is "wait a couple of hours" so that might be expected. I will check again later and update here.
In case that helps, here are the last 25k log lines of the app where I noticed this (probably noteworthy: it didn't seem to happen for others, but I believe this one emits more logs than everything else): https://gist.github.com/alvaroaleman/e8e57e62435fcf62b04ba5498e5e1857
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod?
I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
Do you have an estimated line/s that controller is generating?
The 25k line sample started at 17:13:54Z and ended at 17:19:01Z, so that makes it about 69 messages/second. It is very bursty though, so the peak is likely much higher. There are logs from other components on the node at the same time.
I'm generating events with flog, so I should be able to crank up the throughput to something similar to what you're seeing.
File-based scraping is always going to be susceptible to these kinds of imbalances. Typically, you need a buffer that can help mitigate bursty logs and/or slow sinks.
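For the aws_cloudwatch_logs sink in the config above, such a buffer would be configured on the sink itself. The following is only a hedged sketch: the size is an arbitrary example value, and whether a buffer helps here depends on whether the bottleneck really is the sink rather than file discovery.
# Hedged sketch: absorb bursts on disk instead of slowing the source.
[sinks.aws_cloudwatch_logs.buffer]
type = "disk"
max_size = 1073741824   # ~1 GiB, an arbitrary example; stored under data_dir, which must be persistent
when_full = "block"     # apply backpressure rather than dropping events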
However, if the log scraper is not given enough resources to keep up, or if it is slowed down by other behaviors (like unnecessary statx/openat/read/close sequences, or unneeded synchronization, futex calls, etc.), then it just won't keep up with the log generation rate. Log file rotation will then cause logs to be "dropped" in very large chunks.
Remember that with Kubernetes and CRI-O, each container has 2 pipes being read sequentially and written to disk (see https://github.com/containers/conmon/issues/262). So the combined bandwidth of all the conmon processes can be a gating factor. Hopefully, folks move their logging disk to a separate individual disk so as not to destabilize the node as well.
See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue.
@DC-NunoAl are these pods very short living? The cooldown there was originally added to improve performance by spending more time processing events compared to looking for files. We've made some changes to how that works internally and lowering that may be reasonable without impacting performance at this point.
Based on my initial testing, it appears as though vector is following the files through the rotation process here. Are the logs here unique or could we be seeing the same log line emitted @alvaroaleman, @DC-NunoAl ?
Our current testing strategy is having a pod send 200k messages in 5 minutes - the pod is considered "complete" after about 5 minutes. We've tested this with 10 minutes too (400k messages) and the result was the same.
Our pod generates unique logs with different timestamps (and vector is shipping them to Elasticsearch with a very, very simple pipeline).
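For reference, a minimal pipeline of that shape might look like the sketch below. The endpoint URL and index name are hypothetical placeholders, not values from this report, and option names can differ between Vector versions.
# Hedged sketch of a kubernetes_logs -> elasticsearch pipeline; endpoint and index are placeholders.
data_dir = "/var/lib/vector"

[sources.k8s]
type = "kubernetes_logs"

[sinks.es]
type = "elasticsearch"
inputs = ["k8s"]
endpoint = "http://elasticsearch.logging.svc:9200"
index = "stress-test-%Y.%m.%d"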
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod? I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
In our case, our average logs per second in this test is 667 messages/s (we want to achieve at least 2500/s). This vector instance is only looking at this one pod. On Monday we'll try to have more pods and see if the entire source stops or if vector still catches logs from some pods.
Regarding glob_minimum_cooldown_ms = 500: it helped us go from 50k messages in 5 minutes with 0 logs lost to 500k messages in 5 minutes.
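In TOML form that tuning is a single line on the source. A sketch based on the config shared earlier in this thread; the option was undocumented at the time, so check it against your Vector version:
[sources.kubernetes]
type = "kubernetes_logs"
# Glob for new/rotated log files every 500 ms instead of the 4000 ms default discussed here,
# so a freshly created 0.log is less likely to be rotated away before it is discovered.
glob_minimum_cooldown_ms = 500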
One of our typical log messages looks something like this:
{
"@timestamp": "2021-06-18T16:36:12.6533278+01:00",
"level": "Info",
"message": "2021-06-18 14:44:13.0166|INFO",
"root": {
"appdomain": "",
"hostname": "",
"machinename": "",
"logid": "1476",
"loguuid": "4c84422e8c724944ab354e94128bc19b",
"environment": "pd01",
"environmentalias": "wld",
"servicename": "",
"target": "adhoc"
},
"data": {
"Description": "BulkLogResend",
"Details": {
"ActionType": "processed",
"MessageType": "GreenPipes.DynamicInternal.Contracts.Events",
"Message": {
"a": "1f75e0f1-61a6-4fea-b7f9-75b031b27a6a",
"b": "23f710d7-6d03-4d1f-809f-d1a0fccfe927",
"c": "268d392e-34c2-4fde-b1a9-2be8d1be1bfd",
"d": "61c82519-3bde-3a17-a911-8b2983d750a2",
"e": "b33fdafe-e35d-4781-985b-49674f664b43",
"f": "f0302809-97f8-4931-9a73-ecec871d0e3b",
"g": "",
"h": "127.0.0.1",
"MessageTimestamp": "2021-06-18T14:44:12.8201996Z",
"SubmittedResult": {
"metadata": {
"isMatch": "Y"
}
}
},
"ApplicationName": "Service",
"MessageId": "87c35bd5-c7b8-44ce-8f3f-1b6482e7013d",
"ConversationId": "31200000-784e-2818-71ef-08d932678a04",
"SourceAddress": "/production/w3wp_bus_groyyydajawbomdubdcurwptyc",
"DestinationAddress": "/production/Contracts.Events",
"TimeFrames": {
"SentAt": "2021-06-18T14:44:12.820333Z",
"ReceivedAt": "2021-06-18T14:44:12.8439199Z",
"FinishedAt": "2021-06-18T14:44:13.0163676Z",
"TimeToReceiveMs": 23.5869,
"TimeOfProcessingMs": 172.4477,
"MessageLifetimeMs": 196.0346
}
}
}
}
@alvaroaleman thanks for the update! Do you have an estimated line/s that controller is generating? When the pipeline appeared to have stalled, it stalled for the entire kubernetes_logs source, not just that particularly chatty pod? I'm wondering if that could be related to backpressure from the sink; as a user I ran into something similar, where it seemed like the source just silently stopped running, and it turned out to be due to issues at the sink causing vector to slow the event collection.
@spencergilbert We've tested whether this behavior would stall the entire kubernetes_logs source as follows:
The outcome was as follows:
This behavior seems to indicate that the source does not get stalled.
The spikes on the graph seem to be vector catching some logs from the pod that logs 200k messages.
We tested this with the default glob_minimum_cooldown_ms. We also ran the same test with glob_minimum_cooldown_ms: 500, which mitigated the problem.
File-based scraping is always going to be susceptible to these kinds of imbalances. Typically, you need a buffer that can help mitigate bursty logs and/or slow sinks.
However, if the log scraper is not given enough resources to keep up, or if it is slowed down by other behaviors (like unnecessary statx/openat/read/close sequences, or unneeded synchronization, futex calls, etc.), then it just won't keep up with the log generation rate. Log file rotation will then cause logs to be "dropped" in very large chunks.
Remember that with Kubernetes and CRI-O, each container has 2 pipes being read sequentially and written to disk (see containers/conmon#262). So the combined bandwidth of all the conmon processes can be a gating factor. Hopefully, folks move their logging disk to a separate individual disk so as not to destabilize the node as well. See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
Thanks for the info @portante - that lucidchart looks to be private though
See also https://lucid.app/lucidchart/1c4192bb-4eef-4a3b-a1db-9293342ca7e2/edit?page=XK1FUROkUunC#.
Thanks for the info @portante - that lucidchart looks to be private though
Sorry, this is the public sharing link: https://lucid.app/lucidchart/invitations/accept/inv_2c2874c7-a202-4fa9-872d-e54705de517a
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source.
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source
So since I upgraded the vector version to 0.14.X last Friday this didn't happen anymore. Since there is no clear repro, it is hard to say if whatever caused this was fixed or if it is just coincidence. If it happens again, I'll try glob_minimum_cooldown_ms (and keep watching this issue). Thanks for all your help!
@alvaroaleman have you tried adjusting the glob_minimum_cooldown_ms configuration to see if it helps? It's undocumented but can be applied to the kubernetes_logs source
So since I upgraded the vector version to 0.14.X last Friday this didn't happen anymore. Since there is no clear repro, it is hard to say if whatever caused this was fixed or if it is just coincidence. If it happens again, I'll try glob_minimum_cooldown_ms (and keep watching this issue). Thanks for all your help!
That's great to hear - for what it's worth, we think the throughput volume was causing the files to be rotated from 0.log to 0.log.DATE before the source globbed for more files to watch, and right now the source is explicitly only looking for *.log in that directory.
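To put that race in rough terms (an illustration drawn from the values discussed in this thread, not a measurement):
# Timing sketch of the rotation race described above.
# New files are only discovered on a glob pass, and consecutive passes are
# separated by at least glob_minimum_cooldown_ms (plus whatever time is spent
# reading already-discovered files), so:
#   exposure window >= 4000 ms with the default discussed here
#   exposure window >=  500 ms with the tuned value
# Any 0.log that is created and rotated to 0.log.DATE inside that window is
# never read, because only *.log is globbed, not the rotated names. Lowering
# the cooldown shrinks the window but does not remove the race entirely.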
Vector Version
Vector Configuration File
Expected Behavior
Vector should see and log all data that the multiple pods are logging to stdout.
Actual Behavior
Vector misses some logs that are created by the pods.
Additional Context
We're currently doing performance testing to see if vector would be a good solution for our logging system. In the course of this test we found out that some logs of our logging stress test pods were missing. After investigation, we found out that changing glob_minimum_cooldown_ms to 500 ms instead of the default 4000 ms alleviates this issue. We've also found out that our AKS containerd cluster structures the logs as follows:
What we think is happening is the following:
We think there might be two problems: