logdna / logdna-agent-v2

The blazingly fast, resource efficient log collection client
https://logdna.com
MIT License

Agent v3.7.0 gradually "loses" log files #441

Open · andrew-barnett opened this issue 1 year ago

andrew-barnett commented 1 year ago

We are running logdna-agent v3.7.0 on a k8s cluster with 3 nodes. As the agent process runs, it gradually loses track of log files and eventually stops exporting them.

For example, take one specific service on our cluster, running on a particular node as a particular container. Almost 24 hours later (so as of roughly 2022-11-03T14:00:00Z), the logdna-agent log lines that mention this container are as follows:

logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::cache] watching "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log"
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::cache] watching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256"
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::cache] watching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z WARN  fs::cache] watch descriptor for /var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log already exists...
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::tail] initialize event for symlink /var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log, final target /var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::tail] initialized symlink "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log" as DefaultKey(52v1)
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::tail] initialize event for symlink /var/log/pods/default_nats-streaming-ft-0_bb7ab5d1-f9c9-45d6-a71c-0f1ad973db76/nats-streaming/0.log, final target /var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log
logdna-agent-qvshv.log:[2022-11-02T14:54:10Z INFO  fs::tail] initialize event for file /var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log
logdna-agent-qvshv.log:[2022-11-02T14:55:29Z INFO  fs::cache] unwatching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256"
logdna-agent-qvshv.log:[2022-11-02T14:55:29Z INFO  fs::cache] unwatching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T14:55:29Z INFO  fs::cache] ignoring "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log.1"
logdna-agent-qvshv.log:[2022-11-02T14:55:29Z INFO  fs::cache] watching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T14:55:29Z INFO  fs::tail] added "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log"
logdna-agent-qvshv.log:[2022-11-02T15:12:12Z INFO  fs::cache] unwatching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T15:12:12Z INFO  fs::cache] watching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T15:12:12Z INFO  fs::tail] added "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log"

..."unwatching => watching => added" cycle repeats 6 more times, about every 3 to 5 minutes...

logdna-agent-qvshv.log:[2022-11-02T16:00:45Z INFO  fs::cache] watching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"
logdna-agent-qvshv.log:[2022-11-02T16:00:45Z INFO  fs::tail] added "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256.log"
logdna-agent-qvshv.log:[2022-11-02T16:04:17Z INFO  fs::cache] unwatching "/var/lib/docker/containers/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256/7c1f911ad110114df6d7e74bbe94816e24aeca7d28ea62d7f200bf059d791256-json.log"

So -- logdna-agent stopped watching this file as of 2022-11-02T16:04:17Z and never picked it back up. I've logged in to the node and confirmed that this log file still exists and contains lines written after that time -- almost 24 hours' worth.

The instability appears to have started with the log rotation that happened at 2022-11-02T14:55:29Z, though the agent did seem to recover from that first rotation just fine.

The k8s node is running Ubuntu 20.04.2 LTS (GNU/Linux 5.8.0-1041-aws x86_64). Disk, memory, and CPU usage are all well below their limits.
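
For reference, the pattern in the agent logs above is: resolve the /var/log/containers symlink to its "final target" under /var/lib/docker/containers, record that file, and re-add the watch after every rotation. A minimal, std-only Rust sketch of that kind of check is below, just to make the expected behaviour concrete; it is not the agent's actual fs::cache/fs::tail code, and the path used is a hypothetical, truncated stand-in.

```rust
// Minimal, std-only sketch of the bookkeeping the fs::cache / fs::tail log
// lines above imply: resolve the /var/log/containers symlink to its final
// target and notice when a rotation has swapped the inode behind it.
// Not the agent's actual implementation; the path is a truncated stand-in.
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::{Path, PathBuf};

/// Follow the symlink chain to the real file (the "final target" in the logs).
fn final_target(link: &Path) -> io::Result<PathBuf> {
    fs::canonicalize(link)
}

/// True if the link now resolves to a different inode than the one recorded
/// when the watch was set up, i.e. the json.log was rotated and recreated and
/// the watch has to be re-established.
fn rotated_since(link: &Path, watched_inode: u64) -> io::Result<bool> {
    Ok(fs::metadata(link)?.ino() != watched_inode) // metadata() follows symlinks
}

fn main() -> io::Result<()> {
    // Hypothetical, truncated path; substitute a real container log symlink.
    let link = Path::new(
        "/var/log/containers/nats-streaming-ft-0_default_nats-streaming-7c1f911a....log",
    );
    let target = final_target(link)?;
    let inode = fs::metadata(&target)?.ino();
    println!("watching {:?} (inode {})", target, inode);

    // ... later, e.g. after an inotify event or on a timer:
    if rotated_since(link, inode)? {
        println!("final target rotated; the watch needs to be re-added");
    }
    Ok(())
}
```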

gjanco commented 1 year ago

..."unwatching => watching => added" cycle repeats 6 more times, about every 3 to 5 minutes...

Are your logs rotating this frequently?

andrew-barnett commented 1 year ago

Per Mezmo support, they noticed an issue with symlink handling in 3.6 and 3.7 and requested that we downgrade to 3.5. That worked for us. They just recently released 3.8, which is supposed to have fixed this issue; we're testing 3.8 in our environment to confirm.