vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector attempts to watch logs of non-existent containers #8881

Open Mitsuwa opened 3 years ago

Mitsuwa commented 3 years ago

Vector Version

timberio/vector:nightly-2021-06-27-distroless-libc

Vector Configuration File

We use the standard docker_logs source and communicate with the Docker socket:

      # https://vector.dev/docs/reference/sources/docker_logs/
      # Collects all docker logs on host through /var/run/docker.sock
      [sources.docker]
        type = "docker_logs"
        exclude_containers = [list of excluded containers]
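
For reference, a minimal sketch (assuming the bollard crate, whose error types show up in the logs below) of talking to the same /var/run/docker.sock that the docker_logs source uses; this is not Vector's code, just a way to reproduce the daemon calls outside of Vector.

    // Minimal sketch, assuming the bollard crate; not Vector's implementation.
    use bollard::Docker;

    #[tokio::main]
    async fn main() -> Result<(), bollard::errors::Error> {
        // On Linux, connect_with_local_defaults() talks to /var/run/docker.sock.
        let docker = Docker::connect_with_local_defaults()?;
        // A simple round trip to confirm the daemon is reachable.
        let version = docker.version().await?;
        println!("daemon version: {:?}", version.version);
        Ok(())
    }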

Debug Output

Expected Behavior

Vector should not attempt to watch logs of container_ids that no longer exist
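
For illustration only, a hedged sketch of what that handling could look like, using the bollard error variant that appears in the logs below (the variant name is taken from those logs and may differ across bollard versions); this is not Vector's actual code.

    // Sketch: decide whether a watch task should give up on a container
    // based on the daemon's response, instead of retrying forever.
    use bollard::errors::Error;

    fn container_is_gone(err: &Error) -> bool {
        // Variant name as it appears in the error logs below.
        matches!(err, Error::DockerResponseNotFoundError { .. })
    }

    // A watch loop could call container_is_gone(&err) and drop that
    // container_id from its watch set rather than looping on the same request.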

Actual Behavior

While Vector is running, a very short-lived pod (one that actually crash-loops) starts.

Vector will show that it starts watching logs for this container:

Aug 25 18:25:35.783  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2
Aug 25 18:25:35.783  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2

Watching docker ps -a, I can see that the pod exited N minutes ago. However, once docker ps -a no longer shows any container for 62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2, we begin seeing communication errors for that container_id:

Aug 25 18:34:58.973 ERROR vector::internal_events::docker_logs: Internal log [Error in communication with Docker daemon.] has been rate limited 4432 times.
Aug 25 18:34:58.973 ERROR vector::internal_events::docker_logs: Error in communication with Docker daemon. error=DockerResponseNotFoundError { message: "{\"message\":\"No such container: 62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2\"}\n" } container_id=Some("62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2") internal_log_rate_secs=10
Aug 25 18:34:58.973 ERROR vector::internal_events::docker_logs: Internal log [Error in communication with Docker daemon.] is being rate limited.

You will also see in the Vector logs that it is spamming Started/Stopped watching messages for that container ID (and for other container IDs that have hit the same problem before).

It's as if Vector is not refreshing the list of container IDs it should be monitoring, and the communication errors just pile up:

Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=9396ebf07903e56f00377175c5f1253942267e64e259db00144dea18cd545b33
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=9729d1b5085421a4f92c7e6ee39cdc83bd2cb64d479047935424fcad862380fa
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=9729d1b5085421a4f92c7e6ee39cdc83bd2cb64d479047935424fcad862380fa
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=62a39bfabee39106099e23cc5a5db6df97098cbcd4a9282b73eb51f16046b4a2
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Stopped watching for container logs. container_id=1bc6ac9eab36b841ae694762a29ec90e955dd7363206ea7209e9a6f6960a792e
Aug 25 19:24:53.790  INFO vector::internal_events::docker_logs: Started watching for container logs. container_id=1bc6ac9eab36b841ae694762a29ec90e955dd7363206ea7209e9a6f6960a792e

These communication errors never seem to stop until we restart the Vector container.
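
As a hedged sketch of the kind of reconciliation described above (assuming the bollard crate; the function and variable names here are hypothetical, not Vector's actual code), the watched set could periodically be pruned against what the daemon still reports:

    use std::collections::HashSet;

    use bollard::container::ListContainersOptions;
    use bollard::Docker;

    // Hypothetical helper: drop watched container IDs that the Docker daemon
    // no longer knows about, so removed containers are not retried forever.
    async fn prune_watched(
        docker: &Docker,
        watched: &mut HashSet<String>,
    ) -> Result<(), bollard::errors::Error> {
        // Ask the daemon for every container it still tracks (running or exited).
        let live: HashSet<String> = docker
            .list_containers(Some(ListContainersOptions::<String> {
                all: true,
                ..Default::default()
            }))
            .await?
            .into_iter()
            .filter_map(|c| c.id)
            .collect();

        // Anything we watch that the daemon no longer lists has been removed;
        // stop watching it instead of hammering the API with "No such container".
        watched.retain(|id| live.contains(id));
        Ok(())
    }

Whether the fix belongs in a periodic reconciliation like this or in treating the 404 above as terminal for the watch task is a design choice for the maintainers; either way the retry loop on a removed container_id would end.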

Example Data

Additional Context

Vector runs as a DaemonSet in Kubernetes. I have not noticed this issue with any other crash-looping container. Perhaps it is related to how short-lived the container is?

The pod itself has multiple containers running in it, and only one of them is crash-looping; in other words, the pod sits at something like 5/6 containers ready, if that makes a difference.

References

hkrutzer commented 1 year ago

Also ran into this when simply upgrading a container.