Open axelbodo opened 5 years ago
In hosts.go, line 94 and line 210 may run at nearly the same time in the events and main instances, and two WriteFile calls (line 210) may even overlap. According to the WriteFile documentation, it truncates the file before writing. As a result, the read at line 94 may see an empty or partially written file, or the two instances may write to the same file concurrently, each truncating what the other has already written.
To avoid race conditions during read/write, I would use os.OpenFile and syscall.Flock instead of ReadFile/WriteFile, as the latter give no protection against concurrent access to the same file. The OpenFile/Flock pair would make each read/write effectively atomic, provided every reader and writer of the file takes the lock.
Related to https://github.com/kopeio/etcd-manager/issues/200
I'd also like to note that this would only really get noticed in an environment where resolv.conf contains domain/search, or where DNS is incorrectly configured.
If DNS resolves localhost as 127.0.0.1, then nobody would notice.
If DNS resolves localhost as localhost.localsubdomain, because resolv.conf contains domain localsubdomain or search localsubdomain, and that lookup returns a non-127.0.0.1 result, then it becomes noticeable (i.e. calico failing).
We are using kopeio/etcd-manager:3.0.20190801 in our k8s cluster for events and main, and they corrupted the /etc/hosts file after a few hours.
On the consistent master it looks like this:
while on one of the other masters, where it is damaged:
As you can see, after some concurrent writes the events and main etcd-manager instances damaged the beginning of the file (removing part of the cloud.cfg comment). After some time they remove the host entries as well, and we end up with a file that contains no entries for localhost or for the hostname ip-x-x-x-x, which causes all the calico nodes in the cluster to become unready.
Attaching the two hosts files and the part of the Kibana logs we see:
consistent-etc-hosts.txt
damaged-etc-hosts.txt
filtered-kibana-log.txt