Etcd-manager is damaging /etc/hosts file

axelbodo commented 5 years ago

We run kopeio/etcd-manager:3.0.20190801 in our k8s cluster for both the events and main etcd clusters, and after a few hours they corrupted the /etc/hosts file.

On the master where the file is still consistent, it looks like this:

# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
#     /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 ip-1-2-3-4.ourdomain.pri ip-1-2-3-4
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4      etcd-events-a.internal.example.com
1.2.3.5     etcd-events-b.internal.example.com
1.2.3.6    etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]

# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4      etcd-a.internal.example.com
1.2.3.5     etcd-b.internal.example.com
1.2.3.6    etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]

while on one of the other masters, where it is damaged, it looks like this:

r-data
#
127.0.1.1 ip-1-2-3-6.ourdomain.pri ip-1-2-3-6
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4      etcd-a.internal.example.com
1.2.3.5     etcd-b.internal.example.com
1.2.3.6    etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]

# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4      etcd-events-a.internal.example.com
1.2.3.5     etcd-events-b.internal.example.com
1.2.3.6    etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]

As you can see, after some concurrent writes the events and main etcd-manager instances damaged the beginning of the file (part of the cloud.cfg comment is gone). After some time they remove the host entries as well, and we end up with a file that contains no entries for localhost or for the hostname ip-x-x-x-x, which causes all the calico nodes in the cluster to become unready.

Attaching the two hosts files and the part of the Kibana logs we see:

consistent-etc-hosts.txt

damaged-etc-hosts.txt

filtered-kibana-log.txt

axelbodo commented 5 years ago

In hosts.go, line 94 and line 210 may run at nearly the same time in the events and main instances, or two WriteFile calls (line 210) may overlap. According to the WriteFile documentation, it truncates the file before writing. As a result, the read at line 94 may see an empty or partially written file, or two concurrent writers may each truncate what the other pod has already written.

axelbodo commented 5 years ago

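To avoid race conditions during read/write, I would use os.OpenFile and syscall.Flock instead of ReadFile/WriteFile, as the latter are not free of data races. Holding the lock on the opened file across the whole read and write would make the update atomic with respect to the other etcd-manager instance, as long as both instances take the lock.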

canadiannomad commented 5 years ago

I'd also like to note that this would only really get noticed in an environment where resolv.conf contains domain/search, or where DNS is incorrectly configured.

If DNS resolves localhost to 127.0.0.1, nobody would notice. If, because resolv.conf contains domain localsubdomain or search localsubdomain, localhost is resolved as localhost.localsubdomain and returns a non-127.0.0.1 result, it becomes noticeable (i.e. calico failing).
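
One way to tell whether a node is in this state is to check if its hosts file still maps localhost (and the node hostname) to an address, since resolution only falls through to DNS when the entry is missing. The sketch below is a hypothetical checker, not part of etcd-manager or calico; hostsLookup takes the file content as a string and returns the addresses listed for a name.

```go
package main

import (
	"bufio"
	"strings"
)

// hostsLookup is a hypothetical helper: it returns every address that
// hosts-format content maps to name, ignoring comments and blank lines.
func hostsLookup(content, name string) []string {
	var addrs []string
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // drop the comment part of the line
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue // need an address plus at least one name
		}
		for _, alias := range fields[1:] {
			if strings.EqualFold(alias, name) {
				addrs = append(addrs, fields[0])
			}
		}
	}
	return addrs
}
```

An empty result for localhost on a master means lookups for that name depend on resolv.conf and DNS, which is the situation described above.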

kopeio / etcd-manager

Etcd-manager is damaging /etc/hosts file #266