lest / prometheus-rpm

Prometheus RPM Packages
Apache License 2.0
248 stars 104 forks source link

Services start before networking is up #255

Closed karlism closed 4 years ago

karlism commented 4 years ago

We've run into scenario, where exporters are starting before networking is up during the system startup, and are failing to bind to network address:

Aug 11 14:06:23 hostname node_exporter[1024]: time="2020-08-11T14:06:23Z" level=fatal msg="listen tcp 172.16.XXX.XXX:9100: bind: cannot assign requested address" source="node_exporter.go:172"
Aug 11 14:06:23 hostname systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:06:23 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Scheduled restart job, restart counter is at 1.
Aug 11 14:06:24 hostname node_exporter[1059]: time="2020-08-11T14:06:24Z" level=fatal msg="listen tcp 172.16.XXX.XXX:9100: bind: cannot assign requested address" source="node_exporter.go:172"
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Scheduled restart job, restart counter is at 2.
Aug 11 14:06:24 hostname node_exporter[1064]: time="2020-08-11T14:06:24Z" level=fatal msg="listen tcp 172.16.XXX.XXX:9100: bind: cannot assign requested address" source="node_exporter.go:172"
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Scheduled restart job, restart counter is at 3.
Aug 11 14:06:24 hostname node_exporter[1110]: time="2020-08-11T14:06:24Z" level=fatal msg="listen tcp 172.16.XXX.XXX:9100: bind: cannot assign requested address" source="node_exporter.go:172"
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:06:24 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Scheduled restart job, restart counter is at 4.
Aug 11 14:06:25 hostname node_exporter[1157]: time="2020-08-11T14:06:25Z" level=fatal msg="listen tcp 172.16.XXX.XXX:9100: bind: cannot assign requested address" source="node_exporter.go:172"
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Service RestartSec=100ms expired, scheduling restart.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Scheduled restart job, restart counter is at 5.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Start request repeated too quickly.
Aug 11 14:06:25 hostname systemd[1]: node_exporter.service: Failed with result 'exit-code'.
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5034] hostname: hostname: using hostnamed
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5036] hostname: hostname changed from (none) to "hostname"
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5038] dns-mgr[0x5595c4895250]: init: dns=none,systemd-resolved rc-manager=unmanaged
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5155] Loaded device plugin: NMTeamFactory (/usr/lib64/NetworkManager/1.22.8-5.el8_2/libnm-device-plugin-team.so)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5156] manager: rfkill: Wi-Fi enabled by radio killswitch; enabled by state file
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5160] manager: rfkill: WWAN enabled by radio killswitch; enabled by state file
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5164] manager: Networking is enabled by state file
Aug 11 14:06:25 hostname dbus-daemon[965]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.8' (uid=0 pid=1012 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5171] dhcp-init: Using DHCP client 'internal'
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5252] settings: Loaded settings plugin: ifcfg-rh ("/usr/lib64/NetworkManager/1.22.8-5.el8_2/libnm-settings-plugin-ifcfg-rh.so")
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5254] settings: Loaded settings plugin: keyfile (internal)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5318] device (lo): carrier: link connected
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5367] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5457] manager: (ens192): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5535] device (ens192): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.5678] device (ens192): carrier: link connected
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6096] device (ens192): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6152] policy: auto-activating connection 'System ens192' (03da7500-2101-c722-2438-d0d006c28c73)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6233] device (ens192): Activation: starting connection 'System ens192' (03da7500-2101-c722-2438-d0d006c28c73)
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6254] device (ens192): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6299] manager: NetworkManager state is now CONNECTING
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6320] device (ens192): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6633] device (ens192): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6690] device (ens192): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6748] device (ens192): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6774] device (ens192): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6810] manager: NetworkManager state is now CONNECTED_LOCAL
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6899] manager: NetworkManager state is now CONNECTED_SITE
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6910] policy: set 'System ens192' (ens192) as default for IPv4 routing and DNS
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6925] device (ens192): Activation: successful, device activated.
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.6984] manager: NetworkManager state is now CONNECTED_GLOBAL
Aug 11 14:06:25 hostname NetworkManager[1012]: <info>  [1597154785.7050] manager: startup complete

Services have Restart=on-failure systemd parameter, but default value of RestartSec parameter is only 100ms, and they are restarted way too quickly. I suggest increasing value of RestartSec to 5 seconds, to avoid this issue.

karlism commented 4 years ago

Resolved by https://github.com/lest/prometheus-rpm/pull/256