rancher / os

Tiny Linux distro that runs the entire OS as Docker containers
https://rancher.com/docs/os/v1.x/en/
Apache License 2.0

IPv6 default gateway not set, sometimes #1975

Open robbertkl opened 7 years ago

robbertkl commented 7 years ago

RancherOS Version: (ros os version) v1.0.3

Where are you running RancherOS? (docker-machine, AWS, GCE, baremetal, etc.) Cloud VPS (qemu)

I have a cloud-config like this:

rancher:
  network:
    interfaces:
      eth0:
        addresses:
          - xxx.xxx.xxx.xxx/24
          - xxxx:xxxx:xxxx:xxxx::xxxx/48
        gateway: xxx.xxx.xxx.xxx
        gateway_ipv6: xxxx:xxxx:xxxx:xxxx::xxxx

Unfortunately, the IPv6 default route is not always present after a reboot. Sometimes it is, sometimes it isn't.

Please note I just started running with kernel boot parameter ipv6.autoconf=0, which effectively turns off my link-local address on eth0, since I only want to use my manually set IPv6 address for outgoing traffic. I have to set this as a kernel boot parameter instead of net.ipv6.conf... sysctl, because RancherOS does not apply sysctls before initializing the network. However, this issue was already present before I started using ipv6.autoconf=0.

robbertkl commented 7 years ago

I'm unable to reproduce this on a local VM in a non-IPv6 environment. Perhaps the presence of an IPv6 router causes the default route to get removed after being set? Even though autoconf is off for the interface, accept_ra is still on (though I'm not sure whether it does anything when autoconf is off).

robbertkl commented 7 years ago

After some more testing with my IPv6 cloud machine, I can confirm this does seem to be caused by router advertisements. During the boot process, a fe80::something default route gets added, which disappears soon after (not sure why). When the cloud-config.yml gets processed, it probably tries to add the default route, which fails because there already is one. It fails silently because of: https://github.com/rancher/os/blob/7615c26f44c3f88530a7e76b3ae5867a7cfb8bf8/netconf/netconf_linux.go#L327. When the router-advertised route then disappears, no default route is left.
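
One way to check this on an affected machine is to look at whether the current IPv6 default route was learned from a router advertisement. The sketch below assumes that `ip -6 route show default` marks such routes with `proto ra`, which is how iproute2 usually prints them; the function name `has_ra_default` is made up for illustration. It only reads text from stdin, so it can also be tried on captured output.

```shell
#!/bin/sh
# Sketch: report whether any IPv6 default route read from stdin was learned
# from a router advertisement.
# Assumption: iproute2 prints "proto ra" for router-advertised routes.
has_ra_default() {
    # Exit status 0 if at least one default-route line contains "proto ra".
    grep -Eq '^default .*proto ra( |$)'
}

# Example on a live system:
#   ip -6 route show default | has_ra_default && echo "RA default route present"
```

Running this in a loop during boot should show the router-advertised route appearing and then going away again, if the explanation above is right.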

Is there a reason the 2nd parameter to SetGateway is set to true, so that it only adds the gateway if there is no default gateway yet? Also, the code here (https://github.com/rancher/os/blob/7615c26f44c3f88530a7e76b3ae5867a7cfb8bf8/netconf/netconf_linux.go#L412) also removes IPv6 addresses, even though it's only meant for the IPv4 DHCP setting.

I'm willing to work on a PR to clean this up a bit, keeping IPv6 in mind. Can you let me know if this would make sense, @SvenDowideit ?

Unfortunately, I can't turn off accept_ra before the network starts, because RancherOS applies sysctl later in the process (see #1175 / #1539). Therefore, my only solution seems to be:

write_files:
  - path: /opt/rancher/bin/start.sh
    permissions: "0755"
    owner: root
    content: |
      #!/bin/bash
      sysctl -w net.ipv6.conf.eth0.accept_ra=0
      ip -6 route del default
      ip -6 route add default via xxxx:xxxx:xxxx:xxxx::xxxx dev eth0

SvenDowideit commented 7 years ago

@robbertkl yes, a PR would be very helpful - I don't have IPv6 here, and so haven't got experience with it :/

robbertkl commented 7 years ago

Actually, I think some of the changes I would like to make (such as changing the true to false, as described in my previous message) would break backward compatibility or introduce different behaviour, which is why a PR probably won't go through.

As I understand it, the code actually tries to revert DHCP when it's turned off but was already initialised earlier, rather than never starting it in the first place. Not very elegant, but I guess it works, given that the network has to be initialised before the cloud-config.yml can be processed. This is controlled by the dhcp: bool setting, which can also be turned off using a rancher.* kernel boot parameter to prevent DHCP from being started in the first place.

For IPv6, however, there usually is no such setting and no "DHCP daemon", since this is controlled by sysctls. Using the sysctl values (or introducing a new cloud-config setting that sets them) doesn't seem like the right way, but perhaps it is the only way, since the network is already initialised by the time the cloud-config gets processed. What do you think? Should it mimic the IPv4 behaviour by changing the sysctls, removing already autoconfigured addresses and/or routes, then adding the new ones?

To make matters more complicated, instead of having DHCP set both IP address and gateway, they are separate with IPv6 (autoconf and accept_ra sysctls).
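
To see how the two settings stand on a given interface, the values can be read from /proc. A small sketch: the helper name `show_ra_sysctls` is hypothetical, and the optional second argument for the procfs root is added here only so the function can be tried against a fake directory.

```shell
#!/bin/sh
# Sketch: print the two per-interface IPv6 sysctls discussed above.
#   autoconf  -> whether addresses are autoconfigured from advertised prefixes
#   accept_ra -> whether router advertisements are accepted at all
show_ra_sysctls() {
    iface=$1
    root=${2:-/proc/sys}   # assumption: overridable root, for testing only
    for key in autoconf accept_ra; do
        file="$root/net/ipv6/conf/$iface/$key"
        if [ -r "$file" ]; then
            printf '%s.%s=%s\n' "$iface" "$key" "$(cat "$file")"
        else
            printf '%s.%s=unavailable\n' "$iface" "$key"
        fi
    done
}

# Example: show_ra_sysctls eth0
```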

SvenDowideit commented 7 years ago

yeah, we do need to bring up the network before we probe for / get the cloud-init file :/

but - do we need to bring up the IPv6 network then? Does anyone use IPv6 to get a network cloud-init.yml?

robbertkl commented 7 years ago

Well, yes, it's no use to improve IPv6 support in one area and cut it out of another. With forced IPv6-only around the corner, it's not wise to assume everyone uses IPv4 to get a network cloud-init.yml.

ngdio commented 6 years ago

The workaround didn't work for me. I "solved" the problem by adding this to the cloud-config:

rancher:
  network:
    post_cmds:
    - "ip -6 route del default"
    - "ip -6 route add default via [GATEWAY] dev eth0"

Your solution did not work for me because when the startup scripts are executed, the network might not be ready yet, so in my case, the changes were not actually applied.
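
A variant that should not depend on whether a default route already exists is `ip -6 route replace`, which adds the route or overwrites an existing one. The sketch below also waits for the interface to appear before touching it, which is not the same as the network being fully ready. The function names, the retry limit and the `DRY_RUN` switch are illustrative additions, and `2001:db8::1` is a documentation-prefix placeholder for the real gateway.

```shell
#!/bin/sh
# Sketch: wait for an interface, then force the IPv6 default route.
# DRY_RUN=1 prints the commands instead of running them (illustrative switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

set_ipv6_default() {
    iface=$1
    gateway=$2
    tries=30
    # Poll once per second until the interface exists (skipped in dry-run mode).
    while [ "${DRY_RUN:-0}" != "1" ] && [ ! -e "/sys/class/net/$iface" ]; do
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep 1
    done
    # Stop router advertisements from installing their own default route.
    run sysctl -w "net.ipv6.conf.$iface.accept_ra=0"
    # "replace" adds the route, or overwrites an existing default route.
    run ip -6 route replace default via "$gateway" dev "$iface"
}

# Example: set_ipv6_default eth0 2001:db8::1
```

With `DRY_RUN=1` set, the function only prints the two commands, which makes it easy to check what would be run before putting it into a start script or post_cmds.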