kubernetes-sigs / cluster-api-provider-vsphere

Apache License 2.0

CentOS 7 OVA not accepting static ip #583

Closed MaxRink closed 4 years ago

MaxRink commented 5 years ago

/kind bug

What steps did you take and what happened: I'm trying to configure my machines to use static IPs. With the Ubuntu template this works fine, but with the CentOS template the same config, with only the template swapped, fails to bring its network interface up. The VSphereMachine definition I'm using:

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha2
kind: VSphereMachine
metadata:
  name: management-cluster-controlplane-0
  namespace: default
spec:
  datacenter: Bremen
  diskGiB: 100
  memoryMiB: 8192
  network:
    devices:
      - dhcp4: false
        gateway4: 10.92.156.1
        ipAddrs:
          - 10.92.159.240/22
        nameservers:
          - 10.145.139.41
          - 10.145.139.42
        dhcp6: false
        networkName: VLAN151
  numCPUs: 4
  template: ubuntu-1804-kube-v1.15.3
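
As a quick sanity check on the addressing in the spec above (an editorial sketch, not part of the original report), Python's standard `ipaddress` module can confirm that the gateway falls inside the /22 network of the static address, which suggests the addressing itself is not the cause of the failure:

```python
import ipaddress

# Values copied from the VSphereMachine spec above.
iface = ipaddress.ip_interface("10.92.159.240/22")
gateway = ipaddress.ip_address("10.92.156.1")

# The /22 mask keeps the top six bits of the third octet: 159 & 252 = 156.
print(iface.network)             # 10.92.156.0/22
print(gateway in iface.network)  # True
```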

What did you expect to happen: The VM boots and gets an IP configured within the cloud-init process

Anything else you would like to add:

Environment:

andrewsykim commented 5 years ago

This is a known issue, please see https://github.com/kubernetes-sigs/image-builder/issues/48

MnrGreg commented 4 years ago

@MaxRink @akutz I found a small workaround for this, which involves bringing the interface down and renaming it. This assumes the CentOS OVA with VMware vmxnet3 starts with the interface name ens192. See the preKubeadmCommands entry below.

apiVersion: bootstrap.cluster.x-k8s.io/v1alpha2
kind: KubeadmConfig
spec:
  preKubeadmCommands:
  - if [[ $(cat /etc/os-release) =~ "CentOS" ]]; then ip link set ens192 down && ip link set ens192 name id0 && ifup id0 && nmcli con up "System id0"; fi

In addition, the dhcp4: false flag within the kind: VSphereMachine network spec doesn't seem to work with cloud-init: BOOTPROTO=dhcp in ifcfg-id0 is not altered. Again, one can work around this manually with the command below. Just place it above the previous command so that it is applied before the interface comes up.

apiVersion: bootstrap.cluster.x-k8s.io/v1alpha2
kind: KubeadmConfig
spec:
  preKubeadmCommands:
  - sed -i -e 's/BOOTPROTO=dhcp/BOOTPROTO=static/g' /etc/sysconfig/network-scripts/ifcfg-id0

MnrGreg commented 4 years ago

After some digging, it seems that with Cloud-Init Network Config Version 2 the name of the Device Configuration ID object overwrites the Ethernet interface device name on CentOS.

Example - Version 2 metadata:

instance-id: "management-cluster-controlplane-0"
network:
  version: 2
  ethernets:
    id0:
      match:
        macaddress: "00:50:56:a5:1a:78"

The results with the above are:

CentOS 7.7 - not working

Ubuntu 18.04 - working

Options to resolve:

  1. Add the set-name: field to the Cloud-Init Network Config Version 2 specification within pkg/util/constants.go. Hardcode the value to "eth" + the device index (eth0) for each device instance.
  2. Add the set-name: field to the Cloud-Init Network Config Version 2 specification within pkg/util/constants.go. Let the field be user configurable and drop if not specified.
  3. Have upstream cloud-init and centos figure it out. ☹️ eish!

Example - Version 2 metadata with set-name:

instance-id: "management-cluster-controlplane-0"
network:
  version: 2
  ethernets:
    id0:
      match:
        macaddress: "00:50:56:a5:1a:78"
      set-name: "eth0"

The results with the above are:

CentOS 7.7 - working

Ubuntu 18.04 - working

@akutz @andrewsykim what are your thoughts on this?

Option 2 seems least intrusive while also giving more granular control. Should I put a PR together to review?

akutz commented 4 years ago

Hi @MaxRink,

First of all, wow!, what a phenomenal deep-dive into this problem. Kudos to you sir!

Yet, a solution may be simpler than we think...

Both NetworkManager (RHEL) and SystemD Networking use predictable network interface naming, so I'd rather not explicitly fuss with their names. That's why the generated network config doesn't specify a valid name, but rather an ID used to match the MAC address. The fact that this is still causing the device to be renamed on CentOS is definitely troubling.

As an aside, interestingly, this didn't use to happen, so something with netplan must have changed in CentOS 7.7.

The intent is to disable the udev rules to prevent device renaming. There is an Ansible rule and pre-seed entry for Debian in the image builder.

I thought I included something similar (see step #5) for the Kickstart script for CentOS, but I do not see it there. I may have removed it since the author of the linked post stated:

This is irrelevant in CentOS/RHEL 7, but it won’t hurt anything.

I guess it did hurt something...at least with 7.7.

While your approaches look valid, I think it's first worth trying to disable the udev rules for CentOS when building the image by:

  1. Add the line rm -fr /etc/udev/rules.d/70* either above or below line #90 in the kickstart script.

And hopefully that should take care of that.

cc @figo @codenrhoden

p.s. Alternatively, I've also considered replacing NetworkManager in the CentOS image with systemd-networkd. This would bring it to parity with the Debian image as far as networking is concerned. The only caveat is that the cloud-init RedHat distro file would need to support the correct network output, and it currently selects the sysconfig approach :(

MnrGreg commented 4 years ago

Hello @akutz

The /etc/udev/rules.d/70-persistent-net.rules file is being created on first boot. It does not exist within the OVA post build. Also, if one boots the CAPV deployed OVA into single user mode, before cloud-init and the rest of the services have run, the /etc/udev/rules.d/ folder is empty.

I went so far as to symlink the file to /dev/null within the Ansible playbook:

- name: Disable Udev persistent-net rules
  shell: ln -sf /dev/null /etc/udev/rules.d/70-persistent-net.rules

Unfortunately, on first boot something still overwrites that file with the udev rule. I assume this is caused, directly or indirectly, by cloud-init, which also creates the /etc/sysconfig/network-scripts/ifcfg-id0 file.

Will scratch a bit more when I get a gap.

MnrGreg commented 4 years ago

@akutz @figo @codenrhoden

A little more digging, and it seems that if set-name is not defined in the version 2 config, cloud-init's network_state.py falls back to the key of the ethernets dict entry. In our case, "id0":

        for eth, cfg in command.items():
            phy_cmd = {
                'type': 'physical',
                'name': cfg.get('set-name', eth),
            }

https://github.com/canonical/cloud-init/blob/ec6924ea1d321cc87e7414bee7734074590045b8/cloudinit/net/network_state.py#L645

2019-12-05 01:50:14,692 - network_state.py[DEBUG]: v2(ethernets) -> v1(physical): {'subnets': [{'dns_nameservers': ['10.10.10.10'], 'type': 'static', 'gateway': '10.7.7.254', 'address': '10.7.5.102/21'}], 'name': 'id0', 'mac_address': '00:50:56:a5:75:b3', 'type': 'physical', 'wakeonlan': True, 'match': {'macaddress': '00:50:56:a5:75:b3'}}

The sysconfig renderer later uses this name both for the /etc/sysconfig/network-scripts/ifcfg-<name> file name and for the DEVICE= parameter.

https://github.com/canonical/cloud-init/blob/ec6924ea1d321cc87e7414bee7734074590045b8/cloudinit/net/sysconfig.py#L701-L702
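
A minimal, self-contained sketch of the fallback described above. The function name `resolve_names` is hypothetical and only mirrors the `cfg.get('set-name', eth)` expression quoted from network_state.py; it is not cloud-init code and leaves out everything else that module does.

```python
def resolve_names(ethernets):
    """Map each v2 'ethernets' key to the device name that a
    cfg.get('set-name', key) lookup would produce."""
    return {key: cfg.get("set-name", key) for key, cfg in ethernets.items()}


# Config as generated without set-name: the dict key leaks through as the name.
without_set_name = {"id0": {"match": {"macaddress": "00:50:56:a5:1a:78"}}}

# Same config with set-name: the explicit name wins.
with_set_name = {
    "id0": {"match": {"macaddress": "00:50:56:a5:1a:78"}, "set-name": "eth0"}
}

print(resolve_names(without_set_name))  # {'id0': 'id0'}
print(resolve_names(with_set_name))     # {'id0': 'eth0'}
```

This would be consistent with the ifcfg-id0 file seen on CentOS: the renderer is handed 'id0' as if it were a real interface name.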

Probably something we'll need to report to cloud-init but it may take quite a while before it lands in CentOS. Might it be worth reconsidering option 2:

Add the set-name: field to the Cloud-Init Network Config Version 2 specification within pkg/util/constants.go. Let the field be user configurable and drop if not specified.

akutz commented 4 years ago

Ugh. Any side effects for Debian if we add set-name? I will also need to check Photon, or have someone check it for me.

akutz commented 4 years ago

Then again, I could always just make the dict key follow the eth standard...

MnrGreg commented 4 years ago

Debian boots without issue, using whatever is defined in the set-name. kubeadm installs and it joins the cluster.

Photon seems to have some issue. Not sure if it's related. Will need to look into it. Also noticed ImageBuilder is configured with 3.0 and not 3.0 Rev2.

Predictable Network Interface Naming was designed to avoid the seemingly random eth* numbering. By using set-name together with the MAC address match you achieve the same objective, and I would argue in an even more consistent, stable way.

MnrGreg commented 4 years ago

So, interestingly, Photon wasn't applying the cloud-init v2 static IP configuration either, until set-name was specified.

Note the Photon systemd config with Name=id0:

$ cat /etc/systemd/network/10-id0.network
[Match]
Name=id0
[Network]
Address=10.7.5.101/21
Gateway=10.7.7.254
DHCP=no

Then after including the set-name: eth0 within cloud-init v2 config:

$ cat /etc/systemd/network/10-eth0.network
[Match]
Name=eth0
[Network]
Address=10.7.5.101/21
Gateway=10.7.7.254
DHCP=no

@akutz I suspect the project maybe hasn't noticed this as the test scripts don't make use of static IPs? Also, within the Photon OS the eth0 interfaces are configured with DHCP by default and so would come online irrespective.

cloud-init.log - without set-name configured:

2019-12-05 21:11:38,913 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.photon.Distro'>
2019-12-05 21:11:38,914 - __init__.py[DEBUG]: no interfaces to rename
2019-12-05 21:11:38,914 - __init__.py[DEBUG]: Datasource DataSourceVMwareGuestInfo not updated for events: System boot
2019-12-05 21:11:38,914 - stages.py[DEBUG]: No network config applied. Neither a new instance nor datasource network update on 'System boot' event
2019-12-05 21:11:38,914 - handlers.py[DEBUG]: start: init-network/setup-datasource: setting up datasource
2019-12-05 21:11:38,914 - DataSourceVMwareGuestInfo.py[INFO]: got host-info: {'network': {'interfaces': {'by-mac': OrderedDict([('00:50:56:a5:b2:b2', {'ipv6': [{'addr': 'fe80::250:56ff:fea5:b2b2%eth0', 'netmask': 'ffff:ffff:ffff:ffff::/64'}]})]), 'by-ipv4': OrderedDict(), 'by-ipv6': OrderedDict([('fe80::250:56ff:fea5:b2b2%eth0', {'netmask': 'ffff:ffff:ffff:ffff::/64', 'mac': '00:50:56:a5:b2:b2'})])}}, 'hostname': 'localhost', 'local-hostname': 'localhost'}
2019-12-05 21:11:38,916 - handlers.py[DEBUG]: finish: init-network/setup-datasource: SUCCESS: setting up datasource
2019-12-05 21:11:38,917 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/generic-k8s-workload-cluster-1-controlplane-0/user-data.txt - wb: [600] 19502 bytes
2019-12-05 21:11:38,921 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/generic-k8s-workload-cluster-1-controlplane-0/user-data.txt.i - wb: [600] 19801 bytes
2019-12-05 21:11:38,921 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/generic-k8s-workload-cluster-1-controlplane-0/vendor-data.txt - wb: [600] 0 bytes
2019-12-05 21:11:38,923 - util.py[DEBUG]: Writing to /var/lib/cloud/instances/generic-k8s-workload-cluster-1-controlplane-0/vendor-data.txt.i - wb: [600] 308 bytes
2019-12-05 21:11:38,924 - util.py[DEBUG]: Reading from /var/lib/cloud/data/set-hostname (quiet=False)
2019-12-05 21:11:38,924 - util.py[DEBUG]: Read 123 bytes from /var/lib/cloud/data/set-hostname
2019-12-05 21:11:38,924 - cc_set_hostname.py[DEBUG]: Setting the hostname to localhost (localhost)
2019-12-05 21:11:38,924 - util.py[DEBUG]: Reading from /etc/hostname (quiet=False)

cloud-init.log - with set-name: eth0 configured:

2019-12-05 21:57:19,179 - util.py[DEBUG]: Running command ['ip', '-6', 'addr', 'show', 'permanent', 'scope', 'global'] with allowed return codes [0] (shell=False, capture=True)
2019-12-05 21:57:19,190 - util.py[DEBUG]: Running command ['ip', '-4', 'addr', 'show'] with allowed return codes [0] (shell=False, capture=True)
2019-12-05 21:57:19,198 - __init__.py[DEBUG]: no work necessary for renaming of [['00:50:56:a5:b2:b2', 'eth0', 'vmxnet3', '0x07b0']]
2019-12-05 21:57:19,198 - stages.py[INFO]: Applying network configuration from system_cfg bringup=False: {'version': 2, 'ethernets': {'id0': {'match': {'macaddress': '00:50:56:a5:b2:b2'}, 'wakeonlan': True, 'set-name': 'eth0', 'dhcp4': False, 'dhcp6': False, 'addresses': ['10.7.5.101/21'], 'gateway4': '10.7.7.254', 'nameservers': {'addresses': ['10.10.10.10']}}}}
2019-12-05 21:57:19,199 - __init__.py[WARNING]: apply_network_config is not currently implemented for distribution '<class 'cloudinit.distros.photon.Distro'>'.  Attempting to use apply_network
2019-12-05 21:57:19,199 - network_state.py[DEBUG]: v2(ethernets) -> v1(physical):
{'type': 'physical', 'name': 'eth0', 'mac_address': '00:50:56:a5:b2:b2', 'match': {'macaddress': '00:50:56:a5:b2:b2'}, 'wakeonlan': True, 'subnets': [{'type': 'static', 'address': '10.7.5.101/21', 'gateway': '10.7.7.254', 'dns_nameservers': ['10.10.10.10']}]}
2019-12-05 21:57:19,206 - network_state.py[DEBUG]: v2_common: handling config:
{'id0': {'match': {'macaddress': '00:50:56:a5:b2:b2'}, 'wakeonlan': True, 'set-name': 'eth0', 'dhcp4': False, 'dhcp6': False, 'addresses': ['10.7.5.101/21'], 'gateway4': '10.7.7.254', 'nameservers': {'addresses': ['10.10.10.10']}}}
2019-12-05 21:57:19,207 - photon.py[DEBUG]: Translated ubuntu style network settings # Converted from network_config for distro <class 'cloudinit.distros.photon.Distro'>
 Implementation of _write_network_config is needed.
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
    hwaddress 00:50:56:a5:b2:b2
    address 10.7.5.101/21
    dns-nameservers 10.10.10.10
    gateway 10.7.7.254
 into {'lo': {'ipv6': {}, 'auto': True}, 'eth0': {'ipv6': {}, 'bootproto': 'static', 'address': '10.7.5.101', 'gateway': '10.7.7.254', 'netmask': '255.255.248.0', 'broadcast': '10.7.7.255', 'dns-nameservers': ['10.10.10.10'], 'auto': True}}
2019-12-05 21:57:19,208 - util.py[DEBUG]: Writing to /etc/systemd/network/10-eth0.network - wb: [644] 77 bytes

akutz commented 4 years ago

Hi @MaxRink,

You know better than I at this point. What makes more sense? Making the prefix of the scalar id0 configurable or implementing set-name? From what you said it's the same thing. I was thinking we'd make the scalar prefix configurable and default it to eth. Unless that means Photon will always treat that device as DHCP regardless?

MnrGreg commented 4 years ago

Hey Andrew, Greg here.

Either will work for now, but if we go the route of naming the scalar eth*, then when the issue with the Device Configuration ID is eventually fixed in cloud-init v2, our eth*-named devices will likely revert to their Predictable Network Interface Names, e.g. ens192. This will cause confusion.

If we were to leave the Device Configuration IDs as id* and use set-name with eth*, then nothing will change when cloud-init v2 is fixed.

The question is: should we hardcode set-name to eth* in pkg/util/constants.go, or should we make it user configurable? Or, maybe better, should we make set-name user configurable but fall back to eth{{ $i }} if it is not configured?

For example:

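network:
  version: 2
  ethernets:
    {{- range $i, $net := .Devices }}
    id{{ $i }}:
      match:
        macaddress: "{{ $net.MACAddr }}"
      {{- /* SetName is a hypothetical field; a hyphenated name such as set-name cannot be accessed with dot syntax in a Go template. */}}
      {{- if $net.SetName }}
      set-name: "{{ $net.SetName }}"
      {{- else }}
      set-name: "eth{{ $i }}"
      {{- end }}
    {{- end }}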

If this makes sense, and it helps you out, I can submit the PR for this?
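
The defaulting proposed above can be sketched in a few lines of Python. This is only an illustration of the logic, not the CAPV template: `build_ethernets` and the `mac` / `set_name` keys are hypothetical names, and the second device's name `mgmt0` is made up for the example.

```python
def build_ethernets(devices):
    """Build the 'ethernets' mapping of a cloud-init v2 network config.

    Each device is a dict with a 'mac' key and an optional 'set_name' key
    (both key names are hypothetical).
    """
    ethernets = {}
    for i, dev in enumerate(devices):
        ethernets[f"id{i}"] = {
            "match": {"macaddress": dev["mac"]},
            # Fall back to eth<i> when no explicit name was requested.
            "set-name": dev.get("set_name") or f"eth{i}",
        }
    return ethernets


config = {
    "version": 2,
    "ethernets": build_ethernets(
        [
            {"mac": "00:50:56:a5:1a:78"},
            {"mac": "00:50:56:a5:75:b3", "set_name": "mgmt0"},
        ]
    ),
}

print(config["ethernets"]["id0"]["set-name"])  # eth0
print(config["ethernets"]["id1"]["set-name"])  # mgmt0
```

Keeping the dict keys as id* and always emitting set-name means the rendered device name no longer depends on how cloud-init treats the key.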

akutz commented 4 years ago

sgtm @MaxRink. If you submit the PR I will take a look ASAP. I also need to remember to open a PR that augments our e2e to deploy a machine from all three templates we use - CentOS, Ubuntu, and Photon. I know we technically need to test static IPs as well, but that would involve our test infra managing some static IP pool to avoid conflicts. :(

MnrGreg commented 4 years ago

Yup, I noticed some Issues such as https://github.com/kubernetes-sigs/cluster-api/issues/1740 that mention possibly incorporating Static IP use.

Perhaps we need an IPAM Controller, similar to the envisaged Load Balancer Controller, to address this. I can imagine there must be a number of customers that use IPAM rather than DHCP on-premises. Having stable IPs might also simplify the Load Balancer Controller configuration.

akutz commented 4 years ago

ping @MaxRink - we need to figure out how to resolve this I think...

akutz commented 4 years ago

/close

k8s-ci-robot commented 4 years ago

@akutz: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/issues/583#issuecomment-573153553):

>/close