projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

networking_calico: dhcp agent doesn't support multiple segmentations per network with Openstack version 2023.1 or newer #9216

Open sp3c1k opened 2 months ago

sp3c1k commented 2 months ago

Hello,

After upgrading our OpenStack deployment to version 2024.1 (Caracal), we encountered errors from the DHCP agent.

2024-09-10 14:06:09.628 3173368 ERROR neutron.agent.dhcp.agent [-] Unable to restart dhcp for b51178a6-eb4f-4a9e-b3cc-fd93e053ef18.: TypeError: DnsmasqRouted.__init__() takes from 4 to 6 positional arguments but 7 were given
2024-09-10 14:06:09.628 3173368 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2024-09-10 14:06:09.628 3173368 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 268, in _call_driver
2024-09-10 14:06:09.628 3173368 ERROR neutron.agent.dhcp.agent     driver = self.dhcp_driver_cls(self.conf,
2024-09-10 14:06:09.628 3173368 ERROR neutron.agent.dhcp.agent TypeError: DnsmasqRouted.__init__() takes from 4 to 6 positional arguments but 7 were given

From what I have found, this seems to be caused by https://review.opendev.org/c/openstack/neutron/+/840421 (dhcp: support multiple segmentations per network), which added a segment argument to the driver call here: https://github.com/openstack/neutron/blob/5f42221e3b5b9f7e0c391e7c9b88ca93a41914ec/neutron/agent/dhcp/agent.py#L263

The networking_calico DHCP agent is currently incompatible with this change because its driver does not accept that argument:

class DnsmasqRouted(dhcp.Dnsmasq):
    """Dnsmasq DHCP driver for routed virtual interfaces."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None):
        super(DnsmasqRouted, self).__init__(conf, network, process_monitor,
                                            version, plugin)
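The failure can be reproduced without Neutron. The sketch below uses hypothetical stand-in classes (NewBase, FixedArgsDriver, try_construct are illustrative names, not the real Neutron or networking-calico code) and assumes, as the traceback suggests, that the newer agent passes six positional arguments to the driver constructor:

```python
# Hypothetical stand-ins; these are NOT the real Neutron or Calico classes.
class NewBase:
    """Mimics a base driver that gained an extra 'segment' argument."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None, segment=None):
        self.segment = segment


class FixedArgsDriver(NewBase):
    """Mimics a subclass whose signature predates the 'segment' argument."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None):
        super().__init__(conf, network, process_monitor, version, plugin)


def try_construct():
    """Call the driver with six positional arguments (assumed call shape)."""
    try:
        FixedArgsDriver("conf", "net", "monitor", None, "plugin", "segment")
    except TypeError as exc:
        return str(exc)
    return None


# The message counts 'self', hence "7 were given" for six explicit arguments.
print(try_construct())
```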

We ad-hoc fixed it by adding *args to the __init__ method of class DnsmasqRouted(dhcp.Dnsmasq):

class DnsmasqRouted(dhcp.Dnsmasq):
    """Dnsmasq DHCP driver for routed virtual interfaces."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None, *args):
        if args:
            super(DnsmasqRouted, self).__init__(conf, network, process_monitor,
                                                version, plugin, *args)
        else:
            super(DnsmasqRouted, self).__init__(conf, network, process_monitor,
                                                version, plugin)
        self.device_manager = CalicoDeviceManager(self.conf, plugin)

We chose this approach because it should keep compatibility with older OpenStack versions (Zed and older). Simply adding segment=None and passing it on would not work, because the DhcpBase class in older OpenStack versions does not accept this argument. https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/agent/linux/dhcp.py#L183-L191
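A minimal sketch of why forwarding *args works against both base signatures follows. OldBase, NewBase and make_driver are hypothetical names for illustration only. Since an empty *args expands to nothing, the if/else branch in the workaround above could probably be collapsed into a single super() call:

```python
# Hypothetical stand-ins for the pre- and post-change base classes.
class OldBase:
    """Base signature without 'segment' (assumed shape for Zed and older)."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None):
        self.segment = None


class NewBase:
    """Base signature with the extra 'segment' argument."""

    def __init__(self, conf, network, process_monitor,
                 version=None, plugin=None, segment=None):
        self.segment = segment


def make_driver(base):
    """Build a subclass of `base` that forwards extra positional args."""

    class ForwardingDriver(base):
        def __init__(self, conf, network, process_monitor,
                     version=None, plugin=None, *args):
            # With no extra arguments, *args is empty and this is the
            # same as the five-argument call.
            super().__init__(conf, network, process_monitor,
                             version, plugin, *args)

    return ForwardingDriver


# Older caller: five arguments, older base.
old_driver = make_driver(OldBase)("conf", "net", "monitor", None, "plugin")

# Newer caller: six arguments, newer base.
new_driver = make_driver(NewBase)("conf", "net", "monitor", None, "plugin",
                                  "segment-1")
```

Note that this only tolerates the extra argument. An older base class would still raise TypeError if a caller actually passed a sixth argument to it.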

Then a second problem appeared:

2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent [-] Unable to restart dhcp for b51178a6-eb4f-4a9e-b3cc-fd93e053ef18.: AttributeError: 'FakePlugin' object has no attribute 'get_ports'
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent Traceback (most recent call last):
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/dhcp/agent.py", line 274, in _call_driver
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     rv = getattr(driver, action)(**action_kwargs)
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 208, in restart
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     self.enable()
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 325, in enable
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     common_utils.wait_until_true(self._enable, timeout=300)
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/common/utils.py", line 742, in wait_until_true
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     while not predicate():
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 340, in _enable
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     self.spawn_process()
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 589, in spawn_process
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     self._spawn_or_reload_process(reload_with_HUP=False)
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 598, in _spawn_or_reload_process
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     self._output_config_files()
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 652, in _output_config_files
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     self._output_opts_file()
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1205, in _output_opts_file
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     options, subnet_index_map = self._generate_opts_per_subnet()
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1287, in _generate_opts_per_subnet
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     ovn_metadata_port_ip = self._get_ovn_metadata_port_ip(subnet)
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent   File "/usr/lib/python3/dist-packages/neutron/agent/linux/dhcp.py", line 1214, in _get_ovn_metadata_port_ip
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent     ports_result = self.device_manager.plugin.get_ports(
2024-09-10 15:22:49.344 3177378 ERROR neutron.agent.dhcp.agent AttributeError: 'FakePlugin' object has no attribute 'get_ports'

This seems to be related to:

From my understanding, this change was made because the OVN metadata route was not added correctly when using neutron-dhcp-agent with ml2/ovn on ironic/baremetal nodes. The DHCP agent now calls get_ports(), which is exposed on the server side by the MetadataRpcCallback API. Since Calico uses FakePlugin to support the various calls that neutron.agent.linux.dhcp.Dnsmasq makes to what it thinks is the Neutron database, I think it is necessary to add get_ports() to FakePlugin.

This is what we added and it seems to have resolved our problem:

    def get_ports(self, port_filters):
        return None

I think it should be okay if it returns None, because:

Because this class doesn't speak to the real Neutron database, it follows that the DHCP interface that we create on each compute host does not show up as a port in the Neutron database. That doesn't matter, because we don't allocate a unique IP for each DHCP port, and hence don't consume any IPs that the Neutron database ought to know about.

If the DHCP interfaces we create do not end up in the Neutron DB to begin with, the call should return None anyway.
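As a rough illustration of why a "no ports" result is harmless, here is a sketch with hypothetical names (FakePluginSketch, metadata_port_ip). It assumes the caller only uses the result when it is truthy; the real Neutron code may differ. It returns an empty list rather than None, which is equally falsy but also safe to iterate:

```python
class FakePluginSketch:
    """Hypothetical stand-in for FakePlugin; not the real class."""

    def get_ports(self, port_filters=None):
        # In this model no DHCP ports exist in the Neutron DB, so report
        # that nothing matches the filters.
        return []


def metadata_port_ip(plugin, network_id):
    """Hypothetical caller: look up a metadata port IP, if any exists."""
    ports = plugin.get_ports(
        port_filters={"device_id": ["ovnmeta-" + network_id]})
    if ports:
        return ports[0]["fixed_ips"][0]["ip_address"]
    # Empty list and None both end up here.
    return None


ip = metadata_port_ip(FakePluginSketch(), "b51178a6")
```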

Expected Behavior

Networking-calico and its DHCP agent implementation should work with newer OpenStack releases (2023.1 and newer).

Current Behavior

The current DHCP agent implementation fails with the errors shown above on newer OpenStack releases (2023.1 and newer).

Possible Solution

As written above, but it would be great if someone else could confirm that these changes make sense and do not break current behavior.

Steps to Reproduce (for bugs)

  1. Use OpenStack release 2023.1 or newer with Calico as the network plugin and as the DHCP agent

Context

This issue was encountered after upgrading OpenStack to a newer version.

Your Environment

matthewdupre commented 2 months ago

Makes sense to me - we'd take a PR to fix the Caracal gaps. @nelljerram WDYT in terms of approach (especially use of *args)?

nelljerram commented 2 months ago

All sounds good to me. For the avoidance of doubt, I believe we are only talking here about fixing up our coding to tolerate new args in some calls, and we are not talking about honouring the Neutron DB concept of segments in detail. IIRC segments were introduced as part of the idea of a Neutron network being partly routed - i.e. it has various L2 segments, with L3 routing between them. Calico has always been completely L3 routed - i.e. there is no L2 adjacency between workloads - and so it does not make sense for Calico to honour / implement the segment resource in detail.

Probably that was already well known to everyone here - but I thought I should say it just to make sure!

sp3c1k commented 2 months ago

All sounds good to me. For the avoidance of doubt, I believe we are only talking here about fixing up our coding to tolerate new args in some calls, and we are not talking about honouring the Neutron DB concept of segments in detail.

We use Calico as the core plugin for Neutron, so we depend entirely on the L3 networking Calico provides. I believe, at least for our use case, it should be sufficient to support the new arguments that the new Neutron code passes to some calls, and not the whole concept of Neutron segments.