kubernetes-sigs / cluster-api-provider-openstack

Cluster API implementation for OpenStack
https://cluster-api-openstack.sigs.k8s.io/
Apache License 2.0

Incorrect FloatingIP workflow #1985

Closed serge-name closed 3 months ago

serge-name commented 3 months ago

/kind bug

What steps did you take and what happened: I tried a CAPO build of commit 1d5d2d5e45462dab056e37a6c948361e81875ea9. Some key details follow:

1) Created an OpenStackFloatingIPPool (irrelevant fields removed)

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: OpenStackFloatingIPPool
metadata:
  name: osfipp
spec:
  floatingIPNetwork:
    id: c7c8509d-7083-41c9-b799-e30e855e9bc0
  reclaimPolicy: Delete
  # …

2) Created a MachineDeployment and an OpenStackMachineTemplate

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: OpenStackMachineTemplate
metadata:
  name: some
spec:
  template:
    spec:
      ports:
        - network:
            id: f16855bf-8ba1-4f75-ad8c-763e80134571
      floatingIPPoolRef:
        apiGroup: infrastructure.cluster.x-k8s.io/v1beta1
        kind: OpenStackFloatingIPPool
        name: osfipp
# …

✅ Floating IP was successfully created. Here we get correct data fip.FloatingIP == "185.***.**.**", fip.FloatingNetworkID == "c7c8509d-7083-41c9-b799-e30e855e9bc0": https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1d5d2d5e45462dab056e37a6c948361e81875ea9/controllers/openstackmachine_controller.go#L440-L443

❌ Here we get port == nil and an error "Failed while associating ip from pool: port for floating IP \"185.***.**.**\" on network c7c8509d-7083-41c9-b799-e30e855e9bc0 does not exist": https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1d5d2d5e45462dab056e37a6c948361e81875ea9/controllers/openstackmachine_controller.go#L450-L458

More details follow.

Here: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1d5d2d5e45462dab056e37a6c948361e81875ea9/pkg/cloud/services/networking/port.go#L65

The OpenStack API returns the following (irrelevant fields skipped):

{
  "ports": [
    {
      "device_id": "d1b99e45-991c-4143-93a3-9a8d3eddb416",
      "device_owner": "compute:nova",
      "fixed_ips": [
        {
          "ip_address": "10.21.10.29",
          "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e"
        }
      ],
      "id": "0d1fe3bd-55f6-41d0-b879-a4071a15b5c0",
      "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571"
// …
    }
  ]
}

Please note that there is no port associated with the FIP network c7c8509d-7083-41c9-b799-e30e855e9bc0. Neither the FIP network ID nor the FIP itself will appear in the port info, because in our OpenStack cloud floating IPs are not added to ports directly. Instead, NAT between 185.***.**.** and 10.21.10.29 is set up.

If the new k8s node got a FIP, it could be found here: https://compute-api:8774/v2.1/TENANT_ID/servers/d1b99e45-991c-4143-93a3-9a8d3eddb416

The reply might look like this (irrelevant fields skipped):

{ "server": {
    "id": "d1b99e45-991c-4143-93a3-9a8d3eddb416",
    "hci_info": {
      "network": [
        {
          "ips": [
            "10.21.10.29"
          ],
          "network": {
            "id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
            "subnets": [
              {
                "ips": [
                  {
                    "address": "10.21.10.29",
                    "type": "fixed",
                    "version": 4,
                    "floating_ips": [
                      {
                        "address": "185.***.**.**",
                        "type": "floating",
                        "version": 4
                      }
                    ]
                  } ] } ] } } ] } } }

Here it tries to find a fixed IP in the FIP network, but in our OpenStack cloud all FIPs have device_owner == "network:floatingip", so it just gets an empty list: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1d5d2d5e45462dab056e37a6c948361e81875ea9/pkg/cloud/services/networking/port.go#L71-L76

What did you expect to happen: Successfully deployed k8s node with FIP attached.

Anything else you would like to add: None so far, but please ask if you need any details. The issue is reproducible and I can add more details if you want.

Environment:

mdbooth commented 3 months ago

/cc @huxcrux @bilbobrovall

bilbobrovall commented 3 months ago

What does f16855bf-8ba1-4f75-ad8c-763e80134571 look like? Does it have a router?

It's not really documented, but we don't create any new ports for the FIPs. We just look for an existing port that the FIP can be attached to, by checking whether there's a port on a subnet that has a router attached to the floating IP network.

I've mostly tested it with spec.ports omitted and the default setup, but I can test it with something closer to your setup if I know more about how that network is set up.
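The lookup described above can be sketched roughly as follows. This is a simplified illustration, not CAPO's actual code: the `Port` and `Router` types, the `portForExternalNetwork` function and the placeholder IDs are all hypothetical, and the real controller queries Neutron rather than in-memory slices. The device owner used to recognise router interface ports is passed in as a parameter.

```go
package main

import "fmt"

// Port and Router are hypothetical, trimmed-down stand-ins for Neutron objects.
type Port struct {
	ID          string
	DeviceID    string
	DeviceOwner string
	SubnetIDs   []string
}

type Router struct {
	ID               string
	GatewayNetworkID string // network_id from external_gateway_info
}

// sharesSubnet reports whether two ports have at least one subnet in common.
func sharesSubnet(a, b Port) bool {
	for _, x := range a.SubnetIDs {
		for _, y := range b.SubnetIDs {
			if x == y {
				return true
			}
		}
	}
	return false
}

// portForExternalNetwork returns the first port of the instance that sits on a
// subnet with a router interface whose router has its gateway on externalNetID.
// It returns nil when no such port exists. No port is created.
func portForExternalNetwork(instanceID, externalNetID, routerOwner string, ports []Port, routers map[string]Router) *Port {
	for i := range ports {
		p := &ports[i]
		if p.DeviceID != instanceID {
			continue
		}
		for _, rp := range ports {
			if rp.DeviceOwner != routerOwner || !sharesSubnet(*p, rp) {
				continue
			}
			if r, ok := routers[rp.DeviceID]; ok && r.GatewayNetworkID == externalNetID {
				return p
			}
		}
	}
	return nil
}

func main() {
	// Placeholder data, not taken from any real cloud.
	ports := []Port{
		{ID: "instance-port", DeviceID: "vm-1", DeviceOwner: "compute:nova", SubnetIDs: []string{"subnet-a"}},
		{ID: "router-port", DeviceID: "router-1", DeviceOwner: "network:router_interface", SubnetIDs: []string{"subnet-a"}},
	}
	routers := map[string]Router{"router-1": {ID: "router-1", GatewayNetworkID: "external-net"}}

	if p := portForExternalNetwork("vm-1", "external-net", "network:router_interface", ports, routers); p != nil {
		fmt.Println("floating IP can be attached to port", p.ID)
	} else {
		fmt.Println("no suitable port found")
	}
}
```

If no router interface port matches the given device owner, the function returns nil, which corresponds to the "port ... does not exist" error reported in this issue.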

serge-name commented 3 months ago

Yes, I meant that the new port is normally created by OpenStack, but not in our cloud. I'm not very familiar with OpenStack internals and don't have access to any configuration other than our particular cloud.

GET https://compute-api:9696/v2.0/networks/f16855bf-8ba1-4f75-ad8c-763e80134571
{
  "network": {
    "id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
    "name": "internal",
    "tenant_id": "278fda03174b4fee9358559baffca010",
    "admin_state_up": true,
    "mtu": 8913,
    "default_vnic_type": null,
    "status": "ACTIVE",
    "subnets": [
      "616388c0-519f-418e-80b4-3687a546a65e"
    ],
    "shared": false,
    "availability_zone_hints": [],
    "availability_zones": [
      "nova"
    ],
    "ipv4_address_scope": null,
    "ipv6_address_scope": null,
    "router:external": false,
    "description": "",
    "port_security_enabled": true,
    "rbac_policies": [
      {
        "id": "c869c7ef-3c51-4fb6-88f5-c591989fe3ef",
        "action": "access_as_shared",
        "target_tenant": "d278dea8631e47ffba5a908265968fbb"
      }
    ],
    "qos_policy_id": null,
    "tags": [],
    "created_at": "2024-02-06T12:43:10Z",
    "updated_at": "2024-03-20T20:39:09Z",
    "revision_number": 5,
    "project_id": "278fda03174b4fee9358559baffca010",
    "provider:network_type": "vxlan"
  }
}
GET https://compute-api:9696/v2.0/routers/7142d8f1-2b11-4ae2-a343-eacd77a2ceee
{
  "router": {
    "id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee",
    "name": "DefaultRouter",
    "tenant_id": "278fda03174b4fee9358559baffca010",
    "admin_state_up": true,
    "status": "ACTIVE",
    "external_gateway_info": {
      "network_id": "c7c8509d-7083-41c9-b799-e30e855e9bc0",
      "external_fixed_ips": [
        {
          "subnet_id": "aa2bc8f7-fa02-4851-ba13-93e57d4c69e1",
          "ip_address": "69.**.**.**"
        }
      ],
      "enable_snat": true
    },
    "description": "",
    "availability_zones": [
      "nova"
    ],
    "availability_zone_hints": [],
    "routes": [
    ],
    "flavor_id": null,
    "tags": [],
    "created_at": "2024-02-06T11:49:58Z",
    "updated_at": "2024-03-29T14:41:39Z",
    "revision_number": 17,
    "project_id": "278fda03174b4fee9358559baffca010"
  }
}

That router's external_fixed_ips entry is pre-created automatically by OpenStack.

If a VM has a FIP attached, outgoing connections are SNAT'ed from that FIP. If a VM has no FIP, connections are SNAT'ed from the router's external IP.
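The SNAT rule described above boils down to a simple choice of source address. A minimal sketch, assuming an empty string means "no FIP attached"; the function name `snatSource` and the documentation-range addresses are made up for illustration:

```go
package main

import "fmt"

// snatSource returns the source address that external hosts would see for
// outgoing connections from a VM, following the rule described in this thread:
// the VM's floating IP if it has one, otherwise the router's external IP.
func snatSource(floatingIP, routerExternalIP string) string {
	if floatingIP != "" {
		return floatingIP
	}
	return routerExternalIP
}

func main() {
	// Addresses from the RFC 5737 documentation ranges, not from the real cloud.
	fmt.Println(snatSource("203.0.113.10", "198.51.100.1")) // VM with a FIP
	fmt.Println(snatSource("", "198.51.100.1"))             // VM without a FIP
}
```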

GET https://compute-api:9696/v2.0/ports?device_id=7142d8f1-2b11-4ae2-a343-eacd77a2ceee
{
  "ports": [
    {
      "id": "0411af2f-d447-4f3c-88a7-1e8a57e70015",
      "name": "",
      "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
      "tenant_id": "",
      "mac_address": "fa:16:3e:44:38:7e",
      "admin_state_up": true,
      "status": "ACTIVE",
      "device_id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee",
      "device_owner": "network:router_centralized_snat",
      "fixed_ips": [
        {
          "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e",
          "ip_address": "10.21.11.1"
        }
      ],
      "allowed_address_pairs": [],
      "extra_dhcp_opts": [],
      "security_groups": [],
      "description": "",
      "binding:vnic_type": "normal",
      "port_security_enabled": false,
      "qos_policy_id": null,
      "qos_network_policy_id": null,
      "tags": [],
      "created_at": "2024-02-06T14:02:02Z",
      "updated_at": "2024-03-23T18:11:57Z",
      "revision_number": 40,
      "project_id": ""
    },
    {
      "id": "ded9eafe-3ee0-4f29-9f7f-953470f3a3ae",
      "name": "",
      "network_id": "f16855bf-8ba1-4f75-ad8c-763e80134571",
      "tenant_id": "278fda03174b4fee9358559baffca010",
      "mac_address": "fa:16:3e:48:d2:da",
      "admin_state_up": true,
      "status": "ACTIVE",
      "device_id": "7142d8f1-2b11-4ae2-a343-eacd77a2ceee",
      "device_owner": "network:router_interface_distributed",
      "fixed_ips": [
        {
          "subnet_id": "616388c0-519f-418e-80b4-3687a546a65e",
          "ip_address": "10.21.10.1"
        }
      ],
      "allowed_address_pairs": [],
      "extra_dhcp_opts": [],
      "security_groups": [],
      "description": "",
      "binding:vnic_type": "normal",
      "port_security_enabled": false,
      "qos_policy_id": null,
      "qos_network_policy_id": null,
      "tags": [],
      "created_at": "2024-02-06T14:02:02Z",
      "updated_at": "2024-04-02T10:33:28Z",
      "revision_number": 68,
      "project_id": "278fda03174b4fee9358559baffca010"
    }
  ]
}

I've already come up with a quick fix: https://github.com/serge-name/cluster-api-provider-openstack/commit/bb19917957b82959f8406ed9778eebf82ebd7855 works fine so far. Right now I am too short on time to create a decent PR.

bilbobrovall commented 3 months ago

https://github.com/kubernetes-sigs/cluster-api-provider-openstack/blob/1d5d2d5e45462dab056e37a6c948361e81875ea9/pkg/cloud/services/networking/port.go#L73 Does it work for you if you replace network:router_interface with network:router_interface_distributed?

serge-name commented 3 months ago

Yes, network:router_interface_distributed works absolutely fine, as in this commit: https://github.com/serge-name/cluster-api-provider-openstack/commit/a1bf5b88e40b9bc6c5d5f5208628a3e0193e70fe
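Both router ports in the listing above carry device owners other than plain network:router_interface, which is why an exact match on that string finds nothing in this cloud. One way to make such a check tolerant of different router types is to accept a set of owners. The sketch below is an illustration only and not necessarily what the linked commits do; the helper name `isRouterInterface` is hypothetical, and the owner list is an assumption based on Neutron's device owner constants that should be verified against the target cloud. network:router_centralized_snat is deliberately left out of the set in this sketch.

```go
package main

import "fmt"

// routerInterfaceOwners lists device_owner values treated as router interfaces
// in this sketch: legacy, distributed (DVR) and HA routers. Assumed list.
var routerInterfaceOwners = map[string]bool{
	"network:router_interface":               true,
	"network:router_interface_distributed":   true,
	"network:ha_router_replicated_interface": true,
}

// isRouterInterface reports whether a port's device_owner marks it as a
// router interface according to the set above.
func isRouterInterface(deviceOwner string) bool {
	return routerInterfaceOwners[deviceOwner]
}

func main() {
	owners := []string{
		"network:router_interface_distributed",
		"network:router_centralized_snat",
		"compute:nova",
	}
	for _, owner := range owners {
		fmt.Printf("%-40s %v\n", owner, isRouterInterface(owner))
	}
}
```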

serge-name commented 3 months ago

@bilbobrovall thanks a lot! Your commit https://github.com/elastx/cluster-api-provider-openstack/commit/ce38e8be47b3899ad8dc68e49608fdb2ffcd4a4d works fine for me and fixes the issue.

There are several minor errors due to premature and frequent (8 API requests in 2 seconds) checks for the FIP. Not a problem for me, just something that can be improved later. Logs follow:

minor_errors.txt

bilbobrovall commented 3 months ago

👍 It's probably just Neutron taking some time. I think the retries should be fine for now, since there's an exponential backoff when a reconciler returns the same error, but the initial retries feel a bit tight in this case.
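To illustrate why the first retries come so close together, here is a minimal sketch of a per-item exponential backoff. The function `backoffDelay` is hypothetical. The 5 ms base and 1000 s cap are assumed here to match the defaults of client-go's per-item failure rate limiter; treat them as assumptions rather than confirmed CAPO settings.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before the next retry after `failures`
// consecutive failures: base * 2^failures, capped at max.
func backoffDelay(base, max time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	base, max := 5*time.Millisecond, 1000*time.Second // assumed defaults
	var total time.Duration
	for n := 0; n < 8; n++ {
		d := backoffDelay(base, max, n)
		total += d
		fmt.Printf("retry %d after %v (cumulative %v)\n", n+1, d, total)
	}
}
```

With these assumed values the first eight delays are 5, 10, 20, 40, 80, 160, 320 and 640 ms, about 1.3 seconds in total, which is in line with the burst of requests seen in the logs. A larger base delay would spread the early retries out.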