openstack_networking_port_v2 all_fixed_ips is empty after creation (first run) but filled after rm from state (second run)

frittentheke commented 1 year ago

Terraform Version

Terraform v1.5.5

provider registry.terraform.io/terraform-provider-openstack/openstack v1.52.1

Affected Resource(s)

openstack_networking_port_v2

Terraform Configuration Files

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]
}

resource "openstack_networking_router_route_v2" "vpn" {
  router_id        = var.router_id
  destination_cidr = var.cidr
  next_hop         = openstack_networking_port_v2.vpn.all_fixed_ips[0]
}

Expected Behavior

The openstack_networking_port_v2 is used as the interface for an instance providing a VPN service. A port is used in order to have a fixed / known IP address. I would expect the IP of the port_v2 to be returned as the first element in the all_fixed_ips array, to then be used as the next_hop for a static route.

In short: I want to route the network behind the VPN to the corresponding instance via a static route.

Actual Behavior

There are two errors thrown in relation to the port just created:

[...]
│ Error: Error creating OpenStack server: Bad request with: [POST https://compute.region.cloud.example.com/v2.1/servers], error message: {"badRequest": {"code": 400, "message": "Port ca6b83fd-e624-4e91-8dac-db291be55a42 requires a FixedIP in order to be used."}}
│ 
│   with module.vpn-server.openstack_compute_instance_v2.vpn,
│   on .terraform/modules/vpn-server/server/main.tf line 154, in resource "openstack_compute_instance_v2" "vpn":
│  154: resource "openstack_compute_instance_v2" "vpn" {
│ 
╵
╷
│ Error: Invalid index
│ 
│   on .terraform/modules/vpn-server/server/main.tf line 181, in resource "openstack_networking_router_route_v2" "vpn":
│  181:   next_hop         = openstack_networking_port_v2.vpn.all_fixed_ips[0]
│     ├────────────────
│     │ openstack_networking_port_v2.vpn.all_fixed_ips is empty list of string
│ 
│ The given key does not identify an element in this collection value: the collection has no elements.
[...]

causing the terraform run to abort with an error.
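
One way to make this failure easier to read is a postcondition on the port, so that the run stops at the port itself rather than with "Invalid index" in a dependent resource. This is only a sketch, assuming Terraform 1.2 or later (which introduced `postcondition` blocks); the error message text is illustrative. The check only reports the empty list, it does not make Neutron allocate an address.

```hcl
resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  lifecycle {
    # Checked once the port's attributes are known, i.e. after it has been
    # created or read back from the API.
    postcondition {
      condition     = length(self.all_fixed_ips) > 0
      error_message = "Port has no fixed IP (all_fixed_ips is empty)."
    }
  }
}
```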

Steps to Reproduce

1. Run terraform apply; it ends with the error above.
2. Run terraform state rm openstack_networking_port_v2.vpn
3. Run terraform apply again; this time things work just fine.

It seems the port takes longer to be fully created and initialized, and the provider moves on too early. Would a refresh of the just-created resource be enough? Or does some other indication that the port has actually finished provisioning have to be tracked via the API?

References

The issue https://github.com/terraform-provider-openstack/terraform-provider-openstack/issues/1384 seems to also target the return addresses.

nikParasyr commented 1 year ago

@frittentheke thanks for reporting this.

I speculate that this is occurring because the OpenStack API responds to the port creation before an IP gets allocated by the DHCP server, and the following GET for the port also happens before an IP is allocated.

Can you run the following scenario for me:

1. Run initial terraform apply
2. Optional: check how the port resource is written in the state
3. Re-run terraform apply (without any changes to the code)
4. Check again the state

frittentheke commented 1 year ago

@nikParasyr thanks for the fast response!

I did run the scenario you asked for. Full disclosure, there are quite a few resources spawned and the port for the instance running the VPN services is created by a terraform module .... but here we go:

1. terraform apply ended with the error I reported about all_fixed_ips being empty.

2. This is the state for the resource:


terraform state show module.vpn-server.openstack_networking_port_v2.vpn
# module.vpn-server.openstack_networking_port_v2.vpn:
resource "openstack_networking_port_v2" "vpn" {
    admin_state_up         = true
    all_fixed_ips          = []
    all_security_group_ids = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    all_tags               = []
    dns_assignment         = []
    id                     = "9b37978b-ed53-41c2-983f-31570eb88259"
    mac_address            = "fa:16:3e:3a:58:ec"
    name                   = "vpn"
    network_id             = "f946cedc-94d1-4bde-a680-f59d615ad2e3"
    port_security_enabled  = true
    region                 = "fra"
    security_group_ids     = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    tenant_id              = "REDACTED"

    allowed_address_pairs {
        ip_address = "10.3.4.0/24"
    }

    binding {
        vif_details = {}
        vnic_type   = "normal"
    }
}

so `all_fixed_ips` is empty.

3. Running terraform apply again does not work; it ends with the same error about `all_fixed_ips`.
4. State remains unchanged.

But we dug a little deeper:

1. `terraform refresh` does NOT update the `all_fixed_ips` (if called implicitly by the apply or explicitly)
2. `terraform apply -target module.vpn-server.openstack_networking_port_v2.vpn` does "work", but finds nothing that needs changing. So also then the field is not populated.
3. Certainly, the `terraform state rm module.vpn-server.openstack_networking_port_v2.vpn` which I did initially caused a new port to be created, leaving the first one dangling. But then `all_fixed_ips` was set, so the other resource referring to it worked fine.
4. As I said, there are quite a few resources in this terraform code, so I believe this is why the port resource does not "work" the same way on the first apply, which creates everything, as on the second attempt, where only this port and the static route are changed / created via the API. Read: convergence time.

This is the terraform debug output / openstack API response to the port creation (initial terraform apply):

[...]
2023-08-23T12:06:29.893+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request URL: GET https://network.region.cloud.example.com/v2.0/ports?id=9b37978b-ed53-41c2-983f-31570eb88259: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.893+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request Headers:
Accept: application/json
Cache-Control: no-cache
User-Agent: HashiCorp Terraform/1.5.5 (+https://www.terraform.io) Terraform Plugin SDK/2.10.1 gophercloud/v1.4.0
X-Auth-Token: ***: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Code: 200: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Headers:
Content-Type: application/json
Date: Wed, 23 Aug 2023 10:06:29 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000
Vary: Accept-Encoding
Via: 1.1 network.region.cloud.example.com
X-Openstack-Request-Id: req-1b02e0f3-442d-4b85-a9ce-40a765d05fb5: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Body: {
  "ports": [
    {
      "admin_state_up": true,
      "allowed_address_pairs": [
        {
          "ip_address": "10.3.4.0/24",
          "mac_address": "fa:16:3e:3a:58:ec"
        }
      ],
      "binding:vnic_type": "normal",
      "created_at": "2023-08-23T10:06:12Z",
      "description": "",
      "device_id": "",
      "device_owner": "",
      "dns_assignment": [],
      "dns_name": "",
      "extra_dhcp_opts": [],
      "fixed_ips": [],
      "id": "9b37978b-ed53-41c2-983f-31570eb88259",
      "mac_address": "fa:16:3e:3a:58:ec",
      "name": "vpn",
      "network_id": "f946cedc-94d1-4bde-a680-f59d615ad2e3",
      "port_security_enabled": true,
      "project_id": "REDACTED",
      "revision_number": 1,
      "security_groups": [
        "87acb073-5123-4473-b33b-fc78f522c6b8"
      ],
      "status": "DOWN",
      "tags": [],
      "tenant_id": "REDACTED",
      "updated_at": "2023-08-23T10:06:12Z"
    }
  ]
}: timestamp=2023-08-23T12:06:29.983+0200
[...]

nikParasyr commented 1 year ago

From the Neutron docs I see that port creation returns 201. Also, I am unable to reproduce this in my environment:

2023-08-24T15:50:43.933+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:43 [DEBUG] OpenStack Request Body: {
  "port": {
    "admin_state_up": true,
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a"
  }
}: timestamp=2023-08-24T15:50:43.933+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Code: 201: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Headers:
Content-Length: 699
Content-Type: application/json
Date: Thu, 24 Aug 2023 13:50:44 GMT
X-Openstack-Request-Id: req-7eb97add-533b-46f0-a0d3-31bae4ac65e9: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO]  provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Body: {
  "port": {
    "admin_state_up": true,
    "allowed_address_pairs": [],
    "binding:vnic_type": "normal",
    "created_at": "2023-08-24T13:50:44Z",
    "description": "",
    "device_id": "",
    "device_owner": "",
    "extra_dhcp_opts": [],
    "fixed_ips": [
      {
        "ip_address": "192.168.1.102",
        "subnet_id": "60882bd0-9597-4433-b841-aad6868d82b7"
      }
    ],
    "id": "0c092325-05ee-4dfd-85ec-a9571a37310a",
    "mac_address": "fa:16:1e:c7:be:fe",
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a",
    "port_security_enabled": true,
    "project_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "revision_number": 1,
    "security_groups": [
      "d6e94844-3231-42ca-bd35-7cc1a68bd095"
    ],
    "status": "DOWN",
    "tags": [],
    "tenant_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "updated_at": "2023-08-24T13:50:44Z"
  }
}: timestamp=2023-08-24T15:50:44.505+0200

So fixed_ips is already populated in the response and is written to the state correctly.

@frittentheke what behavior do you get via the cli?

❯ openstack port create --network 157c19ff-a568-45bc-88a7-0ec62d5a7a7a vpn
+-------------------------+-----------------------------------------------------------------------------+
| Field                   | Value                                                                       |
+-------------------------+-----------------------------------------------------------------------------+
| admin_state_up          | UP                                                                          |
| allowed_address_pairs   |                                                                             |
| binding_host_id         | None                                                                        |
| binding_profile         | None                                                                        |
| binding_vif_details     | None                                                                        |
| binding_vif_type        | None                                                                        |
| binding_vnic_type       | normal                                                                      |
| created_at              | 2023-08-24T14:13:46Z                                                        |
| data_plane_status       | None                                                                        |
| description             |                                                                             |
| device_id               |                                                                             |
| device_owner            |                                                                             |
| device_profile          | None                                                                        |
| dns_assignment          | None                                                                        |
| dns_domain              | None                                                                        |
| dns_name                | None                                                                        |
| extra_dhcp_opts         |                                                                             |
| fixed_ips               | ip_address='192.168.1.88', subnet_id='60882bd0-9597-4433-b841-aad6868d82b7' |
| id                      | 65822420-e3f9-47d2-9b20-63cb6dd9dd4c                                        |
| ip_allocation           | None                                                                        |
| mac_address             | fa:16:1e:ad:45:b2                                                           |
| name                    | vpn                                                                         |
| network_id              | 157c19ff-a568-45bc-88a7-0ec62d5a7a7a                                        |
| numa_affinity_policy    | None                                                                        |
| port_security_enabled   | True                                                                        |
| project_id              | ed498e81f0cc448bae0ad4f8f21bf67f                                            |
| propagate_uplink_status | None                                                                        |
| qos_network_policy_id   | None                                                                        |
| qos_policy_id           | None                                                                        |
| resource_request        | None                                                                        |
| revision_number         | 1                                                                           |
| security_group_ids      | d6e94844-3231-42ca-bd35-7cc1a68bd095                                        |
| status                  | DOWN                                                                        |
| tags                    |                                                                             |
| trunk_details           | None                                                                        |
| updated_at              | 2023-08-24T14:13:46Z                                                        |
+-------------------------+-----------------------------------------------------------------------------+

Also, are you aware of any specific neutron/dhcp configuration in your OpenStack environment? Anything that could make a port get an IP after a delay?

frittentheke commented 1 year ago

Maybe a few basics:

I am running the OpenStack Yoga release
Neutron uses the linuxbridge driver with HA enabled (l3_ha=True), max_l3_agents_per_router=3 and dhcp_agents_per_network=3. So maybe the RPC takes a little longer if there are many things to do / apply to the same router?

Looking at https://github.com/openstack/neutron/blob/5d97b13c7978c70673d1c886f0c49319076fdec5/neutron/db/models_v2.py#L110 makes me wonder if the port object might be returned without the IPAllocations if they are not already there when the subquery happens?

Diving into how an API call to create a port is distributed is kind of a rabbit hole ...

There is just so much code dealing with ports and their IPs ...

And I believe some of this is done asynchronously, racing the API response for the newly created port and its all_fixed_ips field.

frittentheke commented 1 year ago

@nikParasyr Is there any way I could assist more with this issue? Is this potentially even a bug with Neutron returning the port creation response prematurely?

nikParasyr commented 1 year ago

@frittentheke I'm not sure how to tackle this tbh. I'm also busy with some personal stuff for the next 2 weeks.

> Is this potentially even a bug with Neutron returning the port creation response prematurely?

It could be, but I'm not 100% sure. (We also run l3_ha x3 at our site and I get an IP instantly.)

We could potentially add a wait till the fixed_ip is populated.

@kayrus any ideas?

frittentheke commented 1 year ago

@nikParasyr thanks again for digging into this issue! Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially return the port without fixed_ips populated?

I raised a bug with Neutron https://bugs.launchpad.net/neutron/+bug/2035230 to ask if this behavior is expected. Would not want to add polling code and timeouts to the provider if this was an API issue in the first place.

nikParasyr commented 1 year ago

> Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially return the port without fixed_ips populated?

The bug you opened is one way, and there is already a response. Otherwise, IRC channels are also an option: https://docs.openstack.org/contributors/common/irc.html

frittentheke commented 11 months ago

@nikParasyr ... did you see https://bugs.launchpad.net/neutron/+bug/2035230/comments/3 ? So in short: it's expected that the fixed_ips are not initially returned and need to be waited for.

Do you see any chance this could be fixed?

nikParasyr commented 11 months ago

@frittentheke are you creating the subnet in the same run? and if so can you add a depends_on on the port resource for the subnet, or alternatively define subnet_id in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the fixed_ip block)?

From their response, this is expected if the subnet is not created. The above 2 options should force the port creation to happen after the subnet creation.

In the meantime I'll try to find some time to check whether we can/should add a "wait" to the port resource.

nikParasyr commented 11 months ago

@frittentheke were you successful when adding depends_on / defining subnet_id ?

mnaser commented 11 months ago

I think a reason for this could be if you use routed provider networks which delegate the selection of the IP address until it's scheduled. You'll see something like ip_allocation set to deferred in this case.

https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html

Could this be the case?

frittentheke commented 10 months ago

> @frittentheke are you creating the subnet in the same run? and if so can you add a depends_on on the port resource for the subnet, or alternatively define subnet_id in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the fixed_ip block)?
>
> From their response, this is expected if the subnet is not created. The above 2 options should force the port creation to happen after the subnet creation.

@nikParasyr yes, we are creating a number of networking resources in a single run. So a router, multiple networks and subnets, ...

The issue does not occur 100% of the time, so it's a race condition. When it occurs, it's enough to replace the port resource, which then receives the "missing" all_fixed_ips when it is the only resource being created.

So testing with depends_on is not that easy. Maybe(tm) it does help as it causes some more serialization, but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

> In the meantime I'll try to find some time to check whether we can/should add a "wait" to the port resource.

That would be awesome. I was thinking that implementing support for refresh on this field might also be sensible, as this data might change?

> I think a reason for this could be if you use routed provider networks which delegate the selection of the IP address until it's scheduled. You'll see something like ip_allocation set to deferred in this case.
>
> https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
>
> Could this be the case?

@mnaser Thanks for diving into this issue! This might just be another case in which the ips are not returned with the initial resource create response, but we are not using that.

nikParasyr commented 10 months ago

> but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

I think it will tackle the root cause, which according to the Neutron people on Launchpad is:

> This result is something expected if the network where the port is created has no subnets

I've actually had deployments where this behavior occurred.

Explaining:

Currently you have this code:

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]
}

This only tells TF that the port resource depends on the network resource; it says nothing about the subnet. Based on the dependency graph, TF will parallelize the creation of the subnet AND the port => their creation is triggered "at the same time", meaning you are in a race condition (which you have noticed). Sometimes it will work because, internally on the Neutron level, the subnet is created before the port and thus you get an IP; other times it is the opposite and you won't get an IP. If you look at your terraform apply output you will probably see something like:

creating network resource
...
network resource **created** (ID=blah_blah)
creating port resource
creating subnet resource <= (triggered at the same time, you are in a race condition)
...
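
The layout described above can be sketched as follows. The network and subnet resources, their names and the CIDR are hypothetical (the original configuration passes the network in via var.network_id); the point is only that nothing in the port references the subnet, so Terraform is free to create both in parallel.

```hcl
# Hypothetical network and subnet, for illustration only.
resource "openstack_networking_network_v2" "example" {
  name = "example"
}

resource "openstack_networking_subnet_v2" "example" {
  name       = "example"
  network_id = openstack_networking_network_v2.example.id
  cidr       = "192.168.1.0/24"
  ip_version = 4
}

resource "openstack_networking_port_v2" "vpn" {
  name = "vpn"

  # Only the network is referenced: the port and the subnet both depend on
  # the network, but not on each other.
  network_id = openstack_networking_network_v2.example.id
}
```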

If you switch your port resource to:

resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  # Referencing the subnet makes the port depend on it.
  fixed_ip {
    subnet_id = openstack_networking_subnet_v2.name-here.id
  }
}

This tells TF that the port resource depends on the subnet => the TF dependency graph will force the subnet creation to finish before it triggers the port creation. So this should remove the race condition. Your terraform apply logs will look like:

creating network resource
...
network resource **created** (ID=blah_blah)
creating subnet resource
...
subnet resource **created** (ID= bluh bluh)
creating port resource <= triggered after the subnet is created and therefore based on neutron people input your port will get an ip now. there is no race condition
...

depends_on will have the same result but it is a bit more pesky to use when you have for_each etc to create multiple resources.
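
For completeness, a minimal sketch of the depends_on variant. The subnet resource name (`example`) is a placeholder, just as `name-here` is above; everything else is the port from the original report.

```hcl
resource "openstack_networking_port_v2" "vpn" {
  name       = "vpn"
  network_id = var.network_id

  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  # Explicit ordering: create the port only after the subnet exists.
  # "example" is a hypothetical subnet resource name.
  depends_on = [openstack_networking_subnet_v2.example]
}
```

Unlike the fixed_ip block, this only orders creation; it does not pin the port to a particular subnet.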

Given the Neutron people's input, similar behaviors I've noticed, and your input (race condition + not using deferred), I am rather certain the above will fix it. I would prefer that you test the above solution before we consider adding a wait.

frittentheke commented 10 months ago

@nikParasyr sorry for the delay here. I added the subnet_id reference now and the issue seems to not occur anymore. So you were indeed correct.

Thanks for all your time and deep-diving into this mess ;-)

nikParasyr commented 10 months ago

Thank you as well for the patience. I’ve updated the docs so hopefully it will be clear for other users. I’ll close the issue.
