Closed frittentheke closed 10 months ago
@frittentheke thanks for reporting this.
I speculate that this is occurring because the OpenStack API responds to the port creation before an IP gets allocated by the DHCP server, and the following GET for the port also happens before an IP is allocated.
Can you run the following scenario for me:
@nikParasyr thanks for the fast response!
I did run the scenario you asked for. Full disclosure, there are quite a few resources spawned and the port for the instance running the VPN services is created by a terraform module .... but here we go:
`all_fixed_ips` being empty. This is the state for the resource:
terraform state show module.vpn-server.openstack_networking_port_v2.vpn
# module.vpn-server.openstack_networking_port_v2.vpn:
resource "openstack_networking_port_v2" "vpn" {
    admin_state_up         = true
    all_fixed_ips          = []
    all_security_group_ids = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    all_tags               = []
    dns_assignment         = []
    id                     = "9b37978b-ed53-41c2-983f-31570eb88259"
    mac_address            = "fa:16:3e:3a:58:ec"
    name                   = "vpn"
    network_id             = "f946cedc-94d1-4bde-a680-f59d615ad2e3"
    port_security_enabled  = true
    region                 = "fra"
    security_group_ids     = [
        "87acb073-5123-4473-b33b-fc78f522c6b8",
    ]
    tenant_id              = "REDACTED"

    allowed_address_pairs {
        ip_address = "10.3.4.0/24"
    }

    binding {
        vif_details = {}
        vnic_type   = "normal"
    }
}
so `all_fixed_ips` is empty.
3. Running terraform apply again does not work, it's ending up with the same error about `all_fixed_ips`
4. State remains unchanged.
But we dug a little deeper:
1. `terraform refresh` does NOT update the `all_fixed_ips` (if called implicitly by the apply or explicitly)
2. `terraform apply -target module.vpn-server.openstack_networking_port_v2.vpn` does "work", but finds nothing that needs changing. So also then the field is not populated.
3. Certainly the `terraform state rm module.vpn-server.openstack_networking_port_v2.vpn` which I did initially caused a new port to be created, leaving the first one dangling. But then `all_fixed_ips` was set, so the other resource referring to it worked fine.
4. As I said, there are quite a few resources in this terraform code, so I believe this is the reason the port resource does not "work" the same way on the first apply doing everything as it does on the second attempt, where only this port and the static route are changed / created via the API. Read: convergence time.
This is the terraform debug output / openstack API response to the port creation (initial terraform apply):
[...]
2023-08-23T12:06:29.893+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request URL: GET https://network.regsion.cloud.example.com/v2.0/ports?id=9b37978b-ed53-41c2-983f-31570eb88259: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.893+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Request Headers:
Accept: application/json
Cache-Control: no-cache
User-Agent: HashiCorp Terraform/1.5.5 (+https://www.terraform.io) Terraform Plugin SDK/2.10.1 gophercloud/v1.4.0
X-Auth-Token: ***: timestamp=2023-08-23T12:06:29.893+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Code: 200: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Headers:
Content-Type: application/json
Date: Wed, 23 Aug 2023 10:06:29 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000
Vary: Accept-Encoding
Via: 1.1 network.region.cloud.example.com
X-Openstack-Request-Id: req-1b02e0f3-442d-4b85-a9ce-40a765d05fb5: timestamp=2023-08-23T12:06:29.983+0200
2023-08-23T12:06:29.983+0200 [INFO] provider.terraform-provider-openstack_v1.52.1: 2023/08/23 12:06:29 [DEBUG] OpenStack Response Body: {
  "ports": [
    {
      "admin_state_up": true,
      "allowed_address_pairs": [
        {
          "ip_address": "10.3.4.0/24",
          "mac_address": "fa:16:3e:3a:58:ec"
        }
      ],
      "binding:vnic_type": "normal",
      "created_at": "2023-08-23T10:06:12Z",
      "description": "",
      "device_id": "",
      "device_owner": "",
      "dns_assignment": [],
      "dns_name": "",
      "extra_dhcp_opts": [],
      "fixed_ips": [],
      "id": "9b37978b-ed53-41c2-983f-31570eb88259",
      "mac_address": "fa:16:3e:3a:58:ec",
      "name": "vpn",
      "network_id": "f946cedc-94d1-4bde-a680-f59d615ad2e3",
      "port_security_enabled": true,
      "project_id": "REDACTED",
      "revision_number": 1,
      "security_groups": [
        "87acb073-5123-4473-b33b-fc78f522c6b8"
      ],
      "status": "DOWN",
      "tags": [],
      "tenant_id": "REDACTED",
      "updated_at": "2023-08-23T10:06:12Z"
    }
  ]
}: timestamp=2023-08-23T12:06:29.983+0200
[...]
From the Neutron docs I see that port creation returns 201. Also, I am unable to reproduce this in my environment:
2023-08-24T15:50:43.933+0200 [INFO] provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:43 [DEBUG] OpenStack Request Body: {
  "port": {
    "admin_state_up": true,
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a"
  }
}: timestamp=2023-08-24T15:50:43.933+0200
2023-08-24T15:50:44.505+0200 [INFO] provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Code: 201: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO] provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Headers:
Content-Length: 699
Content-Type: application/json
Date: Thu, 24 Aug 2023 13:50:44 GMT
X-Openstack-Request-Id: req-7eb97add-533b-46f0-a0d3-31bae4ac65e9: timestamp=2023-08-24T15:50:44.505+0200
2023-08-24T15:50:44.505+0200 [INFO] provider.terraform-provider-openstack_v1.46.0: 2023/08/24 15:50:44 [DEBUG] OpenStack Response Body: {
  "port": {
    "admin_state_up": true,
    "allowed_address_pairs": [],
    "binding:vnic_type": "normal",
    "created_at": "2023-08-24T13:50:44Z",
    "description": "",
    "device_id": "",
    "device_owner": "",
    "extra_dhcp_opts": [],
    "fixed_ips": [
      {
        "ip_address": "192.168.1.102",
        "subnet_id": "60882bd0-9597-4433-b841-aad6868d82b7"
      }
    ],
    "id": "0c092325-05ee-4dfd-85ec-a9571a37310a",
    "mac_address": "fa:16:1e:c7:be:fe",
    "name": "vpn",
    "network_id": "157c19ff-a568-45bc-88a7-0ec62d5a7a7a",
    "port_security_enabled": true,
    "project_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "revision_number": 1,
    "security_groups": [
      "d6e94844-3231-42ca-bd35-7cc1a68bd095"
    ],
    "status": "DOWN",
    "tags": [],
    "tenant_id": "ed498e81f0cc448bae0ad4f8f21bf67f",
    "updated_at": "2023-08-24T13:50:44Z"
  }
}: timestamp=2023-08-24T15:50:44.505+0200
So fixed-ip is already populated in the response, and is written to the state correctly.
@frittentheke what behavior do you get via the cli?
❯ openstack port create --network 157c19ff-a568-45bc-88a7-0ec62d5a7a7a vpn
+-------------------------+-----------------------------------------------------------------------------+
| Field | Value |
+-------------------------+-----------------------------------------------------------------------------+
| admin_state_up | UP |
| allowed_address_pairs | |
| binding_host_id | None |
| binding_profile | None |
| binding_vif_details | None |
| binding_vif_type | None |
| binding_vnic_type | normal |
| created_at | 2023-08-24T14:13:46Z |
| data_plane_status | None |
| description | |
| device_id | |
| device_owner | |
| device_profile | None |
| dns_assignment | None |
| dns_domain | None |
| dns_name | None |
| extra_dhcp_opts | |
| fixed_ips | ip_address='192.168.1.88', subnet_id='60882bd0-9597-4433-b841-aad6868d82b7' |
| id | 65822420-e3f9-47d2-9b20-63cb6dd9dd4c |
| ip_allocation | None |
| mac_address | fa:16:1e:ad:45:b2 |
| name | vpn |
| network_id | 157c19ff-a568-45bc-88a7-0ec62d5a7a7a |
| numa_affinity_policy | None |
| port_security_enabled | True |
| project_id | ed498e81f0cc448bae0ad4f8f21bf67f |
| propagate_uplink_status | None |
| qos_network_policy_id | None |
| qos_policy_id | None |
| resource_request | None |
| revision_number | 1 |
| security_group_ids | d6e94844-3231-42ca-bd35-7cc1a68bd095 |
| status | DOWN |
| tags | |
| trunk_details | None |
| updated_at | 2023-08-24T14:13:46Z |
+-------------------------+-----------------------------------------------------------------------------+
Also, are you aware of any specific Neutron/DHCP configuration in your OpenStack environment? Anything that could cause a port to get an IP only after a delay?
Maybe a few basics:
* `l3_ha=True`, with `max_l3_agents_per_router=3` and `dhcp_agents_per_network=3`
  * so maybe the RPC takes a little longer if there are many things to do / apply to the same router?

Looking at https://github.com/openstack/neutron/blob/5d97b13c7978c70673d1c886f0c49319076fdec5/neutron/db/models_v2.py#L110 makes me wonder if the port object might be returned without the `IPAllocations` if they are not already there when the subquery happens?
Diving into how an API call to create a port is distributed is kind of a rabbit hole ...
There is just so much code dealing with ports and their IPs ...
and I believe some of this is done asynchronously, racing the API response for the newly created port and its `all_fixed_ips` field.
@nikParasyr Is there any way I could assist more with this issue? Is this potentially even a bug with Neutron returning the port creation response prematurely?
@frittentheke I'm not sure how to tackle this tbh. I'm also busy with some personal stuff for the next 2 weeks.
> Is this potentially even a bug with Neutron returning the port creation response prematurely?

It could be, but I'm not 100% sure. (We also run l3_ha x3 at our site and I get an IP instantly.)
We could potentially add a wait till the fixed_ip is populated.
@kayrus any ideas?
@nikParasyr thanks again for digging into this issue!
Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially return the port without `fixed_ips` populated?
I raised a bug with Neutron https://bugs.launchpad.net/neutron/+bug/2035230 to ask if this behavior is expected. Would not want to add polling code and timeouts to the provider if this was an API issue in the first place.
> Is there a way to ask a Neutron dev if this is intended behavior for the API to potentially return the port without fixed_ips populated?

The bug you opened is one way, and there is already a response. Otherwise IRC channels are also an option: https://docs.openstack.org/contributors/common/irc.html
@nikParasyr ... did you see https://bugs.launchpad.net/neutron/+bug/2035230/comments/3 ? So in short: it's expected that the fixed_ips are not initially returned and need to be waited for.
Do you see any chance this could be fixed?
@frittentheke are you creating the subnet in the same run? If so, can you add a `depends_on` on the port resource for the subnet, or alternatively define `subnet_id` in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the `fixed_ip` block)?
From their response, this is expected if the subnet is not created. The above 2 options should force the port creation to happen after the subnet creation.
In the meantime I'll try to find some time to check whether we can/should add a "wait" to the port resource.
@frittentheke were you successful when adding `depends_on` / defining `subnet_id`?
I think a reason for this could be if you use routed provider networks, which delegate the selection of the IP address until it's scheduled. You'll see something like `ip_allocation` set to `deferred` in this case.
https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
Could this be the case?
> @frittentheke are you creating the subnet in the same run? and if so can you add a `depends_on` on the port resource for the subnet, or alternatively define subnet_id in the port resource => https://registry.terraform.io/providers/terraform-provider-openstack/openstack/latest/docs/resources/networking_port_v2#subnet_id (inside the `fixed_ip` block)?
>
> From their response this is expected if the subnet is not created. the above 2 options should force the port creation to happen after the subnet creation.
@nikParasyr yes, we are creating a number of networking resources in a single run. So a router, multiple networks and subnets, ...
The issue is not occurring 100% of the time, so it's a race condition. If it occurs, it's enough to replace the port resource, which then receives the "missing" `all_fixed_ips` when it is the only resource being created.
So testing with `depends_on` is not that easy. Maybe(tm) it does help as it causes some more serialization, but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

> In the meantime ill try to find some time to check whether we can/should add a "wait" to the port resource.

That would be awesome. I was thinking that implementing support for refresh on this field might also be sensible, as this data might change?
> I think a reason for this could be if you use routed provider networks which delegate the selection of the IP address until it's scheduled. You'll see something like `ip_allocation` set to `deferred` in this case.
>
> https://docs.openstack.org/neutron/latest/admin/config-routed-networks.html
>
> Could this be the case?
@mnaser Thanks for diving into this issue! This might just be another case in which the ips are not returned with the initial resource create response, but we are not using that.
> but likely it's not tackling the root cause of a deferred / delayed IP address allocation.

I think it will tackle the root cause which, based on the Neutron people's response on Launchpad, is:

> This result is something expected if the network where the port is created has no subnets
I've actually had deployments where this behavior occurred.
Explaining:
Currently you have this code:
resource "openstack_networking_port_v2" "vpn" {
  name               = "vpn"
  network_id         = var.network_id
  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]
}
This only tells TF that the port resource depends on the network resource; it says nothing about the subnet. Based on the dependency graph, TF will parallelize the creation of the subnet AND the port => their creation will be triggered "at the same time", meaning you are in a race condition (which you have noticed). Sometimes it will work because internally, at the Neutron level, the subnet creation happens before the port creation and thus you will get an IP; other times it will be the opposite and you won't get an IP. If you look at your terraform apply output you will probably see something like:
creating network resource
...
network resource **created** (ID=blah_blah)
creating port resource
creating subnet resource <= (triggered at the same time, you are in a race condition)
...
If you switch your port resource to:
resource "openstack_networking_port_v2" "vpn" {
  name               = "vpn"
  network_id         = var.network_id
  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  fixed_ip {
    subnet_id = openstack_networking_subnet_v2.name-here.id
  }
}
This makes it known to TF that the port resource depends on the subnet => the TF dependency graph will force the subnet creation to finish before the port creation is triggered. So this should remove the race condition. Your terraform apply logs will look like:
creating network resource
...
network resource **created** (ID=blah_blah)
creating subnet resource
...
subnet resource **created** (ID= bluh bluh)
creating port resource <= triggered after the subnet is created and therefore based on neutron people input your port will get an ip now. there is no race condition
...
`depends_on` will have the same result, but it is a bit more pesky to use when you have for_each etc. to create multiple resources.
Given the Neutron people's input, similar behaviors I've noticed, and your input (race condition + not using `deferred`), I am rather certain the above will fix it. I would prefer if you could test the above solution before we consider adding a `wait`.
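For completeness, the `depends_on` variant mentioned above could look roughly like the sketch below. The subnet resource name `openstack_networking_subnet_v2.vpn` is an assumption made for illustration; the rest is taken from the port resource shown earlier.

```hcl
# Sketch only: "openstack_networking_subnet_v2.vpn" is a hypothetical
# resource name; substitute the name of your actual subnet resource.
resource "openstack_networking_port_v2" "vpn" {
  name               = "vpn"
  network_id         = var.network_id
  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  # Explicit ordering: the port is only created after the subnet exists,
  # so Neutron has a subnet to allocate the fixed IP from.
  depends_on = [openstack_networking_subnet_v2.vpn]
}
```

Unlike the `fixed_ip { subnet_id = ... }` form, this only affects ordering and leaves the choice of subnet to Neutron.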
@nikParasyr sorry for the delay here. I added the subnet_id reference now and the issue seems to not occur anymore. So you were indeed correct.
Thanks for all your time and deep-diving into this mess ;-)
Thank you as well for the patience. I’ve updated the docs so hopefully it will be clear for other users. I’ll close the issue.
Terraform Version
Terraform v1.5.5
Affected Resource(s)
Terraform Configuration Files
Debug Output
Panic Output
Expected Behavior
The `openstack_networking_port_v2` is used as the interface for an instance providing a VPN service. A port is used in order to have a fixed / known IP address. I would expect the IP of the port_v2 to be returned as the first element in the `all_fixed_ips` array, to then be used as the next_hop for a static route.

In short: I want to route the network behind the VPN to the corresponding instance via a static route.
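The original configuration is not included here, so the following is only a sketch of the kind of static route described above. `var.router_id` and the resource name `vpn` on the route are assumptions, and the destination CIDR is borrowed from the `allowed_address_pairs` value in the state output on the assumption that it is the network behind the VPN.

```hcl
# Sketch only: var.router_id and the route's name are hypothetical.
resource "openstack_networking_router_route_v2" "vpn" {
  router_id        = var.router_id
  destination_cidr = "10.3.4.0/24"

  # Indexing fails at apply time when all_fixed_ips comes back empty,
  # which is the situation this issue describes.
  next_hop = openstack_networking_port_v2.vpn.all_fixed_ips[0]
}
```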
Actual Behavior
There are two errors thrown in relation to the port just created:
causing the terraform run to abort with an error.
Steps to Reproduce
1. `terraform apply`: the error is reached
2. `terraform state rm openstack_networking_port_v2.vpn`
3. `terraform apply` again and things work just fine.

It seems the port resource takes longer to be fully created and initialized and the provider moves on too early. Just a refresh on the just created resource? Or some other indication of the port being actually done provisioning has to be tracked via the API?
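Independent of any provider change, one way to at least surface the problem on the port itself, rather than on whatever consumes `all_fixed_ips`, might be a `postcondition`. This is a sketch that assumes Terraform 1.2 or later (where `postcondition` blocks are available); it does not wait or retry, it only fails the port with a clearer message.

```hcl
# Sketch only: detects the empty all_fixed_ips case, does not fix it.
resource "openstack_networking_port_v2" "vpn" {
  name               = "vpn"
  network_id         = var.network_id
  admin_state_up     = "true"
  security_group_ids = [openstack_networking_secgroup_v2.vpn.id]

  lifecycle {
    postcondition {
      # Evaluated after the port is created or refreshed.
      condition     = length(self.all_fixed_ips) > 0
      error_message = "Port was created without a fixed IP; was the subnet created before the port?"
    }
  }
}
```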