vexxhost / atmosphere

Simple & easy private cloud platform featuring VMs, Kubernetes & bare-metal

Manila was Unable to create Share #589

Closed migmeneses closed 10 months ago

migmeneses commented 1 year ago

Via the Dashboard, I created a share with a size of 10 GB. The creation process takes a long time (around 5-10 minutes), never finishes, and ends in an error state.

Steps:

  1. creating a new share:

    2023-09-18 17:29:53.835 7 INFO manila.share.manager [None req-9d440590-8e0f-4f84-aabd-689d8dd8fcea 57bccfa2a8084b79a71b29db37d804ad 1493cf1be9bb40baa5ee249e14fab56f - - - -] Using preexisting share server: '7b861199-4fae-4f9a-9ade-96a4083cf2c0'

    It looks like it is going to re-use an existing instance

  2. It falls into error:

    2023-09-18 17:35:00.546 7 ERROR manila.share.manager [None req-9d440590-8e0f-4f84-aabd-689d8dd8fcea 57bccfa2a8084b79a71b29db37d804ad 1493cf1be9bb40baa5ee249e14fab56f - - - -] Share instance 0be6585c-1c8d-4ec6-b7de-401af2e139f8 failed on creation.: manila.exception.ServiceInstanceUnavailable: Service instance is not available.
    2023-09-18 17:35:00.547 7 WARNING manila.share.manager [None req-9d440590-8e0f-4f84-aabd-689d8dd8fcea 57bccfa2a8084b79a71b29db37d804ad 1493cf1be9bb40baa5ee249e14fab56f - - - -] Share instance information in exception can not be written to db because it contains {} and it is not a dictionary.: manila.exception.ServiceInstanceUnavailable: Service instance is not available.
    2023-09-18 17:35:00.565 7 INFO manila.message.api [None req-9d440590-8e0f-4f84-aabd-689d8dd8fcea 57bccfa2a8084b79a71b29db37d804ad 1493cf1be9bb40baa5ee249e14fab56f - - - -] Creating message record for request_id = req-9d440590-8e0f-4f84-aabd-689d8dd8fcea
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server [None req-9d440590-8e0f-4f84-aabd-689d8dd8fcea 57bccfa2a8084b79a71b29db37d804ad 1493cf1be9bb40baa5ee249e14fab56f - - - -] Exception during message handling: manila.exception.ServiceInstanceUnavailable: Service instance is not available.
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/manila/share/manager.py", line 220, in wrapped
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     return f(self, *args, **kwargs)
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/manila/utils.py", line 579, in wrapper
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     return func(self, *args, **kwargs)
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/manila/share/manager.py", line 2123, in create_share_instance
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     with excutils.save_and_reraise_exception():
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     self.force_reraise()
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     raise self.value
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/manila/share/manager.py", line 2111, in create_share_instance
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     export_locations = self.driver.create_share(
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.10/site-packages/manila/share/drivers/generic.py", line 113, in wrap
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server     raise exception.ServiceInstanceUnavailable()
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server manila.exception.ServiceInstanceUnavailable: Service instance is not available.
    2023-09-18 17:35:00.580 7 ERROR oslo_messaging.rpc.server

It looks like the error happens when Manila tries to re-use an existing instance.

migmeneses commented 1 year ago

The share ID we are looking for is 06633a54-8ec6-457b-be4e-a60b103ebb7a. It is the one with which we could reproduce the issue.

openstack share instance list -c Status -c ID -c "Share ID" -c "Share Server ID" 
+--------------------------------------+--------------------------------------+-----------------+--------------------------------------+
| ID                                   | Share ID                             | Status          | Share Server ID                      |
+--------------------------------------+--------------------------------------+-----------------+--------------------------------------+
| 0be6585c-1c8d-4ec6-b7de-401af2e139f8 | 06633a54-8ec6-457b-be4e-a60b103ebb7a | error           | 7b861199-4fae-4f9a-9ade-96a4083cf2c0 |
| 0d4ad385-9bec-42b0-b98f-27d8714ecb2c | 1e1d41e5-d2b3-42ca-926b-eea9f15de2f2 | extending_error | 1c4e6020-a99e-4e49-ad91-c08b0403738a |
| 81d0fd88-5e4b-44ab-b447-52c3f614404d | 232f1603-089f-43ea-8b47-606599394d21 | error           | 0dd5aa13-485c-4931-abbb-0e74d1fdf731 |
| f58ed3a2-d5c9-4cd2-8080-2eed0bc1db27 | 35b4cdb9-8ae6-41df-98fc-1ed5b65f67e5 | error           | 18a691fd-3b1d-45f0-9d4d-5c49304324b5 |
| 5cfaf4a6-3460-4991-9eb4-39db6a9c04bd | 5976c3b0-cf3a-4ea0-82c4-81b40ec4cdc1 | available       | 7b861199-4fae-4f9a-9ade-96a4083cf2c0 |
| b13e43c8-ee67-44bd-b125-d728932464cc | 5e192a0c-7b4f-4c0f-93f2-591d9b2f2cf7 | extending_error | 54436e7a-2d25-4bd5-be36-9e3049cfad2c |
| 61f541fa-0070-4d99-a0b9-bdd29bb01e3c | 82ffd590-76d1-457e-be3c-29d1591eb2b1 | error           | 1c4e6020-a99e-4e49-ad91-c08b0403738a |
| db69ede0-baae-4528-961a-1b8ac638e885 | 8381dee1-0ed2-4cb9-a164-42475883f7d2 | available       | 7b861199-4fae-4f9a-9ade-96a4083cf2c0 |
| 4bf0a51f-21c9-4cd4-a8e3-624c9602407b | 97aafc2d-1a32-4193-b5a6-afc7b2707725 | extending_error | 864152f0-8e54-408f-b32c-b90734d5145c |
| ce4b2764-80c2-48b2-8c55-40c705643f2e | 9ddc9370-3d74-443a-9f9c-bc0657e4d837 | error           | 7b861199-4fae-4f9a-9ade-96a4083cf2c0 |
| 34c536db-5b1a-483c-b349-ebcb6b6fe43c | c00351a8-ac1d-4c49-bb12-57b10c062ddd | extending_error | 864152f0-8e54-408f-b32c-b90734d5145c |
| 000ec30f-c6a9-40f7-ab2c-bd568eff649a | cd8fff13-3112-4166-b6d1-e6c0e81aff1f | available       | 864152f0-8e54-408f-b32c-b90734d5145c |
| 6781e150-5d6f-47fb-988d-6ca99acd09f5 | d5bbab56-aed3-4a9c-8815-3258e34e4e26 | extending_error | 1c4e6020-a99e-4e49-ad91-c08b0403738a |
| 249c04fd-7134-4911-b8fb-b585933738b8 | e08bf7b3-ae61-47db-ab0b-d80b548b174a | available       | 864152f0-8e54-408f-b32c-b90734d5145c |
| 82074794-0128-4070-8384-c2dca95ba7d6 | e6eae36c-a36b-4dad-b04c-584def02da04 | available       | 54436e7a-2d25-4bd5-be36-9e3049cfad2c |
+--------------------------------------+--------------------------------------+-----------------+--------------------------------------+
openstack server list --all-project | grep 7b861199-4fae-4f9a-9ade-96a4083cf2c0
| e7dc4bc3-46ec-47bf-b256-2eaedffe478e | generic_7b861199-4fae-4f9a-9ade-96a4083cf2c0 | ACTIVE  | manila_service_network=10.254.0.37; sr-default=10.10.1.114                           | manila-service-image     | m1.manila      |

The instance is up and running (ACTIVE). I was able to SSH into it.

migmeneses commented 1 year ago

Ping and check that the SSH port is listening:

root@ctl1:~# ip netns exec qdhcp-6aac9402-b603-4a55-98e3-0d683374f2b8 ping -c 3 10.254.0.37
PING 10.254.0.37 (10.254.0.37) 56(84) bytes of data.
64 bytes from 10.254.0.37: icmp_seq=1 ttl=64 time=2.62 ms
64 bytes from 10.254.0.37: icmp_seq=2 ttl=64 time=0.755 ms
64 bytes from 10.254.0.37: icmp_seq=3 ttl=64 time=0.691 ms

--- 10.254.0.37 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2029ms
rtt min/avg/max/mdev = 0.691/1.356/2.624/0.896 ms

root@ctl1:~# ip netns exec qdhcp-6aac9402-b603-4a55-98e3-0d683374f2b8 nc -vz 10.254.0.37 22
Connection to 10.254.0.37 22 port [tcp/ssh] succeeded!

Inside the instance attached volumes:

manila@ubuntu:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            956M     0  956M   0% /dev
tmpfs           198M  2.1M  196M   2% /run
/dev/vda1       2.7G  1.6G  952M  63% /
tmpfs           986M     0  986M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           986M     0  986M   0% /sys/fs/cgroup
/dev/vdb         98G   20G   73G  22% /shares/share-db69ede0-baae-4528-961a-1b8ac638e885
/dev/vdc         20G   24K   19G   1% /shares/share-5cfaf4a6-3460-4991-9eb4-39db6a9c04bd
tmpfs           198M     0  198M   0% /run/user/1000
manila@ubuntu:~$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda    252:0    0   20G  0 disk 
└─vda1 252:1    0  2.9G  0 part /
vdb    252:16   0  100G  0 disk /shares/share-db69ede0-baae-4528-961a-1b8ac638e885
vdc    252:32   0   40G  0 disk /shares/share-5cfaf4a6-3460-4991-9eb4-39db6a9c04bd

logs:

root@ubuntu:/var/log/samba# cat log.nmbd
[2023/08/18 08:05:59.722908,  0] ../../source3/nmbd/nmbd.c:901(main)
  nmbd version 4.15.13-Ubuntu started.
  Copyright Andrew Tridgell and the Samba Team 1992-2021
[2023/08/18 08:05:59.724571,  0] ../../lib/util/become_daemon.c:150(daemon_status)
  daemon_status: daemon 'nmbd' : No local IPv4 non-loopback interfaces available, waiting for interface ...
[2023/08/18 08:05:59.724586,  0] ../../source3/nmbd/nmbd_subnetdb.c:252(create_subnets)
  NOTE: NetBIOS name resolution is not supported for Internet Protocol Version 6 (IPv6).
[2023/08/18 08:06:49.315253,  0] ../../source3/nmbd/nmbd_become_lmb.c:398(become_local_master_stage2)
  *****

  Samba name server UBUNTU is now a local master browser for workgroup WORKGROUP on subnet 10.254.0.37

  *****
[2023/08/18 08:12:06.647508,  0] ../../source3/nmbd/nmbd_become_lmb.c:398(become_local_master_stage2)
  *****

  Samba name server UBUNTU is now a local master browser for workgroup WORKGROUP on subnet 10.10.1.114

ricolin commented 1 year ago

@migmeneses after testing, I think this behavior is most likely triggered by a temporary network timeout or block. My guess is that https://github.com/vexxhost/atmosphere/issues/547 suffered from it as well.

Here are a few tasks we can take on from here:

Let me check on the share type one. Regarding networks, can you also help me get the output of netstat -s and arp -a, plus the full journalctl, or any other data that you found weird? Thanks. You can share those as files through a direct message so we don't overload this page :)

mnaser commented 1 year ago

we were looking way too far :)

+-------------------------+------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                          |
+-------------------------+------------------------------------------------------------------------------------------------+
| admin_state_up          | UP                                                                                             |
| allowed_address_pairs   |                                                                                                |
| binding_host_id         | ctl1                                                                                           |
| binding_profile         |                                                                                                |
| binding_vif_details     |                                                                                                |
| binding_vif_type        | binding_failed                                                                                 |
| binding_vnic_type       | normal                                                                                         |
| created_at              | 2023-08-25T18:20:21Z                                                                           |
| data_plane_status       | None                                                                                           |
| description             |                                                                                                |
| device_id               | manila-share                                                                                   |
| device_owner            | manila:share                                                                                   |
| device_profile          | None                                                                                           |
| dns_assignment          | fqdn='host-10-254-0-10.openstacklocal.', hostname='host-10-254-0-10', ip_address='10.254.0.10' |
|                         | fqdn='host-10-254-0-28.openstacklocal.', hostname='host-10-254-0-28', ip_address='10.254.0.28' |
|                         | fqdn='host-10-254-0-40.openstacklocal.', hostname='host-10-254-0-40', ip_address='10.254.0.40' |
|                         | fqdn='host-10-254-0-59.openstacklocal.', hostname='host-10-254-0-59', ip_address='10.254.0.59' |
|                         | fqdn='host-10-254-0-78.openstacklocal.', hostname='host-10-254-0-78', ip_address='10.254.0.78' |
| dns_domain              |                                                                                                |
| dns_name                |                                                                                                |
| extra_dhcp_opts         |                                                                                                |
| fixed_ips               | ip_address='10.254.0.10', subnet_id='299d2e30-fd9d-40ca-aa0a-eb4bc57bf4ef'                     |
|                         | ip_address='10.254.0.28', subnet_id='742b3d58-4e21-4dc1-9591-5419695197c4'                     |
|                         | ip_address='10.254.0.40', subnet_id='493e4d1a-0d64-4dde-85de-cfa852f6d45f'                     |
|                         | ip_address='10.254.0.59', subnet_id='fb2f3007-6c3d-4300-a403-ccc9e9e40bd7'                     |
|                         | ip_address='10.254.0.78', subnet_id='d1bfda43-4209-45db-a52c-1e53ab669dba'                     |
| id                      | 2425d9ef-af85-4fbc-aaeb-b852c4dde1c5                                                           |
| ip_allocation           | immediate                                                                                      |
| mac_address             | fa:16:3e:6b:ba:75                                                                              |
| name                    |                                                                                                |
| network_id              | 6aac9402-b603-4a55-98e3-0d683374f2b8                                                           |
| numa_affinity_policy    | None                                                                                           |
| port_security_enabled   | False                                                                                          |
| project_id              | e6f5006edea941169a9771e748887742                                                               |
| propagate_uplink_status | None                                                                                           |
| qos_network_policy_id   | None                                                                                           |
| qos_policy_id           | None                                                                                           |
| resource_request        | None                                                                                           |
| revision_number         | 6                                                                                              |
| security_group_ids      |                                                                                                |
| status                  | DOWN                                                                                           |
| tags                    |                                                                                                |
| trunk_details           | None                                                                                           |
| updated_at              | 2023-09-13T10:39:51Z                                                                           |
+-------------------------+------------------------------------------------------------------------------------------------+

The port's binding_host_id is not the FQDN, so the port is failing to bind. We need to figure out where Manila sets that value and pass the FQDN through properly.
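
The symptoms visible in the port output can be checked mechanically. This is a minimal sketch, assuming the port has already been loaded into a plain dict with the same field names as the table; the `diagnose_port` helper and the "no dot means short hostname" heuristic are illustrative assumptions, not part of any OpenStack API:

```python
def diagnose_port(port: dict) -> list[str]:
    """Return human-readable findings for a Neutron port dict.

    Hypothetical helper: the field names mirror the port table above,
    and the caller is responsible for building the dict.
    """
    findings = []
    if port.get("binding_vif_type") == "binding_failed":
        findings.append("binding failed")
    host = port.get("binding_host_id", "")
    # Heuristic (assumption): a host name without a dot is a short name, not an FQDN.
    if host and "." not in host:
        findings.append(f"binding_host_id {host!r} looks like a short hostname")
    if port.get("status") == "DOWN":
        findings.append("port is DOWN")
    return findings


# Values copied from the port table above.
port = {
    "binding_host_id": "ctl1",
    "binding_vif_type": "binding_failed",
    "status": "DOWN",
}
for finding in diagnose_port(port):
    print(finding)
```

For the port above, all three checks fire, which matches the diagnosis in this comment.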

mnaser commented 1 year ago

Stinky.

https://github.com/openstack/manila/blob/stable/zed/manila/share/drivers/service_instance.py#L1012-L1033

Manila uses socket.gethostname() with no way of configuring this.
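
To see why this matters, compare the two standard-library calls on the affected host. The sketch below also shows the general shape of a configurable fallback; `pick_binding_host` and the `ctl1.example.com` name are hypothetical and are not Manila's actual code:

```python
import socket
from typing import Optional


def pick_binding_host(configured_host: Optional[str] = None) -> str:
    """Prefer an explicitly configured host name.

    Falls back to socket.gethostname(), which usually returns the short
    name (for example "ctl1") rather than the FQDN.
    """
    return configured_host or socket.gethostname()


if __name__ == "__main__":
    # Output depends on the machine this runs on.
    print("gethostname():", socket.gethostname())
    print("getfqdn():    ", socket.getfqdn())
    # "ctl1.example.com" is a made-up FQDN used only for illustration.
    print("configured:   ", pick_binding_host("ctl1.example.com"))
```

If the two calls print different names, any port whose binding host comes from `socket.gethostname()` will not match a host that is registered by FQDN.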

mnaser commented 1 year ago

Filed https://bugs.launchpad.net/manila/+bug/2037580

mnaser commented 1 year ago

Waiting for CI on https://review.opendev.org/c/openstack/manila/+/896692

mnaser commented 12 months ago

Once this lands, we will need to update CONF.host to be the FQDN when manila-share starts up.

mnaser commented 12 months ago

The Manila chart needs to set [DEFAULT]/host to the FQDN of the system. You can see how this is done in other services like Nova.

https://github.com/vexxhost/atmosphere/blob/ea7e98410b095244a11940f542f82b3fd9a8675c/charts/nova/templates/bin/_nova-compute-init.sh.tpl#L63-L69
https://github.com/vexxhost/atmosphere/blob/ea7e98410b095244a11940f542f82b3fd9a8675c/charts/nova/templates/bin/_nova-compute.sh.tpl#L23-L25
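
As a rough illustration of the resulting configuration (not of how the chart templates generate it), the fragment that sets [DEFAULT]/host could be rendered like this. The `render_host_override` helper and the example FQDN are assumptions made for this sketch:

```python
import configparser
import io
import socket


def render_host_override(fqdn: str) -> str:
    """Render an INI fragment that sets [DEFAULT]/host to the given FQDN."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = {"host": fqdn}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # "ctl1.example.com" is a made-up name; on a real node one would
    # pass socket.getfqdn() instead.
    print(render_host_override("ctl1.example.com"))
```

The service would then need to load this fragment in addition to its main configuration file; how that is wired up in the chart is outside the scope of this sketch.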

We will also need to follow up on the upstream change and make sure it is backported to stable/zed.

mnaser commented 12 months ago

https://review.opendev.org/q/I4181a6f1527c80bf356d6363300b2d420921e7fa

requested backports

ricolin commented 11 months ago

Tracking the Manila work to allow configuring the port host with separate configs: https://review.opendev.org/c/openstack/manila/+/897077

ricolin commented 11 months ago

we might be able to backport https://review.opendev.org/c/openstack/manila/+/897077 after all (see comments inside)

migmeneses commented 11 months ago

I am going to mark this issue as solved, as the bug fix has been released in the new version of Manila.

Regards

ricolin commented 11 months ago

We still need this issue to track https://review.opendev.org/c/openstack/manila/+/897077 and https://github.com/vexxhost/atmosphere/pull/668

mnaser commented 10 months ago

all solved now