Closed mnaser closed 2 years ago
Thanks for that, I left a review, sorry for the delay.
left a review, good progress @okozachenko1203
@okozachenko1203 could you try and check this locally since I think it's failing to go up properly
Sure, I have been doing tests in my lab.
@mnaser Please check my answers to your 3 comments on https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855. I didn't resolve them so we can continue the discussion.
Btw, after switching to separate RabbitMQ clusters per service, Zuul CI is failing because of a lack of resources.
It fails at the rabbitmq-octavia deployment.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 65m default-scheduler 0/8 nodes are available: 3 Insufficient cpu, 5 node(s) didn't match Pod's node affinity/selector.
root@ctl1:/home/ubuntu# kg rabbitmqcluster
NAME ALLREPLICASREADY RECONCILESUCCESS AGE
rabbitmq-barbican True True 101m
rabbitmq-cinder True True 93m
rabbitmq-glance True True 98m
rabbitmq-heat True True 75m
rabbitmq-keystone True True 105m
rabbitmq-neutron True True 88m
rabbitmq-nova True True 84m
rabbitmq-octavia False Unknown 71m
rabbitmq-senlin True True 77m
The resource spec of one RabbitMQ cluster is:
Limits:
cpu: 2
memory: 2Gi
Requests:
cpu: 1
memory: 2Gi
So I think we need to either increase the node spec or decrease the RabbitMQ cluster resource spec.
Which one do you prefer?
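If we go the second route, a minimal sketch of what a reduced spec could look like on the RabbitmqCluster custom resource (the numbers here are illustrative assumptions, not agreed values):

```yaml
# Hypothetical reduced resource spec for one per-service RabbitMQ cluster,
# using the RabbitMQ Cluster Operator's spec.resources field.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-octavia
spec:
  resources:
    requests:
      cpu: 500m    # illustrative: half the current 1-CPU request
      memory: 1Gi  # illustrative: half the current 2Gi request
    limits:
      cpu: 1
      memory: 1Gi
```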
The octavia role is pending because of this: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/844271
@okozachenko1203 I see that this is merged. Do we still have blockers here?
I have added a comment related to the usage of `remote_group` and `remote_ip_prefix` here: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855
Actually, I'm thinking here… so the security groups are applied to the health manager ports created on each controller, right?
Then what actually happens is that the amphorae will have a security group there and will only contact those ports? If so, maybe the `remote_group` approach makes sense…
In that case, this PS is ready for continued review @mnaser
To fetch the tempest log at least: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/848943
openstack/octavia-housekeeping-d8978f76c-jbpgq[octavia-housekeeping]: 2022-07-18 16:43:50.554 1 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='172.24.2.91', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f21f76a27f0>, 'Connection to 172.24.2.91 timed out. (connect timeout=10.0)'))
@mnaser can you help me on this?
The operating status is ONLINE, but it's stuck with provisioning status PENDING_CREATE.
root@ctl1:/home/ubuntu# o loadbalancer list
+--------------------------------------+------+----------------------------------+---------------+---------------------+------------------+----------+
| id | name | project_id | vip_address | provisioning_status | operating_status | provider |
+--------------------------------------+------+----------------------------------+---------------+---------------------+------------------+----------+
| 8b83fbf5-5b52-47ea-90aa-e999cc6cd133 | lb1 | eca2e68ae45340f997824c695113ffab | 10.96.250.210 | PENDING_CREATE | ONLINE | amphora |
+--------------------------------------+------+----------------------------------+---------------+---------------------+------------------+----------+
root@ctl1:/home/ubuntu# o port list
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+
| ID                                   | Name                                            | MAC Address       | Fixed IP Addresses                                                           | Status |
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+
| 22c145cd-b12b-40aa-bbfb-944c64c60758 |                                                 | fa:16:3e:d5:2a:63 | ip_address='172.24.0.2', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
| 2e4f48b8-6c5f-4c76-854d-056f3a008d10 | octavia-health-manager-port-ctl2                | fa:16:3e:72:c9:13 | ip_address='172.24.1.104', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| 4f0ce54e-634f-40c1-8a76-0c1d40a2863c |                                                 | fa:16:3e:16:83:80 | ip_address='172.24.2.150', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| 8875ddcb-8230-430f-a9ef-bbe74fadbfa4 | octavia-health-manager-port-ctl3                | fa:16:3e:65:e8:79 | ip_address='172.24.1.208', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| a7e5d08e-9ac3-4f27-bd34-d2cf2d0b4612 | octavia-lb-8b83fbf5-5b52-47ea-90aa-e999cc6cd133 | fa:16:3e:4f:b2:48 | ip_address='10.96.250.210', subnet_id='a9f5e3bc-41e8-4746-acf4-4c65c65a5755' | DOWN   |
| ce402260-fa7a-42e5-9e61-9be844e601cd |                                                 | fa:16:3e:5e:1a:28 | ip_address='172.24.0.4', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
| d3efde73-5bae-40dc-bfae-9ef9ea11b79a | octavia-health-manager-port-ctl1                | fa:16:3e:76:55:fa | ip_address='172.24.3.212', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| ff44e947-5b39-4a0c-9d48-96663633cf9d |                                                 | fa:16:3e:72:8f:19 | ip_address='172.24.0.3', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+
But the LB port is still DOWN. I'm not sure whether the DOWN port is the cause of the `pending_create` status or a result of it.
- I fixed the security groups properly
`5555` should be reachable from all amphora machines, so we can set `remote_ip_prefix` to the subnet's CIDR for `lb-health-mgr-sec-grp`.
`9443` should be reachable from the health manager and housekeeping, so we can set `remote_ip_prefix` using the controller ports' IPs (in the lb-mgmt net).
- Current status
I tried troubleshooting this `pending_create` provisioning status; all ports for heartbeat packets and the amphora's API are reachable on the controllers. I captured packets on the lb-mgmt network and couldn't find any UDP heartbeat packet sent from the amphora to the health manager.
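As a quick sanity check of the `remote_ip_prefix` idea, we can verify that the amphora and health-manager addresses fall inside the lb-mgmt subnet. The `172.24.0.0/22` CIDR below is my guess from the addresses in the port list above, not a value taken from the deployment:

```python
import ipaddress

# Assumed lb-mgmt subnet CIDR, inferred from the 172.24.0.x-172.24.3.x
# addresses in the port list; the real deployment value may differ.
lb_mgmt = ipaddress.ip_network("172.24.0.0/22")

amphora_ip = ipaddress.ip_address("172.24.2.150")     # amphora port
health_mgr_ip = ipaddress.ip_address("172.24.1.104")  # octavia-health-manager-port-ctl2

# Both must fall inside the CIDR for a subnet-based remote_ip_prefix rule to
# allow heartbeat (5555/udp) and amphora API (9443/tcp) traffic.
print(amphora_ip in lb_mgmt, health_mgr_ip in lb_mgmt)  # True True
```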
https://access.redhat.com/solutions/4942351
I wanted to check if this is the same case for us.
I created my own keypair, configured Octavia to use it for the amphora, and tried to access the amphora via SSH.
I can telnet to port 22 and the key is correct, but the connection is suddenly closed.
root@ctl1:/home/ubuntu# ssh ubuntu@172.24.2.150 -vvv
OpenSSH_8.2p1 Ubuntu-4ubuntu0.3, OpenSSL 1.1.1f 31 Mar 2020
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for
debug2: resolve_canonicalize: hostname 172.24.2.150 is address
debug2: ssh_connect_direct
debug1: Connecting to 172.24.2.150 [172.24.2.150] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type 0
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.3
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 172.24.2.150:22 as 'ubuntu'
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,ext-info-c
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,ssh-ed25519,sk-ssh-ed25519@openssh.com,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC:
I checked the MTU but it is OK.
I created another VM on the lb-mgmt network, using both the amphora image and a cirros image, and tried to access it via SSH, but the same issue happened.
I think there is some issue in the lb-mgmt network, but I'm not sure what it is.
I compared against the upstream document for network creation (https://docs.openstack.org/octavia/latest/install/install-ubuntu.html, section 7.); I can see `o-hm0` but cannot find `o-bhm0` on the controllers, even on our public clouds.
From the Octavia logs, I can see only these warnings from housekeeping and health-manager:
openstack/octavia-housekeeping-56f5c48cfc-26f6m[octavia-housekeeping]: 2022-07-19 10:52:35.777 1 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='172.24.0.17', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f68001a2490>: Failed to establish a new connection: [Errno 113] No route to host'))
openstack/octavia-health-manager-default-rp98j[octavia-health-manager]: 2022-07-19 10:52:35.768 2210098 WARNING octavia.controller.healthmanager.health_manager [-] Load balancer 8b83fbf5-5b52-47ea-90aa-e999cc6cd133 is in immutable state PENDING_CREATE. Skipping failover.
Housekeeping tries to connect to the API of non-existing amphorae. I cannot find such a warning for the existing amphora.
@okozachenko1203 I think in this case, the issue is MTU.. let me propose a theory:
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
inet 10.96.240.110/24 brd 10.96.240.255 scope global dynamic ens3
valid_lft 81199sec preferred_lft 81199sec
The interface that runs the VXLAN network is running with 1450 MTU, but the `o-hm0` interface, which runs the VXLAN network for Octavia, is also running with 1450 MTU. That means when a full 1450-MTU packet needs to leave the `o-hm0` interface, it would have to leave the `ens3` interface as a 1500-MTU packet.. and that can't happen.
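The arithmetic behind this theory, assuming the usual ~50-byte VXLAN encapsulation overhead (outer Ethernet 14 + IP 20 + UDP 8 + VXLAN header 8):

```python
# Each VXLAN encapsulation layer costs ~50 bytes of headroom.
VXLAN_OVERHEAD = 14 + 20 + 8 + 8  # outer Ethernet + IP + UDP + VXLAN = 50

physical_mtu = 1500
tenant_mtu = physical_mtu - VXLAN_OVERHEAD  # ens3 inside the VM: 1450
nested_mtu = tenant_mtu - VXLAN_OVERHEAD    # o-hm0's VXLAN over ens3: 1400

# A full 1450-byte packet leaving o-hm0 would need a 1500-byte frame on
# ens3, whose MTU is only 1450 -- so it is dropped or needs fragmentation.
print(tenant_mtu, nested_mtu)  # 1450 1400
```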
I've also tested the following:
# ping -M do -s 1422 172.24.1.104
PING 172.24.1.104 (172.24.1.104) 1422(1450) bytes of data.
^C
--- 172.24.1.104 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6128ms
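For reference, the `-s 1422` payload size maps exactly to a full 1450-byte IP packet once the ICMP and IP headers are added, which is why ping reports `1422(1450)`:

```python
# ping -s gives the ICMP payload size; the wire-level IP packet adds the
# 8-byte ICMP header and 20-byte IPv4 header on top of it.
payload = 1422
ip_packet = payload + 8 + 20

print(ip_packet)  # 1450 -- the largest packet the 1450-MTU path should carry
```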
So you can see that full-MTU packets do not go through. However, this is something we improved in our cloud recently!
https://vexxhost.com/blog/9000mtus-jumbo-frames-public-cloud/
So I think if you delete this stack and recreate it, you will get internal interfaces with 9000 MTU, and then pings with larger packets will work. I think that will resolve the issue, because the timeouts are probably happening because HTTPS traffic uses near-full-MTU packets.
@mnaser thanks. 👍 I will recreate.
@mnaser here is the first commit https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855