vexxhost / atmosphere

Simple & easy private cloud platform featuring VMs, Kubernetes & bare-metal

Add role for Octavia #13

Closed mnaser closed 2 years ago

okozachenko1203 commented 2 years ago

@mnaser here is the first commit https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855

mnaser commented 2 years ago

Thanks for that, I left a review, sorry for the delay.

okozachenko1203 commented 2 years ago

https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/842384

mnaser commented 2 years ago

left a review, good progress @okozachenko1203

mnaser commented 2 years ago

@okozachenko1203 could you try and check this locally since I think it's failing to go up properly

okozachenko1203 commented 2 years ago

> @okozachenko1203 could you try and check this locally since I think it's failing to go up properly

Sure, I have been doing tests in my lab.

okozachenko1203 commented 2 years ago

@mnaser Please check my answers to your 3 comments on https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855. I didn't resolve them so we can continue the discussion.

Btw, after switching to separate RabbitMQ clusters per service, the Zuul CI is failing because of a lack of resources. It fails at the rabbitmq-octavia deployment.

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  65m   default-scheduler  0/8 nodes are available: 3 Insufficient cpu, 5 node(s) didn't match Pod's node affinity/selector.
root@ctl1:/home/ubuntu# kg rabbitmqcluster
NAME                ALLREPLICASREADY   RECONCILESUCCESS   AGE
rabbitmq-barbican   True               True               101m
rabbitmq-cinder     True               True               93m
rabbitmq-glance     True               True               98m
rabbitmq-heat       True               True               75m
rabbitmq-keystone   True               True               105m
rabbitmq-neutron    True               True               88m
rabbitmq-nova       True               True               84m
rabbitmq-octavia    False              Unknown            71m
rabbitmq-senlin     True               True               77m

The resource spec of one RabbitMQ cluster is:

    Limits:
      cpu:     2
      memory:  2Gi
    Requests:
      cpu:      1
      memory:   2Gi

So I think we need to either increase the node spec or decrease the RabbitMQ cluster resource spec.

Which one do you prefer?
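A hypothetical first-fit sketch of why the last cluster fails to schedule; the per-node free CPU figures below are made-up assumptions, while the nine RabbitmqCluster pods and the 1-CPU request per cluster come from the outputs above:

```python
# A hypothetical first-fit sketch of the scheduling failure. The per-node
# free CPU figures are made-up assumptions; the nine RabbitmqCluster pods
# and the 1-CPU request per cluster come from the outputs above.
free_cpu = [3.0, 3.0, 2.5]   # assumed free CPU on the 3 nodes matching affinity
requests = [1.0] * 9         # keystone, barbican, glance, cinder, neutron,
                             # nova, senlin, heat, octavia: 1 CPU request each

placed = 0
for r in requests:
    for i, free in enumerate(free_cpu):
        if free >= r:        # first node with enough free CPU wins
            free_cpu[i] -= r
            placed += 1
            break

print(f"{placed}/{len(requests)} clusters scheduled")  # → 8/9 clusters scheduled
```

With these assumed numbers, eight clusters fit and the ninth pod stays Pending with "Insufficient cpu", which matches the rabbitmq-octavia state above; either more allocatable CPU per node or a smaller per-cluster request changes the outcome.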

okozachenko1203 commented 2 years ago

The Octavia role is pending because of this: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/844271

guilhermesteinmuller commented 2 years ago

> octavia role is pending because of this https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/844271

@okozachenko1203 I see that this has merged. Do we still have blockers here?

guilhermesteinmuller commented 2 years ago

I have added a comment related to the usage of `remote_group` and `remote_ip_prefix` here: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855

guilhermesteinmuller commented 2 years ago

> I have added a comment related to the usage of remote_group and remote_ip_prefix here https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/840855

In reality I'm thinking here… so the security groups are applied to the health manager ports created on each controller, right?

Then what actually happens is that the amphorae will have a security group there and will only contact those ports? If so, maybe the `remote_group` approach makes sense…

okozachenko1203 commented 2 years ago

In this case, this PS is ready for continued review @mnaser

okozachenko1203 commented 2 years ago

To fetch the Tempest log at least: https://review.opendev.org/c/vexxhost/ansible-collection-atmosphere/+/848943

okozachenko1203 commented 2 years ago

openstack/octavia-housekeeping-d8978f76c-jbpgq[octavia-housekeeping]: 2022-07-18 16:43:50.554 1 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='172.24.2.91', port=9443): Max retries exceeded with url: // (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f21f76a27f0>, 'Connection to 172.24.2.91 timed out. (connect timeout=10.0)'))

okozachenko1203 commented 2 years ago

@mnaser can you help me on this?

root@ctl1:/home/ubuntu# o port list
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+
| ID                                   | Name                                            | MAC Address       | Fixed IP Addresses                                                           | Status |
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+
| 22c145cd-b12b-40aa-bbfb-944c64c60758 |                                                 | fa:16:3e:d5:2a:63 | ip_address='172.24.0.2', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
| 2e4f48b8-6c5f-4c76-854d-056f3a008d10 | octavia-health-manager-port-ctl2                | fa:16:3e:72:c9:13 | ip_address='172.24.1.104', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| 4f0ce54e-634f-40c1-8a76-0c1d40a2863c |                                                 | fa:16:3e:16:83:80 | ip_address='172.24.2.150', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| 8875ddcb-8230-430f-a9ef-bbe74fadbfa4 | octavia-health-manager-port-ctl3                | fa:16:3e:65:e8:79 | ip_address='172.24.1.208', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| a7e5d08e-9ac3-4f27-bd34-d2cf2d0b4612 | octavia-lb-8b83fbf5-5b52-47ea-90aa-e999cc6cd133 | fa:16:3e:4f:b2:48 | ip_address='10.96.250.210', subnet_id='a9f5e3bc-41e8-4746-acf4-4c65c65a5755' | DOWN   |
| ce402260-fa7a-42e5-9e61-9be844e601cd |                                                 | fa:16:3e:5e:1a:28 | ip_address='172.24.0.4', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
| d3efde73-5bae-40dc-bfae-9ef9ea11b79a | octavia-health-manager-port-ctl1                | fa:16:3e:76:55:fa | ip_address='172.24.3.212', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'  | ACTIVE |
| ff44e947-5b39-4a0c-9d48-96663633cf9d |                                                 | fa:16:3e:72:8f:19 | ip_address='172.24.0.3', subnet_id='73e96218-92f9-44e5-be0e-8bb1edf33b19'    | ACTIVE |
+--------------------------------------+-------------------------------------------------+-------------------+------------------------------------------------------------------------------+--------+

But the LB port is still down. I'm not sure whether the down port is the cause of the `pending_create` status or a result of it.

- I fixed the security groups properly:
`5555` should be reachable from all amphora machines, so we can set `remote_ip_prefix` using the subnet's CIDR for `lb-health-mgr-sec-grp`.
`9443` should be reachable from the health manager and housekeeping, so we can set `remote_ip_prefix` using the controller ports' IPs (in the lb-mgmt net).
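The intended reachability can be sketched as a small model. The lb-mgmt CIDR (`172.24.0.0/22`) is an assumption for illustration; the controller health-manager port IPs are taken from the `o port list` output in this thread:

```python
import ipaddress

# Hypothetical sketch of the intended security group rules described above.
# The lb-mgmt subnet CIDR is an assumption; the controller health-manager
# port IPs come from the `o port list` output in this thread.
LB_MGMT_CIDR = ipaddress.ip_network("172.24.0.0/22")  # assumed CIDR
CONTROLLER_HM_IPS = {"172.24.1.104", "172.24.1.208", "172.24.3.212"}

def heartbeat_allowed(src_ip: str) -> bool:
    """UDP 5555 on the health manager: open to the whole amphora subnet."""
    return ipaddress.ip_address(src_ip) in LB_MGMT_CIDR

def amphora_api_allowed(src_ip: str) -> bool:
    """TCP 9443 on the amphora: open only to the controller ports."""
    return src_ip in CONTROLLER_HM_IPS

# Any amphora in the subnet may send heartbeats...
print(heartbeat_allowed("172.24.2.150"))    # → True
# ...but only controllers may reach the amphora REST API.
print(amphora_api_allowed("172.24.2.150"))  # → False
print(amphora_api_allowed("172.24.1.104"))  # → True
```

The asymmetry is the point: the heartbeat rule is wide because any amphora may appear anywhere in the subnet, while the API rule can be pinned to the fixed controller port IPs.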

- Current status
I tried troubleshooting this `pending_create` provisioning status: all ports for heartbeat packets and the amphora's API are reachable on the controllers. I captured packets on the lb-mgmt network and couldn't find any UDP heartbeat packet sent from the amphorae to the health manager.

https://access.redhat.com/solutions/4942351
I wanted to check whether this is the same case for us.
I created my own keypair, configured Octavia to use it for the amphorae, and tried to access an amphora via SSH.
I can telnet to port 22 and the key is correct, but the connection is closed suddenly.

root@ctl1:/home/ubuntu# ssh ubuntu@172.24.2.150 -vvv
OpenSSH_8.2p1 Ubuntu-4ubuntu0.3, OpenSSL 1.1.1f  31 Mar 2020
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for
debug2: resolve_canonicalize: hostname 172.24.2.150 is address
debug2: ssh_connect_direct
debug1: Connecting to 172.24.2.150 [172.24.2.150] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type 0
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.2p1 Ubuntu-4ubuntu0.3
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.5
debug1: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.5 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 172.24.2.150:22 as 'ubuntu'
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,ext-info-c
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,ssh-ed25519,sk-ssh-ed25519@openssh.com,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

I checked the MTU but it looks OK.

I created another VM using the amphora image, and one using the CirrOS image, on the lb-mgmt network and tried to access them via SSH, but the same issue happened.

I think there is some issue in the lb-mgmt network, but I'm not sure what it is.
I compared against the upstream document for network creation (https://docs.openstack.org/octavia/latest/install/install-ubuntu.html, section 7): I can see `o-hm0` but cannot find `o-bhm0` on the controllers, even on our public clouds.

From the Octavia logs, I can see only these warnings from housekeeping and health-manager:

openstack/octavia-housekeeping-56f5c48cfc-26f6m[octavia-housekeeping]: 2022-07-19 10:52:35.777 1 WARNING octavia.amphorae.drivers.haproxy.rest_api_driver [-] Could not connect to instance. Retrying.: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='172.24.0.17', port=9443): Max retries exceeded with url: // (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f68001a2490>: Failed to establish a new connection: [Errno 113] No route to host'))

openstack/octavia-health-manager-default-rp98j[octavia-health-manager]: 2022-07-19 10:52:35.768 2210098 WARNING octavia.controller.healthmanager.health_manager [-] Load balancer 8b83fbf5-5b52-47ea-90aa-e999cc6cd133 is in immutable state PENDING_CREATE. Skipping failover.


Housekeeping tries to connect to the APIs of non-existent amphorae; I cannot find such a warning for the existing amphora.

mnaser commented 2 years ago

@okozachenko1203 I think in this case the issue is MTU. Let me propose a theory:

2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    inet 10.96.240.110/24 brd 10.96.240.255 scope global dynamic ens3
       valid_lft 81199sec preferred_lft 81199sec

The interface that carries the VXLAN network is running with a 1450 MTU, but the interface that carries the VXLAN network for Octavia, o-hm0, is also running with a 1450 MTU. That means a full 1450-byte packet leaving the o-hm0 interface would have to leave the ens3 interface as a 1500-byte packet, which can't happen.

I've also tested the following:

# ping -M do -s 1422 172.24.1.104
PING 172.24.1.104 (172.24.1.104) 1422(1450) bytes of data.
^C
--- 172.24.1.104 ping statistics ---
7 packets transmitted, 0 received, 100% packet loss, time 6128ms

So you can see that full-MTU packets do not get through. However, this is something we improved in our cloud recently!

https://vexxhost.com/blog/9000mtus-jumbo-frames-public-cloud/

So I think if you delete this stack and recreate it, you will get internal interfaces with a 9000 MTU, and then you will be able to ping with larger packets. I think that will resolve the issue, because the timeouts are probably happening because HTTPS exchanges full-MTU packets.
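The MTU arithmetic behind this theory can be sketched as follows. The 50-byte figure is the standard VXLAN-over-IPv4 encapsulation overhead; the interface MTUs and ping sizes come from the outputs above:

```python
# A sketch of the MTU arithmetic in the theory above. The 50-byte VXLAN
# overhead (inner Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8) is the
# standard IPv4 figure; the interface MTUs come from the `ip` output above.
VXLAN_OVERHEAD = 14 + 20 + 8 + 8   # = 50 bytes per encapsulated frame
ens3_mtu = 1450                    # underlay interface on the controller
o_hm0_mtu = 1450                   # Octavia lb-mgmt VXLAN interface

# A full-size packet leaving o-hm0 must be re-encapsulated before it can
# leave ens3, so it would need a 1500-byte MTU on the underlay:
encapsulated = o_hm0_mtu + VXLAN_OVERHEAD
print(f"needs {encapsulated} bytes on ens3, which only allows {ens3_mtu}")

# The ping test above: a 1422-byte payload + 8-byte ICMP header + 20-byte
# IP header is exactly the 1450-byte datagram that ping reports.
icmp_datagram = 1422 + 8 + 20
print(f"ping -s 1422 sends a {icmp_datagram}-byte IP packet")
```

This is why `ping -M do -s 1422` (a full 1450-byte datagram with don't-fragment set) is lost, while smaller pings work: only packets that push the inner interface to its full MTU overflow the underlay after encapsulation.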

okozachenko1203 commented 2 years ago

@mnaser thanks. 👍 I will recreate it.