osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

Test OpenStack 2024.1 deployment & upgrade #1073

Open berendt opened 1 month ago

berendt commented 1 month ago
flyersa commented 1 month ago

Can you send me detailed instructions, please, on which version I need to pick for the manager etc., and maybe "release upgrade notes"?

Once I have the info, I'll get it rolling.

berendt commented 1 month ago

Can you send me detailed instructions, please, on which version I need to pick for the manager etc., and maybe "release upgrade notes"?

I'll prepare it and put it in here. It will be ready in the middle of the week.

berendt commented 1 month ago

@maliblatt We'll use this issue for the 2024.1 deployment & upgrade test results.

berendt commented 1 month ago

Set the following parameters in environments/manager/configuration.yml and run make sync afterwards. Commit and push all changes, and pull the updated configuration repository on the test cluster. Update the manager with osism update manager as usual. Now you can deploy or upgrade the OpenStack 2024.1 services with osism apply -a upgrade X.

ceph_version: quincy
manager_version: latest
openstack_version: 2024.1
flyersa commented 1 month ago

Any particular changelogs for latest that I need to take into consideration when upgrading from 7.1.0 to latest?

Ceph, by the way, I will not test; all of our (stackxperts) environments, as you know, use external clusters. I don't think there was a lot of change with Ceph, was there?

berendt commented 1 month ago

Any particular changelogs for latest that I need to take into consideration when upgrading from 7.1.0 to latest?

So far I have only seen one secret that you have to add when using Skyline (prometheus_skyline_password).

Ceph, by the way, I will not test; all of our (stackxperts) environments, as you know, use external clusters. I don't think there was a lot of change with Ceph, was there?

That's fine.

maliblatt commented 1 month ago

I have updated an old test environment to 2024.1 without any issues, it seems. I will do some more testing next week. There are indeed only very few kolla-ansible upgrade notes that I had to take into account. One thing I have to take a closer look at is designate-sink:

The configuration variable designate_enable_notifications_sink has been changed to no. It configures notifications for designate in neutron and nova, and controls deployment of designate-sink, which is now optional.

Operators who want to keep the previous behavior should set this to true.

I hope I can give some more info about my testing next week.

berendt commented 1 month ago

@maliblatt I think it makes sense to set designate_enable_notifications_sink to true in our defaults to keep the old behavior.
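Until that default change lands, a minimal sketch of such an override, assuming the Kolla variables are managed in environments/kolla/configuration.yml of the configuration repository:

# Assumed location: environments/kolla/configuration.yml
# Keep the pre-2024.1 behavior: deploy designate-sink and keep sending
# designate notifications from neutron and nova.
designate_enable_notifications_sink: true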

flyersa commented 1 month ago

So I installed a completely fresh 2024.1 and also upgraded a 7.1.0 environment. Everything except Horizon is at least healthy (no functional tests at the moment), with a few caveats:

For Horizon I can't figure out why it doesn't work; it complains about memcache, I guess:

2024-07-20 13:04:18.001064 /var/lib/kolla/venv/lib/python3.10/site-packages/django/conf/__init__.py:267: RemovedInDjango50Warning: The USE_L10N setting is deprecated. Starting with Django 5.0, localized formatting of data will always be enabled. For example Django will display numbers and dates using the format of the current locale.
2024-07-20 13:04:18.001107   warnings.warn(USE_L10N_DEPRECATED_MSG, RemovedInDjango50Warning)
2024-07-20 13:04:18.122116 /var/lib/kolla/venv/lib/python3.10/site-packages/debreach/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
2024-07-20 13:04:18.122145   version_info = version.StrictVersion(__version__).version
2024-07-20 13:04:18.529227 Internal Server Error: /
2024-07-20 13:04:18.529258 Traceback (most recent call last):
2024-07-20 13:04:18.529261   File "/var/lib/kolla/venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
2024-07-20 13:04:18.529263     response = get_response(request)
2024-07-20 13:04:18.529264   File "/var/lib/kolla/venv/lib/python3.10/site-packages/horizon/middleware/simultaneous_sessions.py", line 30, in __call__
2024-07-20 13:04:18.529266     self._process_request(request)
2024-07-20 13:04:18.529267   File "/var/lib/kolla/venv/lib/python3.10/site-packages/horizon/middleware/simultaneous_sessions.py", line 37, in _process_request
2024-07-20 13:04:18.529269     cache_value = cache.get(cache_key)
2024-07-20 13:04:18.529270   File "/var/lib/kolla/venv/lib/python3.10/site-packages/django/core/cache/backends/memcached.py", line 75, in get
2024-07-20 13:04:18.529271     return self._cache.get(key, default)
2024-07-20 13:04:18.529273   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 347, in get
2024-07-20 13:04:18.529275     return self._run_cmd("get", key, default, default=default, **kwargs)
2024-07-20 13:04:18.529276   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 322, in _run_cmd
2024-07-20 13:04:18.529277     return self._safely_run_func(client, func, default_val, *args, **kwargs)
2024-07-20 13:04:18.529279   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 211, in _safely_run_func
2024-07-20 13:04:18.529280     result = func(*args, **kwargs)
2024-07-20 13:04:18.529282   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 687, in get
2024-07-20 13:04:18.529283     return self._fetch_cmd(b"get", [key], False, key_prefix=self.key_prefix).get(
2024-07-20 13:04:18.529284   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 1133, in _fetch_cmd
2024-07-20 13:04:18.529286     self._connect()
2024-07-20 13:04:18.529287   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 424, in _connect
2024-07-20 13:04:18.529289     sock.connect(sockaddr)
2024-07-20 13:04:18.529290 ConnectionRefusedError: [Errno 111] Connection refused

Also, it seems the reconfigure or deploy option does not copy /opt/configuration/environments/kolla/files/overlays/horizon/custom_local_settings properly to the hosts. It doesn't matter what I change in there, it never ends up on the hosts/containers. But this is expected behavior, as this was recently changed to a new format in Kolla (needs to be reflected in the release notes later).

I tried manually changing the CACHES location, which also had no effect. Memcached is running fine. I would love to know where it tries to connect...

Did your Horizon work, @maliblatt?

It looks to me like it expects memcached to listen on localhost. If I do a netcat listening on 127.0.0.1, as the base Python client of the Horizon code suggests, I get requests when I try to load Horizon:

root@ctrl01:/etc/kolla/horizon# nc -l 127.0.0.1 11211
get :1:user_pk_None_restrict

If I redirect 127.0.0.1:11211 to the appropriate Docker container, the horizon-error log just dumps source code for some reason, without any real error.

UPDATE:

Adding this to _9999-custom-settings.py solves the Horizon issue:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': 'MEMCACHEIP:11211',
    },
}
flyersa commented 1 month ago

Outside of the above I did most of the tests; Magnum with Cilium works now (despite still having issues with the Vexxhost CAPI driver, but that's not an OSISM issue). Skyline has a bug that prevents the creation of Magnum clusters: it passes the flavor ID instead of the flavor name, which Magnum expects.

Snapshots, backups and backup restore now work again; live migration, load balancers and the image manager all seem to be in working order.

Going to test VPNaaS with OVN. But it looks like I manually have to play around with the neutron-ovn-vpn-agent; I'm reusing the metadata container for testing for now:

| 8f06be74-f672-58cc-9f5f-84fdfbd7db10 | VPN Agent | hv01 | nova | :-) | UP | neutron-ovn-vpn-agent |

If I can get this to work, we need to build a proper neutron-ovn-vpn-agent container, or is there one already?

maliblatt commented 1 month ago

@flyersa I have the same situation with Horizon. Last week I only did the update itself and did not yet test the functionality. But I can confirm the same problem with Horizon connecting to memcache:

2024-07-22 06:59:44.426125 Internal Server Error: /
2024-07-22 06:59:44.426149 Traceback (most recent call last):
2024-07-22 06:59:44.426156   File "/var/lib/kolla/venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
2024-07-22 06:59:44.426162     response = get_response(request)
2024-07-22 06:59:44.426168   File "/var/lib/kolla/venv/lib/python3.10/site-packages/horizon/middleware/simultaneous_sessions.py", line 30, in __call__
2024-07-22 06:59:44.426174     self._process_request(request)
2024-07-22 06:59:44.426180   File "/var/lib/kolla/venv/lib/python3.10/site-packages/horizon/middleware/simultaneous_sessions.py", line 37, in _process_request
2024-07-22 06:59:44.426186     cache_value = cache.get(cache_key)
2024-07-22 06:59:44.426191   File "/var/lib/kolla/venv/lib/python3.10/site-packages/django/core/cache/backends/memcached.py", line 75, in get
2024-07-22 06:59:44.426197     return self._cache.get(key, default)
2024-07-22 06:59:44.426203   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 347, in get
2024-07-22 06:59:44.426208     return self._run_cmd("get", key, default, default=default, **kwargs)
2024-07-22 06:59:44.426214   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 322, in _run_cmd
2024-07-22 06:59:44.426220     return self._safely_run_func(client, func, default_val, *args, **kwargs)
2024-07-22 06:59:44.426225   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/hash.py", line 211, in _safely_run_func
2024-07-22 06:59:44.426231     result = func(*args, **kwargs)
2024-07-22 06:59:44.426236   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 687, in get
2024-07-22 06:59:44.426242     return self._fetch_cmd(b"get", [key], False, key_prefix=self.key_prefix).get(
2024-07-22 06:59:44.426248   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 1133, in _fetch_cmd
2024-07-22 06:59:44.426253     self._connect()
2024-07-22 06:59:44.426259   File "/var/lib/kolla/venv/lib/python3.10/site-packages/pymemcache/client/base.py", line 424, in _connect
2024-07-22 06:59:44.426264     sock.connect(sockaddr)
2024-07-22 06:59:44.426270 ConnectionRefusedError: [Errno 111] Connection refused
maliblatt commented 1 month ago

| 8f06be74-f672-58cc-9f5f-84fdfbd7db10 | VPN Agent | hv01 | nova | :-) | UP | neutron-ovn-vpn-agent |

If I can get this to work, we need to build a proper neutron-ovn-vpn-agent container, or is there one already?

Take a look at https://review.opendev.org/c/openstack/kolla/+/924302 ... It seems that a Dockerfile is already on the way :-)

berendt commented 1 month ago

| 8f06be74-f672-58cc-9f5f-84fdfbd7db10 | VPN Agent | hv01 | nova | :-) | UP | neutron-ovn-vpn-agent |

If I can get this to work, we need to build a proper neutron-ovn-vpn-agent container, or is there one already?

Take a look at https://review.opendev.org/c/openstack/kolla/+/924302 ... It seems that a Dockerfile is already on the way :-)

Yes, but the whole kolla-ansible part is still missing before that is merged. IMO this is not realistic as a backport for 2024.1. Not directly, at least.

berendt commented 1 month ago
  • designate_enable_notifications_sink true as default would be nice

PR pending. Not quite sure whether we should really have this as the default. With Neutron DNS integration, you don't really need it any more.

  • gnocchi has no available images?

Now online.

flyersa commented 1 month ago

Yes, but the whole kolla-ansible part is still missing before that is merged. IMO this is not realistic as a backport for 2024.1. Not directly, at least.

I don't think we need to do that and wait until it is completed in Kolla. But I still want to test the functionality for now. It's not a big problem, I can run it for testing in the metadata agent as it has all the parts installed anyway. So nothing you need to do here at the moment.

berendt commented 1 month ago

Adding this to _9999-custom-settings.py solves the Horizon issue:

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': 'MEMCACHEIP:11211',
    },
}

https://github.com/openstack/kolla-ansible/blob/stable/2024.1/ansible/roles/horizon/templates/_9998-kolla-settings.py.j2#L7-L22

Memcache is only enabled when horizon_backend_database is False. We set horizon_backend_database to True by default. It should not try to reach Memcached at all.
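For reference, a minimal sketch of pinning this explicitly (again assuming the Kolla variables live in environments/kolla/configuration.yml), to rule out the default being overridden somewhere:

# Assumed location: environments/kolla/configuration.yml
# With the database backend, Horizon stores sessions in the database and
# the Memcached-based CACHES default should not be required.
horizon_backend_database: true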

flyersa commented 1 month ago

https://github.com/openstack/kolla-ansible/blob/stable/2024.1/ansible/roles/horizon/templates/_9998-kolla-settings.py.j2#L7-L22

Memcache is only enabled when horizon_backend_database is False. We set horizon_backend_database to True by default. It should not try to reach Memcached at all.

Well, I don't have this variable set anywhere and it still has this issue.

berendt commented 1 month ago

Yes, but the whole kolla-ansible part is still missing before that is merged. IMO this is not realistic as a backport for 2024.1. Not directly, at least.

The missing kolla-ansible part: https://review.opendev.org/c/openstack/kolla-ansible/+/924575

berendt commented 1 month ago

This is the default value in Horizon:

SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        'LOCATION': '127.0.0.1:11211',
    },
}

SESSION_ENGINE is overwritten by SESSION_ENGINE = 'django.contrib.sessions.backends.db' by default. I think CACHES always has to be configured with working Memcached servers, not only when SESSION_ENGINE = 'django.contrib.sessions.backends.cache'.

Edit: Checked 2023.2. It's the same there. Maybe something changed in Horizon and Django. It works when CACHES is always set.

flyersa commented 1 month ago

So you are adding the ovn-vpn-agent change now as a backport, as I see the merge. If the image is available, let me know and I'll test ;)

berendt commented 1 month ago

So you are adding the ovn-vpn-agent change now as a backport, as I see the merge. If the image is available, let me know and I'll test ;)

Image + inventory group are available. I have not yet added the kolla-ansible part as a backport.

flyersa commented 1 month ago

Yeah, with manual changes I didn't have much luck, and I also replied on the OpenDev change. But the guy said it works with the Kolla part; I can't see how yet. It's not a major change to backport the Ansible part, or is it?

flyersa commented 1 month ago

I can confirm that OVN VPNaaS works with the backport.

(Screenshots attached.)

The only downside is still that, to this day, they only support these crappy old outdated PFS groups :/

flyersa commented 1 month ago

Anything else you want me to test?

berendt commented 1 month ago

The only downside is still that, to this day, they only support these crappy old outdated PFS groups :/

Can https://review.opendev.org/c/openstack/neutron-vpnaas/+/898830 help here?

flyersa commented 1 month ago

The only downside is still that, to this day, they only support these crappy old outdated PFS groups :/

Can https://review.opendev.org/c/openstack/neutron-vpnaas/+/898830 help here?

Yes, I have been following this for a while. I see it got some traction again this month. If they ever decide to add it, that will most likely solve it. I think they didn't add it because of some API changes that weren't backwards compatible, or something like that.

maliblatt commented 1 month ago

I also want to give some feedback on VPNaaS: it works for me too, I could establish an IPsec connection, created via Horizon, to a remote IPsec device. With my old test environment I cannot give any info about network throughput. I think as soon as we have the first tagged pre-release and deploy it on our plusserver dev environment, we can also give some details about performance etc.

Besides that, everything seems to be running smoothly :-)

flyersa commented 1 month ago

I noticed an issue with Magnum which may be there by default in the images but breaks it.

Apparently both magnum-capi-helm and magnum-cluster-api are installed, but it should only be magnum-cluster-api. These are two different implementations of the Magnum CAPI driver. magnum-capi-helm is the one from StackHPC, which isn't really working very well and also requires additional work on the CAPI k8s cluster before use. The other one is from Vexxhost, which works without any modifications.

Having both at the same time triggers funny behavior where it sometimes uses one or the other. magnum-capi-helm, for example, requires kube_version to be set on images, while magnum-cluster-api does not. So sometimes it triggers errors, sometimes not.

Also, delete will sometimes work and sometimes will not. I think we should only include magnum-cluster-api and skip the Helm one.

The Vexxhost driver also has a much better technical approach than the StackHPC one, which has a lot of its own dependencies on StackHPC repositories on GitHub, which is not really something we should have.

With only the Vexxhost driver and using Cilium, this pretty much works flawlessly, including rolling upgrades, autoscaling and co.

flyersa commented 1 month ago

Thank you @berendt. (Attached: magnum-with-capi.pdf)

There is still one issue with Magnum and SQLAlchemy which is being addressed right now; maybe a fix will be available before the release later.

https://bugs.launchpad.net/magnum/+bug/2067345

Also, can we maybe add a default override when magnum_enabled is true?

In order to work properly with normal users outside of the admin role, Nova needs a special policy to allow members to create servers from zero-disk flavors (the nature of some SCS flavors):

os_compute_api:servers:create:zero_disk_flavor: "role:admin or role:member"

This needs to be set in the Nova policy.yaml, otherwise normal member users cannot spawn CAPI instances.
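A minimal sketch of such an override, assuming the policy is shipped as a Nova overlay file (the exact path is an assumption here):

# Assumed location: environments/kolla/files/overlays/nova/policy.yaml
# Allow members (not only admins) to boot servers from zero-disk flavors,
# which some SCS flavors used for CAPI clusters require.
os_compute_api:servers:create:zero_disk_flavor: "role:admin or role:member"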

I also attached some tiny documentation on how to make CAPI work with Magnum at the moment to this reply. Maybe it helps someone, even if it doesn't really belong in this topic.

For the Horizon issues (related to Magnum) we will try to create a patch for OpenStack.

BTW, I also tested your Horizon CACHES changes; the Horizon deployment now works as expected.

flyersa commented 3 weeks ago

BTW, there are again newer CAPI drivers from Vexxhost available. The way to go should be to include the latest available version when building the images.

We have now rolled out Magnum to multiple customers with manual fixes on 2023.2, and it works very well.

berendt commented 3 weeks ago

BTW, there are again newer CAPI drivers from Vexxhost available. The way to go should be to include the latest available version when building the images.

We have now rolled out Magnum to multiple customers with manual fixes on 2023.2, and it works very well.

We install the latest available magnum-cluster-api package from PyPI in the latest 2023.2/2024.1 Magnum container images. Those images are rebuilt every night at the moment. I think the problem is that no newer release is available on PyPI: https://pypi.org/project/magnum-cluster-api/. The latest release there is from 19 July 2024. I would prefer not to install/use the main branch from https://github.com/vexxhost/magnum-cluster-api.

{% set magnum_base_additional_pip_packages = [ 'magnum-cluster-api' ] %}
dragon@testbed-manager:~$ docker run --rm -it quay.io/osism/magnum-api:2023.2 pip3 list | grep magnum-cluster-api
magnum-cluster-api      0.21.2
dragon@testbed-manager:~$ docker run --rm -it quay.io/osism/magnum-api:2024.1 pip3 list | grep magnum-cluster-api
magnum-cluster-api        0.21.2
flyersa commented 3 weeks ago

So the only showstopper is the SQLAlchemy problem they introduced with 2024.1; I can't find any other bug report than the Launchpad one. But it's unusable in 2024.1 in its current state.

berendt commented 3 days ago