oVirt / ovirt-ansible-collection

Ansible collection with official oVirt modules and roles

oVirt hosted engine installation failure because of missing netaddr for python 3.11 in ovirt-node image #695

Closed daliborfilus closed 1 year ago

daliborfilus commented 1 year ago

SUMMARY

I don't know if this is the correct place for this issue; please redirect me if it isn't.

COMPONENT NAME

oVirt hosted engine installation procedure.

STEPS TO REPRODUCE

As a new user... go to the download page.

1) Download the ovirt-node 4.5.4, 4.5.3 or 4.5.2 ISO (el8).
2) Install the ovirt-node image, reboot.
3) Run hosted-engine --deploy --4, fill in the form.
4) After ~twenty minutes, watch it hang on the "Wait for the host to be up" message... for another few tens of minutes. Then it crashes and rolls back the installation.
5) Optional steps: Pull your hair out trying to find something relevant on this issue via Google. Find nothing meaningful. Retry the whole installation multiple times with different versions of the image and network configuration. (I thought the issue was that the engine VM couldn't connect to the host, which led me to try to fix non-existent issues with my DNS and other network stuff.) I reinstalled the node image 6+ times in total. I re-ran the hosted-engine deploy commands 10+ times. Two days of life are gone.
6) Finally discover that you can go to the failed VM directly (it's still present and active) and hunt for logs there.
7) Discover /var/log/ovirt-engine/engine.log. See this inside:

2023-03-22 15:55:34,168+01 ERROR [org.ovirt.engine.core.bll.hostdeploy.InstallVdsInternalCommand] (EE-ManagedThreadFactory-engine-Thread-1) [9c814912-05ec-49eb-8e4f-912347e59f0d] Host installation failed for host 'e2fc0443-9a67-4c11-a11e-791903212bc2', 'ovirtnode1.b-one.cz': Task Install ovs failed to execute. Please check logs for more details: /var/log/ovirt-engine/host-deploy/ovirt-host-deploy-ansible-20230322155412-ovirtnode1.b-one.cz-9c814912-05ec-49eb-8e4f-912347e59f0d.log

Go to that log and see this inside:

  "stdout" : "fatal: [ovirt-node-1.lan]: FAILED! => {\"msg\": \"The conditional check 'cluster_switch == \\\"ovs\\\" or (ovn_central is defined and ovn_central | ipaddr)' failed. The error was: The ipaddr filter requires python's netaddr be installed on the ansible controller\\n\\nThe error appears to be in '/usr/share/ovirt-engine/ansible-runner-service-project/project/roles/ovirt-provider-ovn-driver/tasks/configure.yml': line 3, column 5, but may\\nbe elsewhere in the file depending on the exact syntax problem.\\n\\nThe offending line appears to be:\\n\\n- block:\\n  - name: Install ovs\\n    ^ here\\n\"}",

8) So you are telling me it's not because of networking or something hardcore, but a missing package? And you are telling me you couldn't just tell me this problem in the main output directly, without my having to hunt it down two layers deep?
9) Curse. Calm down. Find a single mention of this error... in a closed issue, in an archived repository. Hmmph.
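The log hunt in steps 6 and 7 can be shortened with a small helper run on the engine VM. This is only a sketch: find_deploy_failures is a made-up name, and it assumes the host-deploy logs sit under /var/log/ovirt-engine/host-deploy/ (as in the path quoted above) and that Ansible marks failed tasks with the string FAILED!, as in the excerpt.

```shell
# Hypothetical helper (not part of oVirt): list Ansible failure lines
# from every *.log file in a host-deploy log directory.
# Assumes failures are marked with "FAILED!", as in the excerpt above.
find_deploy_failures() {
    log_dir=${1:-/var/log/ovirt-engine/host-deploy}
    # -H prints the file name, -n the line number, -F matches literally.
    grep -HnF 'FAILED!' "$log_dir"/*.log 2>/dev/null
}
```

Calling find_deploy_failures with no argument searches the default directory; pass another directory to search a copy of the logs.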

The real issue is that the task Update all packages also updates ansible on the engine side. That new ansible is now pointing to python3.11, which doesn't have netaddr installed. There is no package python311-netaddr in the repositories. There also isn't even python311-pip.

Before the task "Update all packages", ansible points to python3.8. After that it's on 3.11.

Because the installation creates this new engine VM for you, you have to do any kind of fixing right after the updates are done, but before the ansible is run. The only workaround I found which worked was this:

cp -a /usr/lib/python3.8/site-packages/netaddr* /usr/lib/python3.11/site-packages/

(Optionally, you can install python39-netaddr and copy that instead.)
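To confirm which interpreter Ansible ended up on after the update, the version banner can be parsed. A minimal sketch, assuming that `ansible --version` prints a line of the form "python version = X.Y.Z ..." (that is what recent ansible-core releases print, but treat the exact format as an assumption); ansible_python_version is a made-up helper name.

```shell
# Hypothetical helper: read `ansible --version` output on stdin and
# print the major.minor Python version Ansible runs under.
# Assumes a line like "  python version = 3.11.2 (...)" is present.
ansible_python_version() {
    sed -n 's/^ *python version = \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage on the engine VM (commented out so the snippet has no side effects):
#   py=$(ansible --version | ansible_python_version)
#   "python$py" -c 'import netaddr' || echo "netaddr missing for python$py"
```

If the second command reports that netaddr is missing, you are most likely looking at the failure described in this issue.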

EXPECTED RESULTS

Installation doesn't fail. If it fails, it tells me the reason directly. A failed installation shouldn't require you to become an expert in oVirt installation details, procedure and internals.

ACTUAL RESULTS

"The host is not up. [..] Go see logs." What logs? Where? It doesn't say.

I'm now an expert in hosted engine installation. Where can I get my certificate? :-)

(P.S. I appreciate all work everyone is doing on this project. I'm just mad after having two days gone on this issue just because of some missing package.)

michalskrivanek commented 1 year ago

yeah, ansible "surprised" us by suddenly requiring python 3.11, which requires a fair number of packages to be rebuilt and released. no ETA yet. downgrading ansible would likely work. also, el9stream works fine.

michalskrivanek commented 1 year ago

and sorry about the frustrating experience, ansible has a history of these....

daliborfilus commented 1 year ago

I understand, the dependencies are not in your control and when they break, there's nothing you can do. (Except going the NixOS route of static dependencies and forbidding users from upgrading packages themselves.) Btw, the installation then failed on the engine's liveness check, and before that it displayed "Engine VM IP address is while the engine's he_fqdn ovirtengine.lan resolves to 192.168.88.150" (the IP address in the message was empty). I don't know why that is, but Google told me it might be because of a broken qemu version (I had qemu-6.2 installed). So either it's another broken dependency or it's (finally) something broken in my network config. I had this error on the ovirt-node-4.5.2 ISO; I don't know whether it is fixed in .3 or .4. But it still could be something wrong with my nested KVM setup.

(Offtopic: I need this ovirt installation because I need to write stats gathering app for it - to get host statistics (running VM's, etc.) and show that in Grafana. I found I could do that via vdsm, so that's why I'm installing it in the first place.) Because this is a test setup, I installed the ovirt node image inside bridged libvirt. I read everywhere that the engine's IP must be in the same subnet, although I don't know how the nested qemu can work with that without it being bridged too. But my network stuff knowledge is limited. Maybe the scripts assign the IP on the ovirt node host and passthrough to the nested VM? Don't know.

I gave up on the hosted version and am installing the engine manually right now as we speak. Fingers crossed...

michalskrivanek commented 1 year ago

posted https://github.com/oVirt/ovirt-engine/pull/826. It actually might be enough for the deployment problem. There's no one around who would remember why it was added. so... let's blindly drop it and see what happens... :)

as for the empty address, it means the VM didn't boot up. you'd have to check out the VM... if it's there at all, if it's not stuck in qemu (happens sometimes in CI), try getting to the serial console, maybe the OS has an issue

daliborfilus commented 1 year ago

Well, that was a quick fix. Thank you.

The VM was running (virsh list showed "HostedEngine"), but it didn't respond on my assigned IP (and net-dhcp-leases didn't show any lease either). I didn't think of the serial console; that could've done the trick. Well, I deleted the VM and went the non-hosted-install route, so I can't check anymore, sadly.

jameswadsworth commented 1 year ago

Same problem here running RHEL 8.7 with ovirt-engine 4.5.4. We were unable to add any additional hosts to the cluster beyond the host we used for the redeployment of the oVirt engine. We resolved it in the same way as @daliborfilus, by copying the netaddr module from python3.9 to python3.11. We lost a whole night's sleep trying to debug the issue. We are not ansible/python experts, but we know a lot more now!!

simmonscs commented 1 year ago

The option that worked for me was to edit the /etc/dnf/dnf.conf file on the engine VM and add the line exclude=ansible-core to prevent ansible from being updated when the Update all packages task runs. You can do this as soon as the local engine VM gets an IP address.
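If you want to script that edit, something like the following could work. It is a sketch only: add_dnf_exclude is a made-up helper, and it assumes [main] is the last (normally the only) section in /etc/dnf/dnf.conf, so that a line appended at the end still belongs to [main].

```shell
# Hypothetical helper: make sure PKG is listed on an exclude= line in a
# dnf.conf-style file. Safe to run more than once.
# Assumes [main] is the last (usually the only) section of the file.
add_dnf_exclude() {
    conf=$1
    pkg=$2
    # Already excluded: nothing to do.
    if grep -Eq "^exclude=(.*[ ,])?${pkg}([ ,]|\$)" "$conf"; then
        return 0
    fi
    if grep -q '^exclude=' "$conf"; then
        # Extend the existing exclude= line instead of adding a second one.
        tmp=$(mktemp) || return 1
        sed "s/^exclude=.*/& ${pkg}/" "$conf" > "$tmp" && cat "$tmp" > "$conf"
        rc=$?
        rm -f "$tmp"
        return $rc
    else
        printf 'exclude=%s\n' "$pkg" >> "$conf"
    fi
}
```

On the engine VM this would be run as root as add_dnf_exclude /etc/dnf/dnf.conf ansible-core, before the Update all packages task starts.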

laduchesneau commented 1 year ago

I lost two days trying to figure out why I couldn't rebuild my lab. Like the OP, I was looking in the wrong direction.

@simmonscs wrote: "The option that worked for me was to edit the /etc/dnf/dnf.conf file on the engine VM and add the line exclude=ansible-core to prevent ansible from being updated when the Update all packages task runs. You can do this as soon as the local engine VM gets an IP address."

The proposed workaround worked for me.

mnecas commented 1 year ago

Just rebuilt the ovirt-ansible-collection with python3.11 for el8; it took some time because of the deps. Please let me know if release 3.1.2-1 works for you: https://github.com/oVirt/ovirt-ansible-collection/pull/697

nodespar commented 1 year ago

@mnecas Are there instructions on how to build the ansible collection with python3.11? I've been pulling my hair out for the past week trying to get this installed.

blablak commented 1 year ago

I found a workaround for this issue. You should start the deployment with:

hosted-engine --deploy --4 --ansible-extra-vars=he_pause_before_engine_setup=true

When the deployment pauses, you should connect to the VM using ssh and install the missing dependency:

dnf install python3.11-pip.noarch
python3.11 -m pip install netaddr
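While the deployment is paused, the pip step can be made safe to repeat by checking the import first. A rough sketch, assuming pip is already available for the interpreter (the comment above gets it from python3.11-pip); ensure_python_module is a made-up helper name, not an oVirt tool.

```shell
# Hypothetical helper: install a Python module with pip only if the
# given interpreter cannot import it yet.
# Assumes pip is already installed for that interpreter.
ensure_python_module() {
    py=$1
    mod=$2
    if "$py" -c "import $mod" 2>/dev/null; then
        echo "$mod already importable by $py"
        return 0
    fi
    echo "$mod missing for $py, installing with pip"
    "$py" -m pip install "$mod"
}
```

Inside the engine VM this would be called as ensure_python_module python3.11 netaddr, after which the deployment can be resumed.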

mnecas commented 1 year ago

@nodespar you can just install it from the CBS [1] or Copr [2] repo. I have already built the collection with python3.11. I don't know the release schedule right now, so I don't know when it will be in the 4.5 repo.

[1] https://cbs.centos.org/koji/buildinfo?buildID=43404
[2] https://copr.fedorainfracloud.org/coprs/ovirt/ovirt-master-snapshot/

deepakramanath commented 1 year ago

I'm having the same issue as you reported: oVirt 4.5 on EL8.

deepakramanath commented 1 year ago

@blablak wrote: "I found a workaround for this issue. You should start the deployment with hosted-engine --deploy --4 --ansible-extra-vars=he_pause_before_engine_setup=true. When the deployment pauses, you should connect to the VM using ssh and install the missing dependency: dnf install python3.11-pip.noarch, then python3.11 -m pip install netaddr."

I want to give this a try. Do you mean ssh into the engine VM and then install the missing dependency on the engine?

daliborfilus commented 1 year ago

From reddit, "Host is not up" issue: https://www.reddit.com/r/ovirt/comments/12inghq/is_it_still_worth_it_running_ovirt/

ghost commented 1 year ago

When the installation is paused due to he_pause_before_engine_setup, you can grab the IP address of the engine VM, ssh into it, and run the commands.

bcostescu commented 1 year ago

I'd like to propose a different solution:

hosted-engine --deploy --4 --ansible-extra-vars=he_offline_deployment=true

This will prevent the HE VM from updating packages, such that ansible remains at the original, older version, which guarantees a working deployment (at least for the time being).

Once the deployment finishes, the HE VM will run CentOS Stream 8, as this is what ovirt-engine-appliance is based on. At this point, you can log on normally to the HE VM, using the IP or name you gave during deployment; you don't need to use the temporary IP. In my case, this was followed by (after logging on to the HE VM):

curl -O https://raw.githubusercontent.com/AlmaLinux/almalinux-deploy/master/almalinux-deploy.sh; bash almalinux-deploy.sh --downgrade

which switches the HE VM to AlmaLinux 8.7. This contains the same ansible version as the host, so things continue to work afterwards. Of course, if your goal is to keep running CentOS Stream 8 in the HE VM, this won't help and you're probably better off with installing the missing python 3.11 module.

michalskrivanek commented 1 year ago

#704 dropped netaddr for good, I hope

michalskrivanek commented 1 year ago

with https://www.mail-archive.com/users@ovirt.org/msg72302.html in mind (ovirt-master-snapshot with the node image from https://resources.ovirt.org/repos/ovirt/github-ci/ovirt-node-ng-image/ ), it should work from now on on el8stream too. el9stream has been working for a while.

daliborfilus commented 1 year ago

Thank you. I agree with going nightly for these cases, because you practically go "nightly" in "stable" too, due to the yum update included in the engine installation. So going "fully stable" from known versions (but risking security vulnerabilities, where the patches for them require a yum update anyway) and going "all latest" are both valid options.

sea2space commented 1 year ago

Hello. Thank the gods I found this thread!

Also killed a couple of days on this problem. Installation from test build 4.5.5 did not help.

The solution suggested by @blablak helped!

michalskrivanek commented 1 year ago

@sea2space what didn't work for you exactly? can you describe the exact OS and package versions and what failed?

sea2space commented 1 year ago

@michalskrivanek wrote: "@sea2space what didn't work for you exactly? can you describe the exact OS and package versions and what failed?"

Update:

Downloaded ovirt-node-ng-installer-4.5.5-2023050307.el8.iso from https://resources.ovirt.org/repos/ovirt/github-ci/ovirt-node-ng-image/ and ran into another error.

Deploy HE again. Now all good. Sorry, my bad.

Waiting for the official stable 4.5.5 )

kriipke commented 1 year ago

@blablak wrote: "I found a workaround for this issue. You should start the deployment with hosted-engine --deploy --4 --ansible-extra-vars=he_pause_before_engine_setup=true. When the deployment pauses, you should connect to the VM using ssh and install the missing dependency: dnf install python3.11-pip.noarch, then python3.11 -m pip install netaddr."

^^ This right here worked for me. Finally wrapping up this "afternoon project" 48 hours later smh. Fixed the following error:

The conditional check 'cluster_switch == "ovs" or (ovn_central is defined and ovn_central | ipaddr)' failed.
The error was: The ipaddr filter requires python's netaddr be installed on the ansible controller.

nodespar commented 1 year ago

Is ovirt a dying project? I'm just surprised there has not been a major release with this fix. Anyone coming new to try ovirt cannot install this and go forward.

deepakramanath commented 1 year ago

This method does not work! I have tried it, and it is not a verified method to mitigate the issue.

On Fri, 19 May 2023 at 03:56, Spencer Smolen wrote:

I found a workaround for this issue. You should start the deployment with hosted-engine --deploy --4 --ansible-extra-vars=he_pause_before_engine_setup=true. When the deployment pauses, you should connect to the VM using ssh and install the missing dependency: dnf install python3.11-pip.noarch, then python3.11 -m pip install netaddr.

^^ This right here worked for me. Finally wrapping up this "afternoon project" 48 hours later smh. Fixed the following error:

The conditional check 'cluster_switch == "ovs" or (ovn_central is defined and ovn_central | ipaddr)' failed. The error was: The ipaddr filter requires python's netaddr be installed on the ansible controller.


Ecsi1337 commented 1 year ago

Currently, the only working solution is to start with both the hosts and the hosted-engine on version 4.4.10, then update the hosted-engine to 4.5.4 first, and then the hosts. Importantly, you cannot add a 4.4.10 host to a 4.5.4 hosted-engine; everything must be started from 4.4.10.

jorgevisentini commented 1 year ago

Unfortunately, the problem still persists.... Any workaround? Any tips?

Tested the stable iso 4.4.10, 4.5.4, 4.5.3.2

jorgevisentini commented 1 year ago

@Ecsi1337 wrote: "Currently, the only working solution is to have both the hosts and the hosted-engine version 4.4.10, and then update the hosted-engine to 4.5.4 first, and then the hosts. It is important that you cannot add a 4.4.10 host to the 4.5.4 hosted-engine, everything must be started from 4.4.10."

I tested it just now and it didn't work. In the Engine VM, I added the line "exclude=ansible-core" in the /etc/dnf/dnf.conf file, as @simmonscs commented.

jorgevisentini commented 1 year ago

Just an update... I tested with ovirt-node-ng-installer-4.5.5-2023070606.el9 and it worked.

I believe that the next releases will work fine... Just a tip: download the CentOS Stream 9 repo, because we don't know whether it will change, you know? lol

cgoudie commented 11 months ago

I don't quite know why this issue is closed. Problem persists in Sept 2023 installing ovirt hosts. EL8 and EL9 hosts (Centos Stream)

Edit: To fix you must go to your ovirt host and dnf install python3.11-netaddr.noarch (ovirt host installer didn't automatically update this package -- seems a dependency is missing)

dim-nail commented 11 months ago

Hi, if you are trying to install from the CentOS repo, then upgrade the ovirt-engine-appliance RPM; the repo carries a buggy version. I installed ovirt-engine-appliance-4.5-20231009063645.1.el9.x86_64.rpm (https://resources.ovirt.org/repos/ovirt/github-ci/ovirt-appliance/el9/) and everything installed without any problems or workarounds.

vladsol commented 10 months ago

oVirt Node 4.5.4 (stable? :) )

python-netaddr version: 0.9.0 (tried 0.8.0-5.el9 also) same problem: The ipaddr filter requires python's netaddr be installed on the ansible controller.

Tried ovirt 4.5.4 el8 - same problem.

mwperina commented 8 months ago

python-netaddr dependency was removed from oVirt Ansible Collection in 3.1.3 release: https://github.com/oVirt/ovirt-ansible-collection/pull/696

Please make sure you are using the latest available packages during installation. When installing oVirt Hosted Engine from oVirt Engine Appliance you need to pause the deployment using he_pause_before_engine_setup and perform a dnf update (more details can be found at https://www.ovirt.org/documentation/installing_ovirt_as_a_self-hosted_engine_using_the_command_line/index.html#Deploying_the_Self-Hosted_Engine_Using_the_CLI_install_RHVM)

safodz commented 6 months ago

@daliborfilus Thanks for your workaround. I followed the steps and it passed, but then I got the following error. Can you please advise?

[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Obtain SSO token using username/password credentials]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Check if the host is up]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Set host_id]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Collect error events from the Engine]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Generate the error message from the engine events]
[ INFO ] ok: [localhost]
[ INFO ] TASK [ovirt.ovirt.hosted_engine_setup : Fail with error description]
[ ERROR ] fatal: [localhost]: FAILED! => {"changed": false, "msg": "The host has been set in non_operational status, deployment errors: code 505: Host rhv2........local installation failed. Failed to configure management network on the host., code 519: Host rhv2.cloudone.cloudz.local does not comply with the cluster Default networks, the following networks are missing on host: 'ovirtmgmt', code 9000: Failed to verify Power Management configuration for Host rhv2.cloudone.cloudz.local., fix accordingly and re-deploy."}

Regards Sofiane

almaclang commented 6 months ago

We're also hitting the same issue. We tried all the suggested workarounds, but the deployment still fails.

safodz commented 6 months ago

@almaclang for me it works after applying this workaround on the engine VM (you access it with the temporary IP given during the installation). If it gets stuck, try opening all the firewall ports (1-9999 TCP/UDP) and verify that forward and reverse DNS resolution work for the hosts and the engine VM.
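The DNS part of that advice can be checked quickly from the host. A minimal sketch, assuming getent is available (it is on EL8/EL9) and that names are compared exactly as returned; resolve and check_forward_reverse are made-up helper names.

```shell
# Resolver used by the check; getent consults the same NSS sources
# (hosts file, DNS) as most system tools.
resolve() {
    getent hosts "$1"
}

# Hypothetical helper: succeed only if NAME resolves to an address and
# that address resolves back to NAME.
check_forward_reverse() {
    name=$1
    addr=$(resolve "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "forward lookup failed for $name"
        return 1
    fi
    # Reverse lookup: any of the returned names may match.
    if resolve "$addr" | awk -v want="$name" '
            { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
            END { exit !found }'; then
        echo "$name <-> $addr OK"
    else
        echo "reverse lookup of $addr does not return $name"
        return 1
    fi
}
```

Run it once per FQDN, for example check_forward_reverse ovirtengine.lan, for every host and for the engine VM.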

daliborfilus commented 6 months ago

As the OP of this issue, I'm unsubscribing from notifications, because the issue is closed and I no longer use oVirt. I understand it comes up in searches, and I think there should be some kind of FAQ / Discussion thread somewhere more prominent, instead of this closed issue.

almaclang commented 6 months ago

@safodz wrote: "@almaclang for me it worked after doing this workaround on the engine VM (you access it with a temporary IP given during the installation). If it gets stuck, try to open all the firewall ports (1-9999 TCP/UDP) and verify that DNS resolution and reverse DNS work for the hosts and the engine VM. My issue comes after this stage, with Gluster storage and the engine VM, and I still cannot detect the problem."

@safodz There's no issue with DNS; it can reach the public yum repo. It gets stuck during the package installation and upgrade, then suddenly goes to the "Wait for the host to be up" state.