redhat-cop / infra.leapp

Collection of Ansible roles for automating RHEL in-place upgrades using Leapp.
MIT License

playbook hanging with reboot phase #172

Open gcccheng opened 3 months ago

gcccheng commented 3 months ago

We are upgrading RHEL 7 to RHEL 8, and one of the machines has been hanging in the reboot phase for 3 hours (it normally takes 40 minutes). See the output from the playbook job below:

..
TASK [infra.leapp.upgrade : Start Leapp OS upgrade] ****
ASYNC POLL on lxrp001.example.com: jid=j641575276660.2466 started=1 finished=0
Waiting for job to be finished. Sleeping for 10 minutes...
ASYNC POLL on lxrp001.example.com: jid=j641575276660.2466 started=1 finished=0
..
ASYNC OK on lxrp001.example.com: jid=j641575276660.2466
changed: [lxrp001.example.com]

TASK [infra.leapp.upgrade : Reboot to continue Leapp OS upgrade] ***
..

On the machine's console, we can see a login prompt for Red Hat Enterprise Linux 8.6 (Ootpa), Kernel 4.18.0-372.32.1.el8_6.x86_64 on an x86_64.

It seems the upgrade partially finished and then got stuck in the middle.

I checked the infra/leapp/roles/upgrade/tasks/leapp-upgrade.ym file, and it has the timeout set to 43260, which is about 12 hours (43260 seconds is 12 hours and 1 minute).

djdanielsson commented 3 months ago

I think the original idea was to set it to an extremely long time, so that if there is any chance the system is still functioning and might come back, it has the time to do so. We should probably turn those timeouts into variables so that users can set them to whatever they want to allow.
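
As a rough illustration of what variable-izing the timeout could look like, here is a minimal sketch. It assumes the reboot task uses the `ansible.builtin.reboot` module, and the variable name `leapp_reboot_timeout` is hypothetical, not an existing variable of this collection. The default keeps the current value of 43260 so behavior is unchanged unless the user overrides it.

```yaml
# Hypothetical sketch: leapp_reboot_timeout is an assumed variable name,
# not something the infra.leapp collection is known to define.
- name: Reboot to continue Leapp OS upgrade
  ansible.builtin.reboot:
    # Seconds to wait for the host to come back after the reboot.
    # Falls back to the currently hard-coded value when the variable is unset.
    reboot_timeout: "{{ leapp_reboot_timeout | default(43260) | int }}"
```

With something like this in place, a user who expects the upgrade to take about 40 minutes could pass, for example, `-e leapp_reboot_timeout=7200` so that a stuck host fails the job after 2 hours instead of 12.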

jeffmcutter commented 2 months ago

I agree we should variable-ize those. I can add it to the list, but if someone else wants to have a crack at it, chime in and have at it.

jeffmcutter commented 2 months ago

I'm starting work on this now. Hopefully no duplicates.