usegalaxy-eu / vgcn-infrastructure

Manage the VGCN Infrastructure https://github.com/usegalaxy-eu/vgcn/
GNU General Public License v3.0

packer-builder-qemu plugin: [ERROR] Remote command exited without exit status or exit signal. #178

Open amizeranschi opened 1 year ago

amizeranschi commented 1 year ago

Hello,

I am trying to build a Rocky Linux VGCN image on Ubuntu using Packer v. 1.8.5 (installed via apt) and QEMU built from the latest sources.

TL;DR: the main issue is at the end, but I'm also reporting a couple of minor hurdles I experienced on the way.

When running make for the first time after cloning this repository, Packer complains about the JSON template. It also suggests how to fix it:

$ make rockylinux-8.x-x86_64/base
** Building template 'rockylinux-8.x-x86_64' using 'qemu' **
/bin/packer build -only=qemu \
    -var-file=base.json \
    -var='vm_name=rockylinux-8.x-x86_64' \
    rockylinux-8.x-x86_64.json
Error: Failed to prepare build: "qemu"

1 error occurred:
    * Deprecated configuration key: 'iso_checksum_type'. Please call `packer fix`
against your template to update your template to be compatible with the current
version of Packer. Visit https://www.packer.io/docs/commands/fix/ for more
detail.
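
For readers who want to see what this deprecation amounts to: newer Packer versions expect the checksum type as a prefix of `iso_checksum` (the `sha256%3A...` query string in the download log further down shows the same `type:value` form). The following is a minimal Python sketch of that transformation, not the actual `packer fix` implementation. The function name `merge_iso_checksum_type` is hypothetical, and it assumes the keys sit directly in each entry of the template's `builders` list.

```python
import copy


def merge_iso_checksum_type(template: dict) -> dict:
    """Return a copy of a Packer JSON template without 'iso_checksum_type'.

    Assumption: both keys live directly in each entry of "builders".
    This only approximates what `packer fix` does.
    """
    fixed = copy.deepcopy(template)
    for builder in fixed.get("builders", []):
        checksum_type = builder.pop("iso_checksum_type", None)
        if checksum_type is None:
            continue  # nothing deprecated in this builder
        checksum = builder.get("iso_checksum")
        if checksum_type == "none":
            # Checksum verification was explicitly disabled.
            builder["iso_checksum"] = "none"
        elif checksum and ":" not in checksum:
            # Prefix the bare digest with its type, e.g. "sha256:<digest>".
            builder["iso_checksum"] = f"{checksum_type}:{checksum}"
    return fixed
```

In practice, running `packer fix` on the template as the error message suggests is the simpler route; the sketch only illustrates the shape of the change.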

After fixing the JSON template, it looks like the Rocky Linux ISO location has changed:

==> qemu: Retrieving ISO
==> qemu: Trying https://ftp.fau.de/rockylinux/8.6/isos/x86_64/Rocky-8.6-x86_64-boot.iso
==> qemu: Trying https://ftp.fau.de/rockylinux/8.6/isos/x86_64/Rocky-8.6-x86_64-boot.iso?checksum=sha256%3Afe77cc293a2f2fe6ddbf5d4bc2b5c820024869bc7ea274c9e55416d215db0cc5
==> qemu: Download failed bad response code: 404
==> qemu: error downloading ISO: [bad response code: 404]
Build 'qemu' errored after 321 milliseconds 836 microseconds: error downloading ISO: [bad response code: 404]

The new location is mentioned here: https://ftp.fau.de/rockylinux/8.6/README.txt
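
A quick way to check candidate ISO URLs before editing the template is to probe them with HEAD requests. This is a minimal sketch using only the Python standard library. The helper names `http_status` and `first_available` are hypothetical, and the list of candidate URLs has to be supplied by the caller (for example, the old URL plus whatever the README above points to).

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a moved ISO
    except OSError:
        return 0  # DNS failure, timeout, refused connection, ...


def first_available(urls, probe=http_status):
    """Return the first URL whose probe result is 200, or None if none respond."""
    for url in urls:
        if probe(url) == 200:
            return url
    return None
```

The `probe` parameter exists so the selection logic can be exercised without network access.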

After fixing the ISO location in the JSON template, the build starts and runs to the end, but it doesn't seem to finish successfully. This is the output:

==> qemu: Connected to SSH!
==> qemu: Gracefully halting virtual machine...
==> qemu: Converting hard drive...
==> qemu: Running post-processor: manifest
Build 'qemu' finished after 12 minutes 39 seconds.

==> Wait completed after 12 minutes 39 seconds

==> Builds finished. The artifacts of successful builds are:
--> qemu: VM files in directory: output-rockylinux-8.x-x86_64-qemu
--> qemu: VM files in directory: output-rockylinux-8.x-x86_64-qemu
make: [Makefile:46: rockylinux-8.x-x86_64/base] Error 1 (ignored)
** Success **

Attempting to launch the image in QEMU fails, with messages like Boot failed: not a bootable disk and No bootable device.
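
One thing that may be worth ruling out for the boot failure is a format mismatch: the artifact has no file extension, and a qcow2 file opened as a raw disk would not look bootable. The sketch below is only a guess at a first diagnostic step, not a confirmed cause. It inspects the first 512 bytes of a file for the qcow2 magic bytes or the classic `0x55AA` boot signature; the function names and the returned labels are hypothetical.

```python
QCOW2_MAGIC = b"QFI\xfb"      # first four bytes of a qcow2 image
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a boot sector (offsets 510-511)


def classify_header(header: bytes) -> str:
    """Classify the first sector of a disk image file."""
    if header[:4] == QCOW2_MAGIC:
        # The boot sector is inside the qcow2 container and not visible here.
        return "qcow2"
    if len(header) >= 512 and header[510:512] == BOOT_SIGNATURE:
        return "raw-with-boot-signature"
    return "unknown"


def describe_image(path: str) -> str:
    """Read the first 512 bytes of the file at `path` and classify them."""
    with open(path, "rb") as image:
        return classify_header(image.read(512))
```

A result of `qcow2` only says the container format is qcow2; whether the disk inside it has a valid boot sector would still need to be checked after converting the image to raw.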

To get debug information, I set export PACKER_LOG=1, and this was the end of the build process output:

==> qemu: Connected to SSH!
2023/01/20 10:29:06 packer-builder-qemu plugin: Running the provision hook
==> qemu: Gracefully halting virtual machine...
2023/01/20 10:29:06 packer-builder-qemu plugin: Executing shutdown command: systemctl poweroff
2023/01/20 10:29:06 packer-builder-qemu plugin: [DEBUG] Opening new ssh session
2023/01/20 10:29:06 packer-builder-qemu plugin: [DEBUG] starting remote command: systemctl poweroff
2023/01/20 10:29:06 packer-builder-qemu plugin: [ERROR] Remote command exited without exit status or exit signal.
2023/01/20 10:29:06 packer-builder-qemu plugin: Waiting max 5m0s for shutdown to complete
2023/01/20 10:29:07 packer-builder-qemu plugin: VM shut down.
==> qemu: Converting hard drive...
2023/01/20 10:29:07 packer-builder-qemu plugin: Executing qemu-img: []string{"convert", "-O", "qcow2", "output-rockylinux-8.x-x86_64-qemu/rockylinux-8.x-x86_64", "output-rockylinux-8.x-x86_64-qemu/rockylinux-8.x-x86_64.convert"}
2023/01/20 10:29:12 packer-builder-qemu plugin: stdout:
2023/01/20 10:29:12 packer-builder-qemu plugin: stderr:
2023/01/20 10:29:15 packer-builder-qemu plugin: failed to unlock port lockfile: close tcp 127.0.0.1:5949: use of closed network connection
2023/01/20 10:29:15 packer-builder-qemu plugin: failed to unlock port lockfile: close tcp 127.0.0.1:3441: use of closed network connection
2023/01/20 10:29:15 [INFO] (telemetry) ending qemu
==> qemu: Running post-processor: manifest
2023/01/20 10:29:15 [INFO] (telemetry) Starting post-processor manifest
2023/01/20 10:29:15 [INFO] (telemetry) ending manifest
2023/01/20 10:29:15 Flagging to keep original artifact from post-processor 'manifest'
==> Wait completed after 9 minutes 42 seconds
==> Builds finished. The artifacts of successful builds are:
2023/01/20 10:29:15 machine readable: qemu,artifact-count []string{"2"}
Build 'qemu' finished after 9 minutes 42 seconds.

==> Wait completed after 9 minutes 42 seconds

==> Builds finished. The artifacts of successful builds are:
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "builder-id", "transcend.qemu"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "id", "VM"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "string", "VM files in directory: output-rockylinux-8.x-x86_64-qemu"}
--> qemu: VM files in directory: output-rockylinux-8.x-x86_64-qemu
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "files-count", "1"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "file", "0", "output-rockylinux-8.x-x86_64-qemu/rockylinux-8.x-x86_64"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"0", "end"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "builder-id", "transcend.qemu"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "id", "VM"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "string", "VM files in directory: output-rockylinux-8.x-x86_64-qemu"}
--> qemu: VM files in directory: output-rockylinux-8.x-x86_64-qemu
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "files-count", "1"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "file", "0", "output-rockylinux-8.x-x86_64-qemu/rockylinux-8.x-x86_64"}
2023/01/20 10:29:15 machine readable: qemu,artifact []string{"1", "end"}
2023/01/20 10:29:15 [INFO] (telemetry) Finalizing.
2023/01/20 10:29:16 waiting for all plugin processes to complete...
2023/01/20 10:29:16 /usr/bin/packer: plugin process exited
2023/01/20 10:29:16 /usr/bin/packer: plugin process exited
make: [Makefile:46: rockylinux-8.x-x86_64/base] Error 1 (ignored)
** Success **

bgruening commented 1 year ago

@sj213 can you have a look at this?

@amizeranschi which version of Rocky do you need? The latest version of VGCN here is built with Rocky 8.6, afaik.

amizeranschi commented 1 year ago

@bgruening thanks for the reply. The reason I tried to build my own image is that I intend to customize it for my needs. More precisely, I want to add SLURM to it, either alongside HTCondor (preferably), or replacing it, if the two won't be able to coexist.

I found this Ansible role for Slurm mentioned in the Galaxy tutorials and I want to try to include it in VGCN. Any advice on whether this is a good idea and how I could go about it would be much appreciated.

bgruening commented 1 year ago

That is a very nice idea. They can coexist and I think we should get this into our images as well. Thanks for working on this. I hope @sj213 can help us here.

sj213 commented 1 year ago

It's been years since I last dealt with Packer, and even back then I never really dug deep into it, so I'm afraid my guesses are not as informed as you would like. Basically, I have no idea what could be going wrong. Error Nr. 1 is EPERM ("Operation not permitted"), but I wouldn't be particularly concerned about that one: the Makefile deliberately ignores it, so it is probably harmless and always occurs. Apart from this error, the logs indicate that everything is going according to plan.

I have no idea why the generated image won't boot in QEMU. Actually, I'm not even sure whether the generated image is supposed to be bootable outside of the OpenStack cloud. Maybe OpenStack extracts the kernel and initrd files from the image, bypassing GRUB, in order to be able to pass command-line options to the kernel; in that case the generated image would likely not have a valid boot sector in the first place. But that's just a wild guess, as I'm not familiar enough with OpenStack's inner workings. Sorry for the less-than-helpful response...

amizeranschi commented 1 year ago

Hi @sj213, thanks for your reply. I imported the resulting image in OpenStack and created a volume and then an instance from it. The resulting instance appears to have an Active status in OpenStack, but I'm unable to SSH into it and it also doesn't respond to ping, neither on its local IP, nor on a public one (floating IP) that I've assigned to the instance.

For comparison, using the public VGCN images posted here always produced usable machines that I could SSH into, in our OpenStack.

Would you or anyone else here be able to try reproducing this issue?

bgruening commented 1 year ago

Time is the limiting factor currently. Can you create a PR with your changes? Maybe we can simply build an image for you. Would that help?

mira-miracoli commented 1 year ago

Hi, I wrote down some of the changes in the docs: https://github.com/usegalaxy-eu/operations/blob/main/cloud/vgcn.md. There are also a few changes in the kickstart file for Rocky 9, and I am still trying to figure out how to solve the Ansible connection error in the second build step (rockylinux-9-x-86_64/bwcloud-...).

mira-miracoli commented 1 year ago

And regarding the file size: did you complete the second build step with all the Ansible roles? I would guess they account for a large share of the size.

mira-miracoli commented 1 year ago

This is where it currently stops working for me with Rocky 9:

==> qemu: Using SSH communicator to connect: 127.0.0.1
==> qemu: Waiting for SSH to become available...
==> qemu: Connected to SSH!
==> qemu: Provisioning with Ansible...
    qemu: Setting up proxy adapter for Ansible....
==> qemu: Executing Ansible: ansible-playbook -e packer_build_name="qemu" -e packer_builder_type=qemu -e packer_http_addr=10.0.2.2:0 --ssh-extra-args '-o IdentitiesOnly=yes' -e ansible_ssh_private_key_file=/tmp/ansible-key3056215714 -i /tmp/packer-provisioner-ansible1337522397 /home/mira/repos/vgcn/ansible-roles/setup-vgcn-bwcloud.yml
    qemu: [DEPRECATION WARNING]: "include" is deprecated, use include_tasks/import_tasks
    qemu: instead. See https://docs.ansible.com/ansible-
    qemu: core/2.14/user_guide/playbooks_reuse_includes.html for details. This feature
    qemu: will be removed in version 2.16. Deprecation warnings can be disabled by
    qemu: setting deprecation_warnings=False in ansible.cfg.
    qemu:
    qemu: PLAY [default] *****************************************************************
    qemu:
    qemu: TASK [Gathering Facts] *********************************************************
==> qemu: failed to handshake
    qemu: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Unable to negotiate with 127.0.0.1 port 33237: no matching host key type found. Their offer: ssh-rsa", "unreachable": true}
    qemu:
    qemu: PLAY RECAP *********************************************************************
    qemu: default                    : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0
    qemu:
==> qemu: Provisioning step had errors: Running the cleanup provisioner, if present...
==> qemu: Deleting output directory...
Build 'qemu' errored after 16 seconds 121 milliseconds: Error executing Ansible: Non-zero exit status: exit status 4

==> Wait completed after 16 seconds 121 milliseconds

==> Some builds didn't complete successfully and had errors:
--> qemu: Error executing Ansible: Non-zero exit status: exit status 4

==> Builds finished but no artifacts were created.
make: *** [Makefile:66: rockylinux-9.x-x86_64/vgcn-bwcloud] Error 1
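
The "no matching host key type found. Their offer: ssh-rsa" message in the log above suggests that the SSH client used by Ansible refuses the `ssh-rsa` host key offered on the Packer side, which is typical of newer OpenSSH clients. One possible workaround is to re-enable `ssh-rsa` through the provisioner's SSH arguments. The sketch below patches a template loaded as a Python dict. It assumes the Ansible provisioner accepts an `ansible_ssh_extra_args` list (please verify this against the documentation of the plugin version in use), and the function name `allow_ssh_rsa` is hypothetical.

```python
import copy

# Keep the default seen in the log ("-o IdentitiesOnly=yes") and additionally
# allow the legacy ssh-rsa algorithm for host keys and public keys.
LEGACY_RSA_ARGS = (
    "-o IdentitiesOnly=yes "
    "-o HostKeyAlgorithms=+ssh-rsa "
    "-o PubkeyAcceptedKeyTypes=+ssh-rsa"
)


def allow_ssh_rsa(template: dict) -> dict:
    """Return a copy of the template with ssh-rsa allowed for Ansible provisioners.

    Assumption: the option is called "ansible_ssh_extra_args" and takes a
    list of strings; check the plugin documentation before relying on this.
    """
    patched = copy.deepcopy(template)
    for provisioner in patched.get("provisioners", []):
        if provisioner.get("type") == "ansible":
            provisioner["ansible_ssh_extra_args"] = [LEGACY_RSA_ARGS]
    return patched
```

Re-enabling `ssh-rsa` weakens the connection's security settings, so this is probably only acceptable here because the connection goes to 127.0.0.1 during the build.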