golemfactory / golem-gpu-live


Cannot bind low ports when running in vm-nvidia #6

Closed: stan7123 closed this issue 11 months ago

stan7123 commented 1 year ago

When using the vm-nvidia runtime I cannot bind low ports such as 22 or 80. whoami reports the user as root, but commands that start sshd (port 22) or nginx (port 80) do not work. nginx returns: nginx: [emerg] bind() to 0.0.0.0:80 failed (13: Permission denied).

The same code works with the standard vm runtime.

Starting apps on higher ports like 8000 works correctly.
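
A plausible explanation (not confirmed in this thread) is that the runtime's sandbox drops the CAP_NET_BIND_SERVICE capability, which is required to bind ports below 1024 even when running as root. One way to check from inside the container, assuming the libcap tools (capsh) are present in the image:

grep CapEff /proc/self/status
capsh --decode=<CapEff value printed above>

If cap_net_bind_service is absent from the decoded list, any bind() to a port below 1024 fails with EACCES (Permission denied), matching the nginx error above, while high ports such as 8000 keep working.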

stan7123 commented 11 months ago

Were you able to reproduce/fix this problem?

marmarek commented 11 months ago

This should be fixed in the new image, can you confirm?

stan7123 commented 11 months ago

I tested the sshd service and it starts properly on different ports, but when trying to connect I get an error in the logs on the runtime side:

Nov 30 16:47:58 (none) sshd[425]: Server listening on 0.0.0.0 port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on :: port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on 0.0.0.0 port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on :: port 8888.
Nov 30 16:48:18 (none) sshd[430]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
Nov 30 16:48:18 (none) sshd[430]: fatal: chroot("/run/sshd"): Function not implemented [preauth]

or

Nov 30 17:10:50 (none) sshd[423]: Server listening on 0.0.0.0 port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on :: port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on 0.0.0.0 port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on :: port 22.
Nov 30 17:10:58 (none) sshd[424]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
Nov 30 17:10:58 (none) sshd[424]: fatal: chroot("/run/sshd"): Function not implemented [preauth]

While on the client:

dev@user-Precision-3561:~$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -p 2222 -vvv root@0.0.0.0
OpenSSH_8.9p1 Ubuntu-3ubuntu0.4, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /home/dev/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 0.0.0.0 is address
debug3: ssh_connect_direct: entering
debug1: Connecting to 0.0.0.0 [0.0.0.0] port 2222.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: Connection established.
debug1: identity file /home/dev/.ssh/id_rsa type 0
debug1: identity file /home/dev/.ssh/id_rsa-cert type -1
debug1: identity file /home/dev/.ssh/id_ecdsa type -1
debug1: identity file /home/dev/.ssh/id_ecdsa-cert type -1
debug1: identity file /home/dev/.ssh/id_ecdsa_sk type -1
debug1: identity file /home/dev/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /home/dev/.ssh/id_ed25519 type -1
debug1: identity file /home/dev/.ssh/id_ed25519-cert type -1
debug1: identity file /home/dev/.ssh/id_ed25519_sk type -1
debug1: identity file /home/dev/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /home/dev/.ssh/id_xmss type -1
debug1: identity file /home/dev/.ssh/id_xmss-cert type -1
debug1: identity file /home/dev/.ssh/id_dsa type -1
debug1: identity file /home/dev/.ssh/id_dsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.4
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.9
debug1: compat_banner: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.9 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 0.0.0.0:2222 as 'root'
debug3: put_host_port: [0.0.0.0]:2222
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug3: order_hostkeyalgs: no algorithms matched; accept original
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
Connection closed by 127.0.0.1 port 2222

I've tried setting UsePrivilegeSeparation no in the sshd config but nothing changed.
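
For context, this failure is independent of privilege-separation settings: sshd's pre-auth sandbox calls chroot("/run/sshd"), and "Function not implemented" (ENOSYS) means the chroot syscall itself is not available in the guest. A quick check from inside the runtime, assuming the coreutils chroot tool is present in the image:

chroot / /bin/true && echo "chroot works" || echo "chroot unavailable"

If that also reports "Function not implemented", no sshd_config option will help (privilege separation cannot be disabled in OpenSSH 7.5 and later); the syscall has to be provided by the runtime.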

The Dockerfile for the runtime is based on: https://github.com/golemfactory/gpu-on-golem-poc/blob/main/rent_gpu/provider/ssh/Dockerfile (and the base image is from: https://github.com/norbibi/golem_cuda/blob/master/docker_golem_cuda_base/Dockerfile)

stan7123 commented 11 months ago

Nginx starts properly on port 80.

stan7123 commented 11 months ago

Another finding that might be related is an error when trying to change the MTU on the provider:

Run /sbin/ifconfig ('eth1', 'mtu', '1450', 'up')' failed on provider; message: 'ExeScript command exited with code 255'; stderr: 'SIOCSIFMTU: Operation not permitted
SIOCSIFFLAGS: Operation not permitted

This command worked on the previous runtime.

norbibi commented 11 months ago

I have the same problem with the new image: I'm unable to connect via ssh, regardless of the port or runtime (vm or vm-nvidia).

marmarek commented 11 months ago

There is an update in the repository fixing this already. You can install it with apt-get update && apt-get upgrade (or wait a bit for the automatic update).
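
For completeness, the full sequence on the provider might look like the following, stopping the golemsp service first (as done later in this thread) so the upgrade does not touch files in use; whether golemsp is managed as a systemd unit is an assumption here:

systemctl stop golemsp   # assumption: golemsp runs as a systemd unit on the live image
apt-get update
apt-get upgrade
systemctl start golemsp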

marmarek commented 11 months ago

As for the MTU issue, I see that init already sets the MTU. Container processes do not have the CAP_NET_ADMIN capability (as part of the sandboxing), so they cannot change it.
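
This can be confirmed from inside the runtime: reading the interface configuration needs no special capability, while changing it requires CAP_NET_ADMIN, which the sandbox withholds, so the SIOCSIFMTU / SIOCSIFFLAGS errors above are expected rather than a bug:

/sbin/ifconfig eth1                # read-only query, works without CAP_NET_ADMIN
/sbin/ifconfig eth1 mtu 1450 up    # requires CAP_NET_ADMIN, fails with "Operation not permitted"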

norbibi commented 11 months ago

> There is an update in the repository fixing this already. You can install it with apt-get update && apt-get upgrade (or wait a bit for the automatic update).

There is a regression (logs attached) with the vm-nvidia runtime after the upgrade (which may not have completed correctly, since there is not much space on the system partition), apparently in the init binary. The 'Offer for preset: vm-nvidia' is not complete. The ssh connection on the vm runtime, on the other hand, is fine.

log_ko_after_upgrade.txt log_ok_before_upgrade.txt

marmarek commented 11 months ago

How large is the system partition on the version you have? We did bump it to 5.5GB at one point exactly to allow updates to be installed, but maybe you have an earlier build? Can you reinstall the package that failed during the update, using apt-get --reinstall install <package> (I guess either ya-runtime-vm-nvidia or golem-nvidia-kernel)? Do apt-get clean first, just to be sure leftovers from the previous update are removed.

If that doesn't help, can you post the ya-runtime log of that failed offer-template run?
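
Put together, the suggested recovery sequence would be (using golem-nvidia-kernel as the example package; ya-runtime-vm-nvidia is the other candidate named above):

apt-get clean
apt-get update
apt-get --reinstall install golem-nvidia-kernel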

norbibi commented 11 months ago

> How large is the system partition on the version you have? We did bump it to 5.5GB at one point exactly to allow updates to be installed, but maybe you have an earlier build? Can you reinstall the package that failed during the update, using apt-get --reinstall install <package> (I guess either ya-runtime-vm-nvidia or golem-nvidia-kernel)? Do apt-get clean first, just to be sure leftovers from the previous update are removed.
>
> If that doesn't help, can you post the ya-runtime log of that failed offer-template run?

I had several GVMIs (of several GB) in the cache, but I managed to upgrade after a little cleaning. Here is the log file: ya-runtime-vm-nvidia_2023-12-12_06-10-08.log

marmarek commented 11 months ago

That's kernel 6.1.62; it should be 6.1.66 with updates installed.

norbibi commented 11 months ago

> That's kernel 6.1.62; it should be 6.1.66 with updates installed.

The vm-nvidia offer is OK again, but ssh is still not working (on both vm and vm-nvidia). Attached are the apt logs: log_apt.txt

marmarek commented 11 months ago

So, now both vm and vm-nvidia have broken ssh, while ssh already worked on vm before? The versions in the log look correct, I think, but maybe --reinstall wasn't enough to clean up the failed update (--reinstall is a rather hacky approach and could mess up the installation order). I'm not sure why "vm" stopped working - it should be completely independent of both the ya-runtime-vm-nvidia and golem-nvidia-kernel packages. Maybe try apt-get update && apt-get dist-upgrade one more time, to see if there are any remaining updates. If there aren't, reinstall golem-nvidia-kernel one more time (but not ya-runtime-vm-nvidia; golem-nvidia-kernel specifies that it should be installed after ya-runtime-vm-nvidia, but maybe the reinstall violated this ordering).

If neither helps, I'd recommend going back to a fresh USB image and installing updates again. If you want to preserve the data partition and you use an external one, just copy /home/golem/.golemwz.conf from the current system and save it to the matching location after writing the stick (on the "Golem root filesystem" partition). Possibly also set up SSH keys in /home/golem/.ssh/authorized_keys, since you won't get the wizard to set up the password. If you use the internal data partition (on the USB stick), you can try copying just the root filesystem partition, but that is fragile and I'd recommend simply re-writing the whole USB stick anyway (backing up the wallet first, if you care about it).
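
A sketch of the data-preservation steps described above, assuming the new stick is prepared on a separate machine; the device name /dev/sdX2, the <provider> address, the golem login, and the id_ed25519.pub key file are all placeholders:

# before re-writing: save the wizard config from the running provider
scp golem@<provider>:/home/golem/.golemwz.conf ./golemwz.conf.bak

# after writing the image: mount the "Golem root filesystem" partition and restore
mount /dev/sdX2 /mnt
cp ./golemwz.conf.bak /mnt/home/golem/.golemwz.conf
mkdir -p /mnt/home/golem/.ssh
cat ~/.ssh/id_ed25519.pub >> /mnt/home/golem/.ssh/authorized_keys
umount /mnt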

norbibi commented 11 months ago

> So, now both vm and vm-nvidia have broken ssh, while ssh already worked on vm before? The versions in the log look correct, I think, but maybe --reinstall wasn't enough to clean up the failed update (--reinstall is a rather hacky approach and could mess up the installation order). I'm not sure why "vm" stopped working - it should be completely independent of both the ya-runtime-vm-nvidia and golem-nvidia-kernel packages. Maybe try apt-get update && apt-get dist-upgrade one more time, to see if there are any remaining updates. If there aren't, reinstall golem-nvidia-kernel one more time (but not ya-runtime-vm-nvidia; golem-nvidia-kernel specifies that it should be installed after ya-runtime-vm-nvidia, but maybe the reinstall violated this ordering).
>
> If neither helps, I'd recommend going back to a fresh USB image and installing updates again. If you want to preserve the data partition and you use an external one, just copy /home/golem/.golemwz.conf from the current system and save it to the matching location after writing the stick (on the "Golem root filesystem" partition). Possibly also set up SSH keys in /home/golem/.ssh/authorized_keys, since you won't get the wizard to set up the password. If you use the internal data partition (on the USB stick), you can try copying just the root filesystem partition, but that is fragile and I'd recommend simply re-writing the whole USB stick anyway (backing up the wallet first, if you care about it).

Hi,

I started from a fresh USB image and did an upgrade from the start (log in over ssh, stop the golemsp service, apt-get update && apt-get upgrade). The upgrade failed on nvidia-files.squashfs and self-test.gvmi due to lack of space. After deleting these files first and restarting the upgrade, it is OK. SSH is also OK on vm and vm-nvidia (no error on the nvidia offer).
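
For anyone hitting the same out-of-space failure, it is worth checking what occupies the root filesystem before retrying the upgrade (in this thread the cached GVMIs were the culprit):

df -h /                                        # how much space is left
du -xah / 2>/dev/null | sort -rh | head -20    # largest files on the root filesystem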

Thank you for the support and sorry for the inconvenience.

stan7123 commented 11 months ago

SSH works correctly now.