Closed: stan7123 closed this issue 11 months ago.
Were you able to reproduce/fix this problem?
This should be fixed in the new image, can you confirm?
I tested the sshd service and it starts properly on different ports, but when trying to connect I get this error in the logs on the runtime side:
Nov 30 16:47:58 (none) sshd[425]: Server listening on 0.0.0.0 port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on :: port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on 0.0.0.0 port 8888.
Nov 30 16:47:58 (none) sshd[425]: Server listening on :: port 8888.
Nov 30 16:48:18 (none) sshd[430]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
Nov 30 16:48:18 (none) sshd[430]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
or
Nov 30 17:10:50 (none) sshd[423]: Server listening on 0.0.0.0 port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on :: port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on 0.0.0.0 port 22.
Nov 30 17:10:50 (none) sshd[423]: Server listening on :: port 22.
Nov 30 17:10:58 (none) sshd[424]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
Nov 30 17:10:58 (none) sshd[424]: fatal: chroot("/run/sshd"): Function not implemented [preauth]
While on the client side I see:
dev@user-Precision-3561:~$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -p 2222 -vvv root@0.0.0.0
OpenSSH_8.9p1 Ubuntu-3ubuntu0.4, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /home/dev/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 0.0.0.0 is address
debug3: ssh_connect_direct: entering
debug1: Connecting to 0.0.0.0 [0.0.0.0] port 2222.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: Connection established.
debug1: identity file /home/dev/.ssh/id_rsa type 0
debug1: identity file /home/dev/.ssh/id_rsa-cert type -1
debug1: identity file /home/dev/.ssh/id_ecdsa type -1
debug1: identity file /home/dev/.ssh/id_ecdsa-cert type -1
debug1: identity file /home/dev/.ssh/id_ecdsa_sk type -1
debug1: identity file /home/dev/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /home/dev/.ssh/id_ed25519 type -1
debug1: identity file /home/dev/.ssh/id_ed25519-cert type -1
debug1: identity file /home/dev/.ssh/id_ed25519_sk type -1
debug1: identity file /home/dev/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /home/dev/.ssh/id_xmss type -1
debug1: identity file /home/dev/.ssh/id_xmss-cert type -1
debug1: identity file /home/dev/.ssh/id_dsa type -1
debug1: identity file /home/dev/.ssh/id_dsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.9p1 Ubuntu-3ubuntu0.4
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.2p1 Ubuntu-4ubuntu0.9
debug1: compat_banner: match: OpenSSH_8.2p1 Ubuntu-4ubuntu0.9 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 0.0.0.0:2222 as 'root'
debug3: put_host_port: [0.0.0.0]:2222
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug3: order_hostkeyalgs: no algorithms matched; accept original
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
Connection closed by 127.0.0.1 port 2222
I've tried setting UsePrivilegeSeparation no in the sshd config but nothing changed.
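Roughly what I attempted (a sketch, not the exact file; Port 8888 matches the runtime log above, and PermitRootLogin is my assumption since the test logs in as root). Note that UsePrivilegeSeparation has been deprecated since OpenSSH 7.5, so the 8.2p1 server in the image most likely just ignores it, which would explain why nothing changed:
# /etc/ssh/sshd_config (excerpt, sketch of the attempted workaround)
Port 8888
PermitRootLogin yes          # assumption: needed for the root login used in the test
UsePrivilegeSeparation no    # deprecated since OpenSSH 7.5; likely ignored by 8.2p1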
The Dockerfile for the runtime is based on: https://github.com/golemfactory/gpu-on-golem-poc/blob/main/rent_gpu/provider/ssh/Dockerfile (and the base image is from: https://github.com/norbibi/golem_cuda/blob/master/docker_golem_cuda_base/Dockerfile).
Nginx starts properly on port 80.
Another finding that might be related is an error when trying to change the MTU on the provider:
Run /sbin/ifconfig ('eth1', 'mtu', '1450', 'up')' failed on provider; message: 'ExeScript command exited with code 255'; stderr: 'SIOCSIFMTU: Operation not permitted
SIOCSIFFLAGS: Operation not permitted
This command worked on the previous runtime.
I have the same problem with the new image; I'm unable to connect via SSH regardless of the port or runtime (vm or vm-nvidia).
There is an update in the repository fixing this already. You can install it just with apt-get update && apt-get upgrade (or wait a bit for the automatic update).
As for the MTU issue, I see init already sets the MTU. Container processes do not have the CAP_NET_ADMIN capability (as part of sandboxing), so they cannot change it.
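A quick way to confirm this from inside the runtime is to look at the effective capability mask; a minimal sketch (capsh comes from libcap2-bin and may not be present in the image):
$ grep CapEff /proc/1/status               # effective capabilities of PID 1, printed as a hex mask
$ capsh --decode=<hex mask from above>     # decode the mask; cap_net_admin should be missing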
There is a regression (logs attached) with the vm-nvidia runtime after the upgrade (which may not have completed correctly, since there is not much space on the system partition), apparently in the init binary. The 'Offer for preset: vm-nvidia' is not complete. On the other hand, the SSH connection is OK on the vm runtime.
How large is the system partition on the version you have? We did bump it to 5.5GB at one point exactly to allow updates to be installed, but maybe you have an earlier build?
Can you reinstall the package that failed on update, using apt-get --reinstall install <package> (I guess either ya-runtime-vm-nvidia or golem-nvidia-kernel)? Do apt-get clean first, just to be sure leftovers from the previous update are removed.
If that doesn't help, can you post the ya-runtime log of that failed offer-template run?
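In other words, something along these lines (the package name is a guess, as noted above):
$ apt-get clean                                      # drop leftovers from the failed update
$ apt-get --reinstall install golem-nvidia-kernel    # or ya-runtime-vm-nvidia, whichever failed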
I had several GVMIs (of several GB) in the cache, but I managed to upgrade after a little cleaning. Here is the log file: ya-runtime-vm-nvidia_2023-12-12_06-10-08.log
That's kernel 6.1.62, it should be 6.1.66 with updates installed.
The vm-nvidia offer is OK again, but SSH still does not work (vm & vm-nvidia). Attached are the apt logs: log_apt.txt
So, now both vm and vm-nvidia have broken ssh, while ssh already worked on vm before? The versions in the log look correct, I think, but maybe --reinstall wasn't enough to clean up the failed update (--reinstall is a rather hacky approach and could mess up the installation order). I'm not sure why "vm" stopped working - it should be completely independent of both the ya-runtime-vm-nvidia and golem-nvidia-kernel packages.
Maybe try apt-get update && apt-get dist-upgrade one more time, to see if there are any remaining updates. And if not, reinstall golem-nvidia-kernel one more time (but not ya-runtime-vm-nvidia; golem-nvidia-kernel specifies it should be installed after ya-runtime-vm-nvidia, but maybe the reinstall violated this dependency).
If neither helps, I'd recommend going back to a fresh USB image and installing the updates again.
If you want to preserve the data partition: if you use an external one, just copy /home/golem/.golemwz.conf from the current system and save it to the matching location after writing the stick (to the "Golem root filesystem" partition). And possibly set up SSH keys in /home/golem/.ssh/authorized_keys, since you won't get the wizard to set up the password.
If you use an internal data partition (on the USB stick), you can try copying just the root filesystem partition, but in that case it becomes fragile and I'd recommend simply re-writing the whole USB stick anyway (possibly backing up the wallet before, if you care about it).
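A rough sketch of that restore procedure, assuming the freshly written "Golem root filesystem" partition shows up as /dev/sdX2 (device name and mount point are placeholders, and file ownership/permissions may need adjusting afterwards):
$ mount /dev/sdX2 /mnt                                    # "Golem root filesystem" partition on the new stick
$ cp golemwz.conf.backup /mnt/home/golem/.golemwz.conf    # config saved from the current system beforehand
$ mkdir -p /mnt/home/golem/.ssh
$ cp ~/.ssh/id_ed25519.pub /mnt/home/golem/.ssh/authorized_keys   # or append your own public key
$ umount /mnt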
Hi,
I started from a fresh USB image. I did the upgrade right from the start (logging in via ssh, stopping the golemsp service, then apt-get update && apt-get upgrade). The upgrade fails on nvidia-files.squashfs and self-test.gvmi due to lack of space. After deleting these files first and restarting the upgrade, it's OK. SSH is also OK on vm and vm-nvidia (with no error on the nvidia offer).
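Roughly the sequence that worked for me (the service unit name is an assumption, and I locate the large files with find rather than spelling out their paths):
$ systemctl stop golemsp                    # stop the provider before upgrading (unit name assumed)
$ apt-get update
$ find / -xdev \( -name 'nvidia-files.squashfs' -o -name 'self-test.gvmi' \) 2>/dev/null
$ rm <paths found above>                    # free space; the upgrade writes fresh copies
$ apt-get upgrade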
Thank you for the support and sorry for the inconvenience.
SSH works correctly now.
When in the vm-nvidia runtime I cannot bind ports like 22 or 80. whoami returns the user as root, but commands starting sshd (22) or nginx (80) do not work. nginx is returning nginx: [emerg] bind() to 0.0.0.0:80 failed (13: Permission denied). The same code works in the standard vm runtime. Starting apps on higher ports like 8000 works correctly.
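Some generic checks that could help narrow this down inside the vm-nvidia runtime (nothing Golem-specific; python3 may not be available in every image):
$ sysctl net.ipv4.ip_unprivileged_port_start   # ports below this value require CAP_NET_BIND_SERVICE
$ grep CapEff /proc/self/status                # effective capability mask of the current shell
$ python3 -m http.server 80                    # minimal bind test on a privileged port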