debug3: no such identity: /root/.ssh/id_ed25519_sk: No such file or directory
debug1: Trying private key: /root/.ssh/id_xmss
debug3: no such identity: /root/.ssh/id_xmss: No such file or directory
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password
root@192.168.137.195's password:
debug3: send packet: type 50
debug2: we sent a password packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (password).
Authenticated to 192.168.137.195 ([192.168.137.195]:22).
debug1: Remote connections from :::16017 forwarded to local address localhost:22
debug3: send packet: type 80
debug2: fd 3 setting TCP_NODELAY
debug3: ssh_packet_set_tos: set IP_TOS 0x10
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: forking to background
在云端,使用`journalctl -xeu ssh`查看日志:
6月 13 11:56:54 master sshd[13100]: Accepted password for root from 192.168.137.233 port 60926 ssh2
6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session closed for user root
可以看到这个session创建后立刻就断开了。不知道是为什么。
如果手动进入worker容器中执行`ssh -vvv -TfNn -R [::]:16016:localhost:22 192.168.137.195`,是可以成功的。详细信息不在贴出,因为和上面差不多
3. *有问题*master MPI多任务启动
手动在worker中执行反向端口转发,确保master可以和worker免密ssh通信。甚至我在master上也执行了反向端口转发,woker也能免密ssh到master(虽然这个没有必要)。
但是MPI启动任务就会报错。完整的信息如下:
- 首先,查看config信息,确保主机信息可以拿到
```shell
# cat ~/.ssh/config
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
Host lm-mpi-job-mpimaster-0
HostName 192.168.137.195
Port 16016
Host lm-mpi-job-mpiworker-0
HostName 192.168.137.195
Port 16017
Host lm-mpi-job-mpiworker-1
HostName 192.168.137.195
Port 16018
root@lm-mpi-job-mpimaster-0:/home# ssh lm-mpi-job-mpiworker-1
Warning: Permanently added '[192.168.137.195]:16018' (ECDSA) to the list of known hosts.
Linux lm-mpi-job-mpiworker-1 4.4.154-116-rockchip-g86a614bc15b3 #1 SMP Mon Jan 10 12:03:08 UTC 2022 aarch64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Jun 13 04:29:07 2024 from 127.0.0.1
root@lm-mpi-job-mpiworker-1:~#
logout
Connection to 192.168.137.195 closed.
root@lm-mpi-job-mpimaster-0:/home# ssh lm-mpi-job-mpiworker-0
Warning: Permanently added '[192.168.137.195]:16017' (ECDSA) to the list of known hosts.
Linux lm-mpi-job-mpiworker-0 5.4.18-53-generic #42-KYLINOS SMP Fri Mar 4 06:09:02 UTC 2022 aarch64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Jun 13 04:31:19 2024 from 127.0.0.1
root@lm-mpi-job-mpiworker-0:~#
- 启动MPI分布式任务,失败了
```shell
root@lm-mpi-job-mpimaster-0:/home# mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 trapWarning: Permanently added '[192.168.137.195]:16017' (ECDSA) to the list of known hosts.
Warning: Permanently added '[192.168.137.195]:16018' (ECDSA) to the list of known hosts.
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
Local host: lm-mpi-job-mpiworker-0
Remote host: lm-mpi-job-mpimaster-0
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: lm-mpi-job-mpimaster-0
target node: lm-mpi-job-mpiworker-1
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
只启动一个节点也是这样。但是本地运行没有任何问题
root@lm-mpi-job-mpimaster-0:/home# mpiexec --allow-run-as-root -np 2 trapWith n = 1024 trapezoids, our estimate
of the integral from 0.000000 to 3.000000 = 9.000004291534424e+00
通常,Volcano 直接部署在Kubernetes上。对于部署了KubeEdge,节点以edge身份接入时,Volcano 并不能正常工作,因为不同边缘节点的pod无法直接通信。我现在在尝试将Volcano 加以修改以支持部署在KubeEdge环境。步骤如下:
set timeout -1
获取命令行参数
set host [lindex $argv 0] set port [lindex $argv 1] set password [lindex $argv 2]
spawn ssh -vvv -TfNn -R "[::]:$port:localhost:22" $host
expect { "Are you sure you want to continue connecting" { send "yes\r" expect "password:" send "$password\r" } "password:" { send "$password\r" }
}
interact
expect eof
spawn ssh -vvv -TfNn -R [::]:16017:localhost:22 192.168.137.195
OpenSSH_8.4p1 Debian-5+deb11u3, OpenSSL 1.1.1w 11 Sep 2023
debug1: Reading configuration data /root/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 192.168.137.195 is address
debug2: ssh_connect_direct
debug1: Connecting to 192.168.137.195 [192.168.137.195] port 22.
debug1: Connection established.
debug1: identity file /root/.ssh/id_rsa type 0
debug1: identity file /root/.ssh/id_rsa-cert type -1
debug1: identity file /root/.ssh/id_dsa type -1
debug1: identity file /root/.ssh/id_dsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa type -1
debug1: identity file /root/.ssh/id_ecdsa-cert type -1
debug1: identity file /root/.ssh/id_ecdsa_sk type -1
debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1
debug1: identity file /root/.ssh/id_ed25519 type -1
debug1: identity file /root/.ssh/id_ed25519-cert type -1
debug1: identity file /root/.ssh/id_ed25519_sk type -1
debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1
debug1: identity file /root/.ssh/id_xmss type -1
debug1: identity file /root/.ssh/id_xmss-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u3
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.9p1 Ubuntu-3ubuntu0.7
debug1: match: OpenSSH_8.9p1 Ubuntu-3ubuntu0.7 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 192.168.137.195:22 as 'root'
debug3: hostkeys_foreach: reading file "/dev/null"
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,ext-info-c,kex-strict-c-v00@openssh.com
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,ssh-ed25519,sk-ssh-ed25519@openssh.com,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,sntrup761x25519-sha512@openssh.com,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,kex-strict-s-v00@openssh.com
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos:
debug2: languages stoc:
debug2: first_kex_follows 0
debug2: reserved 0
debug3: kex_choose_conf: will use strict KEX ordering
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug3: receive packet: type 31
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:pxza6xCtXtJHnUqEyaUmNsiV9OwFn/MNXA9KQ9txQu4
debug3: hostkeys_foreach: reading file "/dev/null"
Warning: Permanently added '192.168.137.195' (ECDSA) to the list of known hosts.
debug3: send packet: type 21
debug1: ssh_packet_send2_wrapped: resetting send seqnr 3
debug2: set_newkeys: mode 1
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: receive packet: type 21
debug1: ssh_packet_read_poll2: resetting read seqnr 3
debug1: SSH2_MSG_NEWKEYS received
debug2: set_newkeys: mode 0
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: /root/.ssh/id_rsa RSA SHA256:yVIVDq3NDmL/UCuXbcuBMUyWT3MR8uSNO0cPbhGGsr8
debug1: Will attempt key: /root/.ssh/id_dsa
debug1: Will attempt key: /root/.ssh/id_ecdsa
debug1: Will attempt key: /root/.ssh/id_ecdsa_sk
debug1: Will attempt key: /root/.ssh/id_ed25519
debug1: Will attempt key: /root/.ssh/id_ed25519_sk
debug1: Will attempt key: /root/.ssh/id_xmss
debug2: pubkey_prepare: done
debug3: send packet: type 5
debug3: receive packet: type 7
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,webauthn-sk-ecdsa-sha2-nistp256@openssh.com
debug1: kex_input_ext_info: publickey-hostbound@openssh.com (unrecognised)
debug3: receive packet: type 6
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug3: send packet: type 50
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,password
debug3: start over, passed a different list publickey,password
debug3: preferred gssapi-with-mic,publickey,keyboard-interactive,password
debug3: authmethod_lookup publickey
debug3: remaining preferred: keyboard-interactive,password
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Offering public key: /root/.ssh/id_rsa RSA SHA256:yVIVDq3NDmL/UCuXbcuBMUyWT3MR8uSNO0cPbhGGsr8
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey,password
debug1: Trying private key: /root/.ssh/id_dsa
debug3: no such identity: /root/.ssh/id_dsa: No such file or directory
debug1: Trying private key: /root/.ssh/id_ecdsa
debug3: no such identity: /root/.ssh/id_ecdsa: No such file or directory
debug1: Trying private key: /root/.ssh/id_ecdsa_sk
debug3: no such identity: /root/.ssh/id_ecdsa_sk: No such file or directory
debug1: Trying private key: /root/.ssh/id_ed25519
debug3: no such identity: /root/.ssh/id_ed25519: No such file or directory
debug1: Trying private key: /root/.ssh/id_ed25519_sk
debug3: no such identity: /root/.ssh/id_ed25519_sk: No such file or directory
debug1: Trying private key: /root/.ssh/id_xmss
debug3: no such identity: /root/.ssh/id_xmss: No such file or directory
debug2: we did not send a packet, disable method
debug3: authmethod_lookup password
debug3: remaining preferred: ,password
debug3: authmethod_is_enabled password
debug1: Next authentication method: password
root@192.168.137.195's password:
debug3: send packet: type 50
debug2: we sent a password packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (password).
Authenticated to 192.168.137.195 ([192.168.137.195]:22).
debug1: Remote connections from :::16017 forwarded to local address localhost:22
debug3: send packet: type 80
debug2: fd 3 setting TCP_NODELAY
debug3: ssh_packet_set_tos: set IP_TOS 0x10
debug1: Requesting no-more-sessions@openssh.com
debug3: send packet: type 80
debug1: forking to background
6月 13 11:56:54 master sshd[13100]: Accepted password for root from 192.168.137.233 port 60926 ssh2 6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) 6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session closed for user root
The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. Last login: Thu Jun 13 04:29:07 2024 from 127.0.0.1 root@lm-mpi-job-mpiworker-1:~# logout Connection to 192.168.137.195 closed. root@lm-mpi-job-mpimaster-0:/home# ssh lm-mpi-job-mpiworker-0 Warning: Permanently added '[192.168.137.195]:16017' (ECDSA) to the list of known hosts. Linux lm-mpi-job-mpiworker-0 5.4.18-53-generic #42-KYLINOS SMP Fri Mar 4 06:09:02 UTC 2022 aarch64
The programs included with the Debian GNU/Linux system are free software; the exact distribution terms for each program are described in the individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law. Last login: Thu Jun 13 04:31:19 2024 from 127.0.0.1 root@lm-mpi-job-mpiworker-0:~#
只启动一个节点也是这样。但是本地运行没有任何问题
希望有大佬可以指点迷津!