volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Volcano on KubeEdge #3519

Open LiShuang-codes opened 5 months ago

LiShuang-codes commented 5 months ago

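Normally, Volcano is deployed directly on Kubernetes. When KubeEdge is deployed and nodes join as edge nodes, Volcano does not work properly, because pods on different edge nodes cannot communicate with each other directly. I am trying to modify Volcano so that it can be deployed in a KubeEdge environment. The steps are as follows:

  1. Done: creating the SSH tunnel on the cloud side. When the config file is generated, a port number is assigned to each worker, so every worker gets its own port.
  2. To do: automatically creating the SSH tunnel after the worker starts. This step has run into a problem. The script is as follows: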
    
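    #!/usr/bin/expect -f

    # Never time out while waiting for a prompt
    set timeout -1

    # Read the command-line arguments
    set host [lindex $argv 0]
    set port [lindex $argv 1]
    set password [lindex $argv 2]

    # The brackets are escaped so that Tcl does not treat [::] as a command substitution
    spawn ssh -vvv -TfNn -R "\[::\]:$port:localhost:22" $host

    # Answer the host-key confirmation (if it appears) and the password prompt
    expect {
        "Are you sure you want to continue connecting" {
            send "yes\r"
            expect "password:"
            send "$password\r"
        }
        "password:" {
            send "$password\r"
        }
    }

    interact

    expect eof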

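For reasons I don't understand, the script does start `ssh -vvv -TfNn -R "\[::\]:$port:localhost:22" $host`, but when I check afterwards with `ps -ef`, the process is gone. The startup log is as follows: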

spawn ssh -vvv -TfNn -R [::]:16017:localhost:22 192.168.137.195

OpenSSH_8.4p1 Debian-5+deb11u3, OpenSSL 1.1.1w 11 Sep 2023

debug1: Reading configuration data /root/.ssh/config

debug1: Reading configuration data /etc/ssh/ssh_config

debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files

debug1: /etc/ssh/ssh_config line 21: Applying options for *

debug2: resolve_canonicalize: hostname 192.168.137.195 is address

debug2: ssh_connect_direct

debug1: Connecting to 192.168.137.195 [192.168.137.195] port 22.

debug1: Connection established.

debug1: identity file /root/.ssh/id_rsa type 0

debug1: identity file /root/.ssh/id_rsa-cert type -1

debug1: identity file /root/.ssh/id_dsa type -1

debug1: identity file /root/.ssh/id_dsa-cert type -1

debug1: identity file /root/.ssh/id_ecdsa type -1

debug1: identity file /root/.ssh/id_ecdsa-cert type -1

debug1: identity file /root/.ssh/id_ecdsa_sk type -1

debug1: identity file /root/.ssh/id_ecdsa_sk-cert type -1

debug1: identity file /root/.ssh/id_ed25519 type -1

debug1: identity file /root/.ssh/id_ed25519-cert type -1

debug1: identity file /root/.ssh/id_ed25519_sk type -1

debug1: identity file /root/.ssh/id_ed25519_sk-cert type -1

debug1: identity file /root/.ssh/id_xmss type -1

debug1: identity file /root/.ssh/id_xmss-cert type -1

debug1: Local version string SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u3

debug1: Remote protocol version 2.0, remote software version OpenSSH_8.9p1 Ubuntu-3ubuntu0.7

debug1: match: OpenSSH_8.9p1 Ubuntu-3ubuntu0.7 pat OpenSSH* compat 0x04000000

debug2: fd 3 setting O_NONBLOCK

debug1: Authenticating to 192.168.137.195:22 as 'root'

debug3: hostkeys_foreach: reading file "/dev/null"

debug3: send packet: type 20

debug1: SSH2_MSG_KEXINIT sent

debug3: receive packet: type 20

debug1: SSH2_MSG_KEXINIT received

debug2: local client KEXINIT proposal

debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,ext-info-c,kex-strict-c-v00@openssh.com

debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,sk-ecdsa-sha2-nistp256-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,sk-ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,ssh-ed25519,sk-ssh-ed25519@openssh.com,rsa-sha2-512,rsa-sha2-256,ssh-rsa

debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com

debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com

debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1

debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1

debug2: compression ctos: none,zlib@openssh.com,zlib

debug2: compression stoc: none,zlib@openssh.com,zlib

debug2: languages ctos:

debug2: languages stoc:

debug2: first_kex_follows 0

debug2: reserved 0

debug2: peer server KEXINIT proposal

debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,sntrup761x25519-sha512@openssh.com,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,kex-strict-s-v00@openssh.com

debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519

debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com

debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com

debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1

debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1

debug2: compression ctos: none,zlib@openssh.com

debug2: compression stoc: none,zlib@openssh.com

debug2: languages ctos:

debug2: languages stoc:

debug2: first_kex_follows 0

debug2: reserved 0

debug3: kex_choose_conf: will use strict KEX ordering

debug1: kex: algorithm: curve25519-sha256

debug1: kex: host key algorithm: ecdsa-sha2-nistp256

debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: compression: none

debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: compression: none

debug3: send packet: type 30

debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

debug3: receive packet: type 31

debug1: Server host key: ecdsa-sha2-nistp256 SHA256:pxza6xCtXtJHnUqEyaUmNsiV9OwFn/MNXA9KQ9txQu4

debug3: hostkeys_foreach: reading file "/dev/null"

Warning: Permanently added '192.168.137.195' (ECDSA) to the list of known hosts.

debug3: send packet: type 21

debug1: ssh_packet_send2_wrapped: resetting send seqnr 3

debug2: set_newkeys: mode 1

debug1: rekey out after 134217728 blocks

debug1: SSH2_MSG_NEWKEYS sent

debug1: expecting SSH2_MSG_NEWKEYS

debug3: receive packet: type 21

debug1: ssh_packet_read_poll2: resetting read seqnr 3

debug1: SSH2_MSG_NEWKEYS received

debug2: set_newkeys: mode 0

debug1: rekey in after 134217728 blocks

debug1: Will attempt key: /root/.ssh/id_rsa RSA SHA256:yVIVDq3NDmL/UCuXbcuBMUyWT3MR8uSNO0cPbhGGsr8

debug1: Will attempt key: /root/.ssh/id_dsa

debug1: Will attempt key: /root/.ssh/id_ecdsa

debug1: Will attempt key: /root/.ssh/id_ecdsa_sk

debug1: Will attempt key: /root/.ssh/id_ed25519

debug1: Will attempt key: /root/.ssh/id_ed25519_sk

debug1: Will attempt key: /root/.ssh/id_xmss

debug2: pubkey_prepare: done

debug3: send packet: type 5

debug3: receive packet: type 7

debug1: SSH2_MSG_EXT_INFO received

debug1: kex_input_ext_info: server-sig-algs=ssh-ed25519,sk-ssh-ed25519@openssh.com,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,webauthn-sk-ecdsa-sha2-nistp256@openssh.com

debug1: kex_input_ext_info: publickey-hostbound@openssh.com (unrecognised)

debug3: receive packet: type 6

debug2: service_accept: ssh-userauth

debug1: SSH2_MSG_SERVICE_ACCEPT received

debug3: send packet: type 50

debug3: receive packet: type 51

debug1: Authentications that can continue: publickey,password

debug3: start over, passed a different list publickey,password

debug3: preferred gssapi-with-mic,publickey,keyboard-interactive,password

debug3: authmethod_lookup publickey

debug3: remaining preferred: keyboard-interactive,password

debug3: authmethod_is_enabled publickey

debug1: Next authentication method: publickey

debug1: Offering public key: /root/.ssh/id_rsa RSA SHA256:yVIVDq3NDmL/UCuXbcuBMUyWT3MR8uSNO0cPbhGGsr8

debug3: send packet: type 50

debug2: we sent a publickey packet, wait for reply

debug3: receive packet: type 51

debug1: Authentications that can continue: publickey,password

debug1: Trying private key: /root/.ssh/id_dsa

debug3: no such identity: /root/.ssh/id_dsa: No such file or directory

debug1: Trying private key: /root/.ssh/id_ecdsa

debug3: no such identity: /root/.ssh/id_ecdsa: No such file or directory

debug1: Trying private key: /root/.ssh/id_ecdsa_sk

debug3: no such identity: /root/.ssh/id_ecdsa_sk: No such file or directory

debug1: Trying private key: /root/.ssh/id_ed25519

debug3: no such identity: /root/.ssh/id_ed25519: No such file or directory

debug1: Trying private key: /root/.ssh/id_ed25519_sk

debug3: no such identity: /root/.ssh/id_ed25519_sk: No such file or directory

debug1: Trying private key: /root/.ssh/id_xmss

debug3: no such identity: /root/.ssh/id_xmss: No such file or directory

debug2: we did not send a packet, disable method

debug3: authmethod_lookup password

debug3: remaining preferred: ,password

debug3: authmethod_is_enabled password

debug1: Next authentication method: password

root@192.168.137.195's password:

debug3: send packet: type 50

debug2: we sent a password packet, wait for reply

debug3: receive packet: type 52

debug1: Authentication succeeded (password).

Authenticated to 192.168.137.195 ([192.168.137.195]:22).

debug1: Remote connections from :::16017 forwarded to local address localhost:22

debug3: send packet: type 80

debug2: fd 3 setting TCP_NODELAY

debug3: ssh_packet_set_tos: set IP_TOS 0x10

debug1: Requesting no-more-sessions@openssh.com

debug3: send packet: type 80

debug1: forking to background

On the cloud side, I checked the log with `journalctl -xeu ssh`:

6月 13 11:56:54 master sshd[13100]: Accepted password for root from 192.168.137.233 port 60926 ssh2

6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)

6月 13 11:56:54 master sshd[13100]: pam_unix(sshd:session): session closed for user root

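As the log shows, the session is closed immediately after it is opened. I don't know why.
If I enter the worker container manually and run `ssh -vvv -TfNn -R [::]:16016:localhost:22 192.168.137.195`, it succeeds. I won't paste the details here, since the output is almost the same as above.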
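
The thread does not establish why the tunnel dies when it is started from expect. One way to take expect and its pty out of the picture is to authenticate with a key and keep `ssh` in the foreground under a small retry loop instead of using `-f`. The sketch below assumes the worker's public key has already been added to `authorized_keys` on the cloud host (the log above shows the key being offered and rejected, so that is not yet the case). The function names `forward_spec` and `keep_tunnel` and the 5-second retry delay are illustrative and not part of Volcano or KubeEdge.

```shell
#!/bin/sh
# Sketch only: assumes key-based login to the cloud host already works.

# Build the -R forwarding spec: cloud-side port -> this container's sshd on port 22.
forward_spec() {
    printf '[::]:%s:localhost:22' "$1"
}

# Run the tunnel in the foreground and restart it whenever ssh exits.
keep_tunnel() {
    host="$1"; port="$2"
    while :; do
        # BatchMode disables password prompts, so a missing key fails fast
        # instead of hanging; ExitOnForwardFailure makes ssh exit if the
        # remote port cannot be bound.
        ssh -N -T \
            -o BatchMode=yes \
            -o ExitOnForwardFailure=yes \
            -o ServerAliveInterval=30 \
            -R "$(forward_spec "$port")" "$host"
        sleep 5
    done
}

# Example (not executed here): keep_tunnel 192.168.137.195 16017 &
```

Because `ssh` stays in the foreground, whatever starts the worker container can supervise the loop directly, and an exit status or error message from `ssh` remains visible in the container log.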
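  3. *Problem*: launching a multi-process MPI job from the master.
I ran the reverse port forwarding manually in the workers so that the master can SSH to the workers without a password. I even ran reverse port forwarding on the master as well, so the workers can also SSH to the master without a password (although this is not necessary).
However, launching a job with MPI reports an error. The full details are as follows: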

- First, check the config to confirm that the host information is available
```shell
# cat ~/.ssh/config
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
Host lm-mpi-job-mpimaster-0
  HostName 192.168.137.195
  Port 16016
Host lm-mpi-job-mpiworker-0
  HostName 192.168.137.195
  Port 16017
Host lm-mpi-job-mpiworker-1
  HostName 192.168.137.195
  Port 16018
```

    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    Last login: Thu Jun 13 04:29:07 2024 from 127.0.0.1
    root@lm-mpi-job-mpiworker-1:~# logout
    Connection to 192.168.137.195 closed.
    root@lm-mpi-job-mpimaster-0:/home# ssh lm-mpi-job-mpiworker-0
    Warning: Permanently added '[192.168.137.195]:16017' (ECDSA) to the list of known hosts.
    Linux lm-mpi-job-mpiworker-0 5.4.18-53-generic #42-KYLINOS SMP Fri Mar 4 06:09:02 UTC 2022 aarch64

    The programs included with the Debian GNU/Linux system are free software;
    the exact distribution terms for each program are described in the
    individual files in /usr/share/doc/*/copyright.

    Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
    permitted by applicable law.
    Last login: Thu Jun 13 04:31:19 2024 from 127.0.0.1
    root@lm-mpi-job-mpiworker-0:~#
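
The `~/.ssh/config` shown above follows a regular pattern: every pod name maps to the same cloud-side address and to its own port. A sketch of how such a file could be generated is given below, assuming ports are handed out sequentially from a base port. `gen_ssh_config` is a hypothetical helper, not the code the author added to Volcano.

```shell
#!/bin/sh
# Sketch only: emit one Host stanza per pod, counting ports up from a base port.
# Arguments: <cloud-ip> <base-port> <pod-name>...
gen_ssh_config() {
    cloud_ip="$1"; port="$2"; shift 2
    printf 'StrictHostKeyChecking no\nUserKnownHostsFile /dev/null\n'
    for name in "$@"; do
        printf 'Host %s\n  HostName %s\n  Port %s\n' "$name" "$cloud_ip" "$port"
        port=$((port + 1))
    done
}

# Reproduces the layout of the config in the post.
gen_ssh_config 192.168.137.195 16016 \
    lm-mpi-job-mpimaster-0 lm-mpi-job-mpiworker-0 lm-mpi-job-mpiworker-1
```

Deriving the port from the pod's position in the list keeps the cloud-side tunnel port and the client-side `Port` entry in agreement without any extra bookkeeping.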

 - Launching the MPI distributed job fails
```shell
root@lm-mpi-job-mpimaster-0:/home#  mpiexec --allow-run-as-root --host ${MPI_HOST} -np 2 trap
Warning: Permanently added '[192.168.137.195]:16017' (ECDSA) to the list of known hosts.
Warning: Permanently added '[192.168.137.195]:16018' (ECDSA) to the list of known hosts.
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    lm-mpi-job-mpiworker-0
  Remote host:   lm-mpi-job-mpimaster-0
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   lm-mpi-job-mpimaster-0
  target node:  lm-mpi-job-mpiworker-1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
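--------------------------------------------------------------------------
```

The same error occurs even when the job is launched on only one node. Running locally, however, works without any problem:

    root@lm-mpi-job-mpimaster-0:/home# mpiexec --allow-run-as-root -np 2 trap
    With n = 1024 trapezoids, our estimate
    of the integral from 0.000000 to 3.000000 = 9.000004291534424e+00

I hope someone more experienced can point me in the right direction!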
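
A possible reading of the errors above, not confirmed in this thread: the reverse tunnels forward only port 22. That is enough for `mpiexec` to start the remote daemons over SSH, but the error text itself lists "an inability to create a connection back to mpirun" as a cause, and such a connection would not pass through the tunnels. The first error block, where `lm-mpi-job-mpiworker-0` cannot complete a TCP connection to `lm-mpi-job-mpimaster-0`, is consistent with that. A small probe like the one below, run inside a worker, can show whether a given TCP port on the master is reachable at all. `tcp_reachable` is a hypothetical helper; it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command, and the address and port in the example are placeholders.

```shell
#!/bin/sh
# Sketch only: succeed if a TCP connection to host:port opens within 3 seconds.
tcp_reachable() {
    # Inside bash -c, $0 is the host and $1 is the port passed after the script.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example with placeholder values; replace them with the master's address
# and a port that the MPI launcher is actually listening on.
if tcp_reachable 127.0.0.1 1; then
    echo "reachable"
else
    echo "not reachable"
fi
```

If the probe fails for the master's address while SSH through the tunnel works, the problem lies in pod-to-pod reachability across edge nodes and not in the SSH setup.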

Monokaix commented 5 months ago

This doesn't seem to be a Volcano-related problem. Have you raised the question in the KubeEdge community?