mila-iqia / milatools

Tools to connect to and interact with the Mila cluster
MIT License

Mila code doesn't work with --alloc --nodes>1 #22

Closed lebrice closed 2 years ago

lebrice commented 2 years ago
$ mila code /network/scratch/n/normandf/imagenet_template --alloc --cpus-per-task=4 --gres=gpu:1 --mem=16G --nodes 2
(local) $ ssh mila -fNMS /home/fabrice/.ssh/sockets/milatools.mila
(mila) $ salloc --cpus-per-task=4 --gres=gpu:1 --mem=16G --nodes 2
# Control socket connect(/home/fabrice/.ssh/sockets/milatools.mila): Connection refused
# salloc: --------------------------------------------------------------------------------------------------
# salloc: # Using default long partition
# salloc: --------------------------------------------------------------------------------------------------
# salloc: Pending job allocation 2149062
# salloc: job 2149062 queued and waiting for resources
# salloc: job 2149062 has been allocated resources
# salloc: Granted job allocation 2149062
# salloc: Waiting for resource configuration
# salloc: Nodes cn-c[007,035] are ready for job
(local) $ code --remote 'ssh-remote+cn-c[007,035].server.mila.quebec' /network/scratch/n/normandf/imagenet_template

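VS Code isn't able to connect to the host. This is probably due to how the node name is retrieved inside the mila code command: with --nodes 2, salloc reports the compressed node list cn-c[007,035], and that whole string is passed to code --remote as if it were a single hostname.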

Here is some of the output from the VS Code "Remote - SSH" log window:

[10:46:47.518] ------

[10:46:47.518] SSH Resolver called for "ssh-remote+cn-c[007,035].server.mila.quebec", attempt 5, (Reconnection)
[10:46:47.519] SSH Resolver called for host: cn-c[007,035].server.mila.quebec
[10:46:47.519] Setting up SSH remote "cn-c[007,035].server.mila.quebec"
[10:46:47.520] Acquiring local install lock: /tmp/vscode-remote-ssh-855227e0-install.lock
[10:46:47.520] Looking for existing server data file at /home/fabrice/.config/Code/User/globalStorage/ms-vscode-remote.remote-ssh/vscode-ssh-host-855227e0-6d9b74a70ca9c7733b29f0456fd8195364076dda-0.84.0/data.json
[10:46:47.520] Using commit id "6d9b74a70ca9c7733b29f0456fd8195364076dda" and quality "stable" for server
[10:46:47.522] Install and start server if needed
[10:46:47.527] askpass server listening on /run/user/1001/vscode-ssh-askpass-e7e86db49903dbff72edcece1901eb8c786021c3.sock
[10:46:47.528] Spawning local server with {"serverId":5,"ipcHandlePath":"/run/user/1001/vscode-ssh-askpass-bace1ed1c6ac07b4dc3b702a1e5f6f1d66ccc9e5.sock","sshCommand":"ssh","sshArgs":["-v","-T","-D","36461","-o","ConnectTimeout=15","cn-c[007,035].server.mila.quebec"],"serverDataFolderName":".vscode-server","dataFilePath":"/home/fabrice/.config/Code/User/globalStorage/ms-vscode-remote.remote-ssh/vscode-ssh-host-855227e0-6d9b74a70ca9c7733b29f0456fd8195364076dda-0.84.0/data.json"}
[10:46:47.528] Local server env: {"SSH_AUTH_SOCK":"/run/user/1001/keyring/ssh","SHELL":"/bin/bash","DISPLAY":":1","ELECTRON_RUN_AS_NODE":"1","SSH_ASKPASS":"/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/local-server/askpass.sh","VSCODE_SSH_ASKPASS_NODE":"/usr/share/code/code","VSCODE_SSH_ASKPASS_EXTRA_ARGS":"--ms-enable-electron-run-as-node","VSCODE_SSH_ASKPASS_MAIN":"/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/askpass-main.js","VSCODE_SSH_ASKPASS_HANDLE":"/run/user/1001/vscode-ssh-askpass-e7e86db49903dbff72edcece1901eb8c786021c3.sock"}
[10:46:47.533] Spawned 6736
[10:46:47.630] > local-server-5> Spawned ssh, pid=6744
[10:46:47.634] stderr> OpenSSH_8.2p1 Ubuntu-4ubuntu0.5, OpenSSL 1.1.1f  31 Mar 2020
[10:46:47.635] stderr> Bad stdio forwarding specification '[cn-c[007,035].server.mila.quebec]:22'
[10:46:47.635] stderr> kex_exchange_identification: Connection closed by remote host
[10:46:47.635] > local-server-5> ssh child died, shutting down
[10:46:47.641] Local server exit: 0
[10:46:47.642] Received install output: local-server-5> Spawned ssh, pid=6744
OpenSSH_8.2p1 Ubuntu-4ubuntu0.5, OpenSSL 1.1.1f  31 Mar 2020
Bad stdio forwarding specification '[cn-c[007,035].server.mila.quebec]:22'
kex_exchange_identification: Connection closed by remote host
local-server-5> ssh child died, shutting down

[10:46:47.642] Failed to parse remote port from server output
[10:46:47.643] Resolver error: Error: 
    at Function.Create (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:585222)
    at Object.t.handleInstallOutput (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:583874)
    at Object.e [as tryInstallWithLocalServer] (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:624373)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async /home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:643506
    at async Object.t.withShowDetailsEvent (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:647224)
    at async /home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:622845
    at async T (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:619351)
    at async Object.t.resolveWithLocalServer (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:622460)
    at async Object.t.resolve (/home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:644834)
    at async /home/fabrice/.vscode/extensions/ms-vscode-remote.remote-ssh-0.84.0/out/extension.js:1:727082
[10:46:47.644] ------

breuleux commented 2 years ago

Hah, I see. Shouldn't be difficult to fix if we make mila code connect to the first node in the list.
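
A minimal sketch of what "connect to the first node in the list" could look like, assuming the node list arrives as the compressed string printed by salloc (for example cn-c[007,035]). The function name first_node is hypothetical and this is not the actual milatools code. It only handles a single bracket group at the end of a name, so a more robust fix might instead ask Slurm to expand the list with `scontrol show hostnames` on the cluster.

```python
import re


def first_node(nodelist: str) -> str:
    """Return the first hostname of a Slurm-style compressed node list.

    Hypothetical helper for illustration only. It handles plain names
    ("cn-c007"), bracket groups ("cn-c[007,035]", "cn-c[007-009,035]") and
    comma-separated lists ("cn-a001,cn-c[007,035]"), but not names with
    several bracket groups or a suffix after the bracket.
    """
    # A prefix with no brackets or commas, optionally followed by "[...]".
    match = re.match(r"^([^\[\],]+)(?:\[([^\]]+)\])?", nodelist.strip())
    if match is None:
        raise ValueError(f"Cannot parse node list: {nodelist!r}")
    prefix, group = match.groups()
    if group is None:
        # Single node, or the first entry of a plain comma-separated list.
        return prefix
    # Take the first item of the group, then the start of its range if any.
    first_item = group.split(",")[0]
    first_index = first_item.split("-")[0]
    return prefix + first_index


# Node list taken from the salloc output in this issue.
node = first_node("cn-c[007,035]")
print(f"ssh-remote+{node}.server.mila.quebec")
```

With the allocation shown in this issue, the remote target would become ssh-remote+cn-c007.server.mila.quebec instead of the bracketed string that ssh rejects.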