wwarriner opened this issue 5 years ago
I managed to modify the `extension.js` file in the following way:

1. Ctrl+F for `"bash"`.
2. Change the string literal `"bash"` to `"bash -c \"MY_COMMAND bash\""`.
I've confirmed that this correctly starts the VS Code Remote SSH server on a compute node. Now I am running into a port-forwarding issue, possibly related to issue #92. Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.
Thanks for the hard work on this so far! This extension has extraordinary potential. Being able to run and modify a Jupyter notebook remotely on our cluster, while using intellisense and gitlens, AND conda environment detection and dynamic swapping, all in a single application for FREE is incredible.
> Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.
Do you mean that port forwarding for ssh is disabled on that server? Or are you able to forward some other port over an ssh connection to that server?
Port forwarding for SSH is not disabled on any part of our cluster. I am not intentionally attempting to forward any other ports to the server. I was using `remote.SSH.enableDynamicForwarding` and `remote.SSH.useLocalServer`. Your questions gave me the idea to disable those options. I can't determine whether that has helped, because my earlier assertion was incorrect: I can't get the server to run on a compute node.
To address that issue, and to clarify our workflow some: we are using Slurm. It is highly preferred to have tasks running within a job context so that login node resources aren't being consumed. To do that, we create a job using `srun` (or one of its siblings) with appropriate resource request parameters. Any commands we want to run are provided as the final argument to `srun`. All calls to `srun` must have a command, apparently because it uses `execve()` to invoke the commands; if no command is passed, `srun` fails with an error message. With that in mind, setting up the VS Code server on the remote would have to be funneled through a call to `srun`. Any other method of invocation (such as `bash -c`) will result in commands being run outside the job context, and thus on the login node. Naively modifying the `bash` invocation does not work, apparently because `srun` never receives any arguments. It isn't clear to me how the server installer gets invoked and set up, so I can't offer any suggestions.
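For concreteness, a minimal sketch of the kind of wrapped invocation I have in mind (the resource flags and `MY_COMMAND` are placeholders, not a working recipe):

```bash
# Everything after the resource flags runs inside the job allocation on a
# compute node; srun with no trailing command fails with an error.
srun --cpus-per-task=1 --mem=4G --time=2:00:00 bash -c "MY_COMMAND bash"
```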
As a side note, it is also possible to provide the argument `--pty bash` to `srun` to get a terminal within the job context on a node allocated for that job. Looking at #1671, specifically here, it seems like it should be possible to adjust the invocation of `bash -ilc` to do additional things (found by Ctrl+F). I've tried testing this, but it doesn't look like that code is called at any point that I can tell, using `echo` for debugging.
What code do you mean by "that code"? I don't think the issue you point to is related.
We run the installer script essentially like `echo <installer script here> | ssh hostname bash`. There is an old feature request to be able to run a custom script before running the installer. I am not sure whether that would help you here: is there a way with Slurm to run a command, then have the rest of the same script run in a job context? It sounds more like you need a way to wrap the full installer script in a custom command, like `srun "<installer script here>"`. Is that right?
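To make the distinction concrete, a hypothetical sketch of the two pipelines (the `srun` flags are illustrative assumptions, and `<installer script here>` stands in for the real script):

```bash
# What happens today: the installer runs on the login node.
echo "<installer script here>" | ssh hostname bash

# What the request amounts to: funnel the same script through the scheduler
# so it runs, and stays, inside a job context on a compute node.
echo "<installer script here>" | ssh hostname "srun --mem=4G --time=2:00:00 bash"
```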
Yes to your last question, ideally with the ability to customize the wrapping command.
This would be an important feature for vscode-remote. I am currently trying to use VS Code to run some interactive Python code on a shared cluster, and the only way of doing it is by using the `srun` command of Slurm. I'll try to find a workaround, but I think there really is a use case for this feature request.
I've got the same issue, but using LSF instead of Slurm. As @roblourens points out here: https://github.com/microsoft/vscode-remote-release/issues/1829#issuecomment-553525298, just running the install script and starting the server only solves half the problem. Once the server is started, I surmise that VS Code will still try SSHing directly into the desired (login-restricted) machine to discover which port the VS Code remote server picked, as well as to start new terminals that show up in the GUI.

Basically, the only way this can work is if all subprocesses for servers and user terminals are strictly forked children of the original seed shell acquired from LSF/Slurm/whatever job manager you are using. A hacky workaround may be to use something like Paramiko to start a mini SSH server from the seed shell and then log in to this mini server directly from VS Code (assuming there isn't a firewall blocking you, but obviously reverse SSH tunnels can be used to get around that).
Another possible resolution to this issue is by enabling a direct connection to the remote server. That is, the user would:

- Launch vscode-server on a remote (possibly login-restricted) host.
- Enter the remote server address and port in VS Code, and connect to it.

That way, no ssh is required at all and it can work on login-restricted hosts.
A slight variant on this: I would like to be able to get the target address for SSH from a script (think cat'ing a file that is semi-frequently updated with the address of a dynamic resource). Currently I am using a `ProxyCommand` configured in my ssh config, but that has the disadvantage of requiring a second process.
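For illustration, a minimal sketch of the file-based variant (the file path and username are hypothetical):

```bash
# Read the current address of the dynamic resource from a file that some
# out-of-band process keeps up to date, then connect directly to it,
# avoiding the extra ProxyCommand process.
target=$(cat ~/cluster_node_address.txt)
ssh "user@${target}"
```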
I want to be able to connect to our institution's cluster using VS Code Remote SSH with the server running on a compute node instead of the login node. The preferred workflow is to SSH into the login node, use a command to allocate a job and spin up an interactive shell on a compute node, and then run any further tasks from there. VS Code Remote SSH doesn't appear to have a feature that facilitates this workflow. I want to be able to inject the spin-up command immediately after SSH'ing into the cluster, but before the VS Code server is set up on the cluster, and before any other tasks are run.
@wwarriner Is the issue you are referring to the same one as in this Stack Overflow question?
It sounds like we are having a similar problem: when I spin up an interactive job and try to run my debugger, I can't, because it goes back to the head node and tries to run things there.
The problem is more serious than I thought. Not only can I not run the debugger in the interactive session, I can't even "Run Without Debugging" without it switching to the Python Debug Console on its own. So that means I have to run things manually with `python main.py`, but that won't allow me to use the variable pane... which is a big loss! (I was already willing to lose the breakpoint privilege by using pdb, which I wasn't a super big fan of, but OK, fine while things get fixed...)
What I am doing is switching my terminal to the `conoder_ssh_to_job` one and then clicking the button Run Without Debugging (or `^F5` or `Control + fn + F5`), and although I made sure to be on the interactive session at the bottom in my integrated window, it goes by itself to the Python Debugger window/pane, which is not connected to the interactive session I requested from my cluster...
Am I reading this right that currently the only way to have the language server run on a compute node rather than the head/login node is to modify `extension.js`? Or is there a different preferred solution? I'm also getting weird port conflicts when I modify `extension.js`.

(I'm also using Slurm, and the Python language server eating up 300 GB on the head node disrupts the whole department.)
I'm curious if this is on the roadmap for the near future. With my university going entirely remote for the foreseeable future, being able to use this extension to work on the cluster would be absolutely amazing.
Yes, I also want this feature a lot, with universities going remote due to COVID-19.
> Another possible resolution to this issue is by enabling a direct connection to the remote server. That is, the user would:
>
> - Launch vscode-server on a remote (possibly login-restricted) host.
> - Enter the remote server address and port in VS Code, and connect to it.
>
> That way, no ssh is required at all and it can work on login-restricted hosts.

How do you do that? Have you tried it?
No capacity to address this in the near future, but I am interested to hear how the cluster setup works for other users. If anyone is not using Slurm/`srun` as described above, please let me know what it would take to make this work for you.
I put this in my settings.json:

```json
"terminal.integrated.shellArgs.linux": [
    "-c",
    "export FAF=FEF ; exec $SHELL -l",
]
```

After that, every Linux shell will have the `FAF` env variable (what I wanted); furthermore, thanks to the `exec` command, no new process is created!

I hope this will be useful for someone :D !
I guess this is related. I would like VS Code clients (e.g., the Julia client) to have an option to start in the Slurm job I am currently in and not on the login node.

I am able to get the Julia language server by having added `ml >/dev/null 2>&1 && ml julia` to my `~/.bashrc`.

For Slurm jobs, I have to use the `ijob` command and start Julia from the command line. It would be great to at least be able to start the Julia client from the job shell as a Julia client session. One issue with that approach is that it starts Julia from the shell and not the client, so it misses out on a few features such as `vscodedisplay` for being able to display tabular data.
I tried to work on this for over a day, and I may have got a somewhat working solution, inspired by @Nosferican's idea, to run the command-line job from within the Julia client. I didn't have to add anything to my ~/.bashrc for it to work.

One caveat, though, is that, like he said, I couldn't view dataframes using the `vscodedisplay` function, nor am I able to view plots. But I suppose one hacky workaround for plots is to save them and open them up alongside in VS Code itself. The screenshot below shows how it worked.

This was using Julia, but I'm sure a similar setup could be followed through Python/R, i.e. by invoking shell command features and running srun from within Julia/Python/R, like this:

`srun -c --pty julia`

As @Nosferican said though, and as shown in my screenshot, images couldn't be displayed. Any ideas?
P.S. Before trying this out, I'd tried all sorts of ways to get around this today, e.g. by adding this to my settings.json:

```json
"terminal.integrated.shellArgs.linux": [
    "-c",
    "srun -c 6 --pty bash",
]
```

I also tried to work around it by using a tmux setup that would be running on a compute node, hoping any new Julia/Python/R instance would also be using the same instance. The tmux setup would be something like this: https://github.com/julia-vscode/julia-vscode/issues/426

But using that method, I could only get Python to execute code in the terminal; it doesn't work for its interactive Jupyter view, nor for R and Julia. I don't know enough about VS Code's integrated terminal setup to manipulate the ports either.
Another use case would be to transfer code to the server while SSHing into it: run an `rsync` command (automatically) ... on the local machine before opening the connection.
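A hypothetical sketch of what such a pre-connect hook might run (the paths and hostname are made up):

```bash
# Push the local working tree to the cluster before Remote-SSH connects,
# so the remote workspace is already up to date.
rsync -az --exclude '.git' ./myproject/ user@login.cluster.example:~/myproject/
```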
Any chance of getting this out of backlog and into a milestone, @roblourens? It would be amazing to be able to use this extension.
Similar problem: I tried to request an interactive shell on my cluster via my login node. Unfortunately this causes the VS Code extension to time out.

I'm using LSF, and I run `bsub -Is "zsh"` in my .bashrc.

I'm guessing it's a port-forwarding problem between the client and the server-hosted extension files?
Has there been any progress on this? Can we now ssh directly to an interactive session and have it work? (https://stackoverflow.com/questions/60141905/how-to-run-code-in-a-debugging-session-from-vs-code-on-a-remote-using-an-interac)
@roblourens There seem to be about 37 non-bugs in the backlog milestone. Could you give a rough estimate of how high this issue ranks in terms of priorities? For example: next release, the one after, or end of the year?
> Has there been any progress on this? Can we now ssh directly to an interactive session and have it work? (https://stackoverflow.com/questions/60141905/how-to-run-code-in-a-debugging-session-from-vs-code-on-a-remote-using-an-interac)
Update: There's a huge problem with this approach. Please see the discussion below by @Nosferican.
I confirm the answer in the Stack Overflow question works for me! Thank you and the author of the answer!

Although I found there needs to be a bit of modification from the original answer: mainly, I think we need to add `username@` before the login server name (sorry, I'm not able to comment there since my Stack Overflow account is new).
A recap of the procedure:

1. Request an interactive job (`salloc` for Slurm) and get the computing node assigned.
2. Connect with `ssh -J username@your.login.server username@nodeXXX`.

The `-J` option sets the "ProxyJump" option; in the `~/.ssh/config` file, it will look like:
```
Host MyCluster
    HostName nodeXXX
    ProxyJump username@your.login.server
    User username
```
A reminder: the key is to set the `~/.ssh/config` correctly; be aware of the jump node's name. And remember to change the `nodeXXX` name every time to the computing node assigned.
P.S. It originally didn't work on one of my clusters somehow, but after I used an SSH key file and specified the `IdentityFile` in the `~/.ssh/config`, the problem was solved. So, I suggest using an SSH key and setting the `~/.ssh/config` as:
```
Host MyCluster
    HostName nodeXXX
    ProxyJump username@your.login.server
    User username
    IdentityFile ~/.ssh/my_key
```

This saves you from entering your password twice every time, anyway.
I tried the solution. I am able to start VS Code on the computing node, but it returns a shell on the computing node that is not in the Slurm job. Is there a way to have the VS Code shell / language servers step into the job?
> I tried the solution. I am able to start VS Code on the computing node, but it returns a shell on the computing node that is not in the Slurm job. Is there a way to have the VS Code shell / language servers step into the job?
Sorry, I'm not sure what you mean. Did you try to open the Explorer in VSCode and work on some code scripts? I think the language server will step in automatically when you work on a certain code script.
Aye. The solution works in the sense that I can connect to the compute nodes, but I am not inside the Slurm job, so I don't have access to the resources allocated for it. I can start coding and the language server steps in, but I am now consuming resources on that node that are not tracked by the cluster job manager through Slurm.
The solution isn't super practical for me, as the nodes get allocated with arbitrary names. Ideally this proxy jumping would be automated.
> Aye. The solution works in the sense that I can connect to the compute nodes, but I am not inside the Slurm job, so I don't have access to the resources allocated for it. I can start coding and the language server steps in, but I am now consuming resources on that node that are not tracked by the cluster job manager through Slurm.
OMG you are right! Forgive me, I didn't even know the job wasn't being done inside the Slurm allocation! (Though I had noticed the abnormally low usage when I used `seff` to check the job. Now I feel guilty about my illegitimate use.)

Could you please tell me how to check whether the job is running inside the Slurm allocation?
The easiest way to check would be to look for the Slurm environment variables that get set up by default: https://slurm.schedmd.com/srun.html#lbAI. For example, if inside the Slurm job, it should have an environment variable `SLURM_JOB_ID` (it will have a bunch of others as well).
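For example, a quick check one could run from a VS Code integrated terminal (plain shell, nothing VS Code specific):

```bash
# Prints the job id if this shell is inside a Slurm allocation,
# or the fallback text if it is not.
echo "${SLURM_JOB_ID:-not inside a Slurm job}"

# List the rest of the Slurm variables, if any.
printenv | grep '^SLURM_'
```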
Now I'm confused. I tested inside VS Code (inside Python, for example). It turns out I can get the `SLURM_JOB_ID` env variable. Moreover, when I run my job in parallel, I get at most exactly the number of CPUs I applied for. So, maybe it is working as expected?
I am not sure how it would work based on the pipeline. I can potentially have different Slurm jobs on the same compute node. If I SSH to that node, how would it know which job to pick for executing my code?
> Now I'm confused. I tested inside VS Code (inside Python, for example). It turns out I can get the `SLURM_JOB_ID` env variable. Moreover, when I run my job in parallel, I get at most exactly the number of CPUs I applied for. So, maybe it is working as expected?

Do you happen to know whether pam_slurm_adopt is configured on your cluster?
Currently, an alternative is to run the VS Code OSS version on the cluster (it only has access to open-source extensions and not the Marketplace, due to not being a Microsoft build). One should be able to do the same with self-hosted Codespaces.
> > Now I'm confused. I tested inside VS Code (inside Python, for example). It turns out I can get the `SLURM_JOB_ID` env variable. Moreover, when I run my job in parallel, I get at most exactly the number of CPUs I applied for. So, maybe it is working as expected?
>
> Do you happen to know whether pam_slurm_adopt is configured on your cluster?

Sorry, I don't know. I only know I'm not able to ssh into nodes where I do not have a running job.
So it's probably configured on your cluster. In that case, there is no problem sshing into a Slurm job (via VS Code or otherwise). The problem begins when such a configuration doesn't exist (for example with LSF or another scheduler).
@Nosferican We've started working with VS Code OSS. It's a shame that the extension list is smaller due to the Marketplace limitation. I don't see Marketplace access changing; perhaps MS will make their remote dev server FOSS at some point.
I'd like to build on the recent discussions about SSHing into nodes. We have pam_slurm_adopt configured on our cluster. The challenge for us is that each node has a name based on its numerical acquisition order; which node to SSH into isn't known before the job is created and can't be selected by the user. Currently the workflow would have to be: (1) create a job; (2) get the node number; (3) send that information back to local VS Code; (4) use the node as the SSH target for remote dev; (5) start remote dev. Certainly this could be managed manually each time the user wishes to connect VS Code to the cluster, but that is clunky, inflexible, and error prone. I haven't been able to figure out how to automate the process in my free time; perhaps someone with more skill could figure this out?
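For what it's worth, a rough sketch of what steps (1) through (4) might look like scripted from the local machine (the hostname, resource flags, and the `MyCluster` ssh-config alias are all assumptions):

```bash
#!/usr/bin/env bash
# (1) Create a placeholder job on the cluster and capture its id.
jobid=$(ssh login.cluster.example "sbatch --parsable --time=4:00:00 --wrap 'sleep infinity'")

# (2) Get the node assigned to the job (empty until it starts running; poll if needed).
node=$(ssh login.cluster.example "squeue -j $jobid -h -o %N")

# (3)+(4) Point an existing 'MyCluster' entry in ~/.ssh/config at that node
# (assumes the MyCluster block ends at a blank line), then connect
# VS Code Remote-SSH to the 'MyCluster' host.
sed -i "/^Host MyCluster$/,/^$/ s/^\([[:space:]]*HostName\).*/\1 $node/" ~/.ssh/config
echo "Job $jobid is on $node; connect Remote-SSH to 'MyCluster'."
```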
I wanted to run the VS Code Python debugger in an interactive Slurm job on a remote server. I tried to edit the extension.js file, but it didn't work for me. Here is my workaround:

1. Create a script named `bash` somewhere, for example `/home/myuser/pathto/bash`, and make it executable with `chmod +x bash`.
2. Put `salloc [your desired options for the interactive job]` in the `bash` file (a sketch follows after this comment).
3. Open the VS Code settings.json, add `"terminal.integrated.automationShell.linux": "/home/myuser/pathto/bash"`, and save it (use the absolute path; for example, `~/pathto/bash` didn't work for me).

Now every time you run the debugger it will first ask for the interactive job, and the debugger will run on it. But take into consideration that this is also applied to tasks you run in tasks.json.
You can also use `srun` instead of `salloc`, for example `srun --pty -t 2:00:00 --mem=8G -p interactive bash`.
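To make step 2 above concrete, a sketch of what the wrapper `bash` file might contain (the resource options are examples only):

```bash
#!/bin/bash
# Every "automation shell" VS Code spawns (debugger, tasks) now first
# requests an interactive Slurm allocation and runs inside it.
salloc -t 2:00:00 --mem=8G --cpus-per-task=2
```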
@asalimih's approach works, but is there a way to automatically `scancel` the job when finished, or to reclaim an existing one (probably more complicated)?
> @asalimih's approach works, but is there a way to automatically `scancel` the job when finished, or to reclaim an existing one (probably more complicated)?
This was for interactive debugging. When you run the debugger it will open a terminal, so whenever you're finished debugging, you can close the terminal and the job will be automatically canceled. In order to connect to and disconnect from an interactive job without it being canceled, I guess you can use tmux like here. Here is my suggestion, though I haven't tried it:
```bash
tmux new -s debug_session
salloc [your desired options for the interactive job]
# ...then, from the VS Code terminal:
tmux attach -t debug_session
```
Now if you start the debugger, it will run inside that interactive job inside the tmux session (I guess :) ), and closing the terminal won't stop the job.
Since today, the debugger is unusable. :-(

This is because calling the debugger does not wait for a finished `source activate; conda activate xyz`, so the script runs with the (wrong) "base" environment. What could have changed that? Any ideas how to solve it?

```
salloc: Granted job allocation 1953
(base) bash-4.4$ /usr/bin/env /.../user/miniconda3/envs/plot/bin/python /...user/.vscode-server/extensions/ms-python.python-2020.12.422005962/pythonFiles/lib/python/debugpy/launcher 43457 -- /...user/osse_analysis/main.py
source /...user/miniconda3/bin/activate
conda activate plot
-> fails
```
I also noticed that VS Code now reports `Your service running on port 12345 is available.`. I am using @asalimih's strategy.

Edit: Do I not have to call `srun --jobid=1234 /bin/bash` after `salloc` to grab the allocation and run it on the compute node?
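For context, what I mean is the standard Slurm pattern of stepping into an existing allocation (the job id is just the example from the log above):

```bash
salloc -t 2:00:00 --mem=8G         # grants, e.g., job allocation 1953 on the login node
srun --jobid=1953 --pty /bin/bash  # start an interactive shell inside that allocation,
                                   # on the compute node
```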
Where is the `extension.js` file located? I can't find it.
Has this ever been resolved? It would be really nice to use Jupyter notebooks on an interactive node, as opposed to having to use SageMaker or create another EC2 instance specifically for notebooks.
Hi,
It would be really nice to have a feature to run the Remote SSH plugin in a batch job. On our cluster with more than 3000 active users, we regularly have to temporarily ban users because the Remote SSH plugin of VS Code starts a large number of threads on the login nodes. Login nodes have only limited resources and are shared among many cluster users, so starting that many threads is not acceptable. I could not find any option in VS Code to restrict the number of threads it starts to one or two.
The HPC cluster is located more than 200 km away from the university. I tried to make an installation of VSCode directly on the cluster such that users can run it through X11, but because of the large distance and VSCode being very resource hungry, this is not an option (the GUI is more or less unresponsive). We tried modifying the extension.js file as described in one of the first comments here, but this results in the same issues as described above (port-forwarding).
Using an SSH tunnel between a local computer, the login node and a compute node is not really an option due to the lack of security on the VSCode side. Every other user on the same host that is listening to the same port can potentially access the information exchanged between the remote ssh server and the VSCode client.
Jupyter developers have solved this issue by having a security token that needs to be entered for connecting to a remote Jupyter notebook session and this can easily be scripted. We provide a script to our users that they can run on their local computer that connects to the cluster, starts jupyter in a batch job and finally starts the browser on the local computer of the user and connects it through an SSH tunnel with the jupyter notebook running in a batch job (https://gitlab.ethz.ch/sfux/Jupyter-on-Euler-or-Leonhard-Open).
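For comparison, a hypothetical condensed sketch of that Jupyter workflow, run from the user's local computer (hostnames, ports, and the batch-job details are placeholders; the real script linked above does much more):

```bash
# Start Jupyter in a batch job on the cluster.
ssh login.cluster.example "sbatch --wrap 'jupyter notebook --no-browser --port=8888'"

# Once the job is running on some nodeXXX, tunnel the port through the login node...
ssh -N -L 8888:nodeXXX:8888 login.cluster.example &

# ...and open the notebook locally, authenticating with the token from the job log.
xdg-open "http://localhost:8888/?token=<token-from-job-log>"
```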
I would like to set up something similar for VS Code, but so far we could not get beyond starting the Remote SSH server in a batch job. This is a pity, because there seems to be demand for using VS Code on HPC clusters. If we cannot find a way to integrate this smoothly with the batch system, then the only solution for us would be to recommend that users use a different IDE to develop and debug their code, one which can be integrated with the batch system or which can run through X11 without immediately becoming unresponsive.
Best regards
Hi @samfux84, as a cluster user I agree it would be a great thing to have! 🙂
In the meantime, I've been experimenting with a couple of options.
1. Each user could configure their own ssh config file as follows (each entry also has the options `User`, `IdentityFile`, `ServerAliveInterval` and `ServerAliveCountMax`, omitted here for brevity):
```
Host entrypoint
    HostName <entrypoint.whatever.com>

Host loginhost
    HostName <login.cluster.com>
    ProxyCommand ssh -q -W %h:%p entrypoint

Host node-????
    HostName %h
    ProxyCommand ssh -q -W %h:%p loginhost
```
Then one should submit a job, take note of the assigned `node-wxyz`, and connect to it in the host list proposed by VS Code. But maybe this raises security concerns? I have not really looked into that. As a downside, this option would then require additional trickery and tampering in order to properly load cluster modules before VS Code itself starts on the compute node (e.g., we use `lmod`).
2. I've also been looking into `code-server`, a fork of VS Code that is meant to run in a browser. I've put together a small ugly script that starts `code-server` on a remote compute node and then asks the user to enter a given command in order to set up a port forward between the compute node and the user's computer; if that's of any interest, I can share it here. Note that `code-server` can be configured to accept a password or an OTP and can also be configured to communicate over HTTPS, so I guess it's a safer option. The fact that it runs in the browser avoids all the X11 forwarding stuff, which is great; the only real downside is that it doesn't have the official VS Code Extension store, but a sort of clone with some missing extensions (most notably the remote extensions, but I guess one can live without that since it's already running on a remote host). One can enforce the official store via two environment variables, although that's in a grey licensing area, so maybe it's best to leave it to the user. Currently, the workflow with `code-server` looks like this:
1. Load the `lmod` modules that you want to use in `code-server` (in this respect, this solution is better than 1.), and start the wrapper script.
2. The wrapper script reads the `code-server` log file and shows it to the user; it also asks the user to run an ssh command on the local computer in order to set up a port forward. If one has the ssh config file set up like in 1., the command looks like `ssh -fNL 8444:node-wxyz:8000 node-wxyz` if e.g. `code-server` is listening on port `8000` of the compute node.
3. Point your browser to `localhost:8444` and you have a full VS Code environment.
and you have a full VS Code environment.I would love to be able to support this but am honestly still not sure what is required to make it work. It seems that getting the name of the node to connect to, connecting to it, and loading modules into it, etc is a manual multi-step process. Is it even possible for vscode to run this whole flow in a single automated step?
Can you share the script you wrote for code-server? That might help me understand, and I don't follow why auth is the hangup for vscode.
@roblourens here you go: https://github.com/matteosecli/code-server-wrapper 🙂 It really isn't anything fancy or game-changing though, just a helper.

As for automation, it's definitely possible to run it in an automated way via a job submission script; I've just tested that. The software modules can be loaded either in the helper itself, by just adding a line to it, or in the job submission script.

It's true that setting up the ssh forwarding is a bit of a pain in an HPC setting; however, if one sets up an ssh config file like the one I've sketched above, it becomes much, much easier, and a simple command like `ssh -fNL 8444:node-wxyz:8000 node-wxyz` is enough.
As for Microsoft's VS Code itself, instead, once one sets up a config file like the one above, connecting to a compute node becomes basically trivial once one has been allocated resources on it. This needs to be done manually, but I think it's not difficult to set up a script that asks for a compute node, obtains the name of the assigned compute node, and then connects VS Code to it; maybe an extension could be made for that. The real downside in this setup, for me, is that it's not possible to easily modify the environment VS Code runs in by loading additional software modules. I've tested this scenario as well, and currently this is what I do as a workaround:

1. Connect to the login node first, so that VS Code installs its remote components in `~/.vscode-server`; since the home folder is shared among all the nodes, once this is set up it also becomes visible to the compute nodes.
2. Edit `~/.vscode-server/bin/<youridentifier>/server.sh` by loading additional modules (compilers, software libs, etc.), changing env variables, etc. right before the line that runs `node` (see the sketch after this comment). I also change the interpreter to bash, because I prefer to work with that, but that's really just personal taste.

This works, but it's not an ideal workflow. Additionally, in order to change the environment (to test maybe a new compiler or add another module), one has to go back and edit VS Code's server script. I also guess (not sure, though) that when the remote components of VS Code are updated, `~/.vscode-server/bin/<youridentifier>/server.sh` gets overwritten and one has to repeat step 2.
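A hypothetical illustration of the step-2 edit (the module names and variable are made up; the exact contents of `server.sh` vary by VS Code version):

```bash
# --- additions near the top of ~/.vscode-server/bin/<youridentifier>/server.sh ---
# Load the software stack the VS Code server and its extensions should see.
module load gcc/11.2 python/3.9   # example lmod modules
export MY_PROJECT_ROOT="$HOME/project"
# ...the original line that launches node follows below, unchanged.
```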
So once the user does the manual steps of getting a node and setting up their SSH config in the way you suggest, the only missing piece is:

> The real downside in this setup, for me, is that it's not possible to easily modify the environment VS Code runs in by loading additional software modules

Is this something that the user can do before vscode connects, or does it have to be done in the same script that the vscode server starts in? Would it work if you had the ability to pass a script for Remote-SSH to run before it runs server.sh? I am not sure about the security issues with that option, however.