neuroinformatics-unit / HowTo

NIU website on common software problems and their troubleshooting
http://howto.neuroinformatics.dev/
Creative Commons Attribution 4.0 International

Docs currently suggest to use VSCode in the bastion node #62

Open pierreglaser opened 3 months ago

pierreglaser commented 3 months ago

Hey, and thanks for putting this documentation together! It may be worth bookmarking the link on SWC's #computing channel to make it easily discoverable.

In the "Remote development" section of the docs, it is more or less suggested to use VSCode in the bastion node:

Then, when you click on the “Open a Remote Window” button in the bottom left corner of the VS Code window, you will see a list of the SSH hosts you have configured in your ~/.ssh/config file. You can then select the host you want to connect to - e.g. swc-gateway.

But if many users start doing this, the bastion node could run into memory errors due to many memory-hungry VSCode apps being open at the same time. Could the docs be updated to instead recommend, and explain how to set up, VSCode on a compute node? A guide to do exactly this was actually put together by Cristofer Holobetz (to find it, search "cristofer holobetz pdf" on Slack). Thanks!

niksirbi commented 3 months ago

Thanks @pierreglaser, we've also noticed the issue with our remote development guide, and we're discussing changes to that here.

We are considering updating or even removing that section, because it is indeed misleading.

Regarding Cristofer's guide, I have some reservations. It's indeed possible to ssh into a compute node and do remote development via VSCode in this way. However, this way of running jobs is not really controlled/limited via SLURM and could lead to consuming all resources in a node. At least that's what I've understood from my discussions with @adamltyson on this topic.

It would be great to come up with a remote development solution that doesn't burden the bastion node and also respects SLURM.

niksirbi commented 3 months ago

Other potential solutions for remote development:

pierreglaser commented 3 months ago

Thanks for the quick answer!

However, this way of running jobs is not really controlled/limited via SLURM and could lead to consuming all resources in a node. At least that's what I've understood from my discussions with @adamltyson on this topic.

I'm not sure about this: Cristofer's tutorial clearly states that the node in which VSCode will be started should be obtained through SLURM, using srun for instance. Did you have anything else in mind? I think that this part of the docs is very useful and I recommend keeping it, unless there are clear drawbacks (which right now I don't see once this bastion node issue is addressed).

adamltyson commented 3 months ago

Very possible that I'm wrong, but I don't understand how just by SSH-ing to a node, somehow that workload is therefore monitored by the SLURM job scheduler. It may be possible if you request the entire node, but then I'm not sure if SLURM will be able to kill the job etc.

niksirbi commented 3 months ago

I think if you start an interactive job via srun, and then ssh into it using the node name, you are indeed using the node you requested, but I don't know if the constraints on memory, cores etc are respected in that case. For example if you do srun --mem 8G and then ssh into that node, what guarantees that you won't exceed the 8G?
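One rough way to probe this question is to compare which cgroup a shell belongs to when started via srun versus via a plain ssh into the same node. The sketch below assumes that SLURM's cgroup plugin places job processes under a path containing a `job_<jobid>` component; the function name `in_slurm_job_cgroup` is made up for illustration, and the exact cgroup layout may differ between SLURM versions and sites.

```shell
#!/usr/bin/env bash
# Reads cgroup membership lines on stdin (the format of /proc/self/cgroup).
# Succeeds if any path contains a "job_<id>" component.
# Assumption: SLURM's cgroup plugin names job cgroups "job_<jobid>".
in_slurm_job_cgroup() {
  grep -Eq '/job_[0-9]+(/|$)'
}

# Run this once inside an srun shell and once inside a plain ssh session.
if [ -r /proc/self/cgroup ]; then
  if in_slurm_job_cgroup < /proc/self/cgroup; then
    echo "this shell is inside a SLURM job cgroup"
  else
    echo "this shell is NOT inside a SLURM job cgroup"
  fi
fi
```

If the ssh session reports that it is not inside a job cgroup, the limits requested with srun would not apply to processes started from it.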

Anyhow, I'll message Cristofer so he can also participate in the discussion. I think he may have asked the scientific computing team about this.

lauraporta commented 3 months ago

In case @niksirbi is right, I've found another possible solution: start a job that runs sshd and connect VSCode to it. This way, the resources used by VSCode will be effectively controlled by SLURM. I haven't tested this solution yet. Also, Open OnDemand offers code-server: VSCode accessed via the browser, running within a SLURM job. I was interested in installing it some time ago.

pierreglaser commented 3 months ago

but I don't understand how just by SSH-ing to a node, somehow that workload is therefore monitored by the SLURM job scheduler

The VSCode app has to be run on a compute node obtained through SLURM, as stated in Cristofer's tutorial. However, to forward VSCode's ports back to your local machine, you have to start an ssh process from your machine to the compute node. This ssh process won't run anything; it just allows ports to be forwarded, which would require much more complex solutions to do via SLURM.

For example if you do srun --mem 8G and then ssh into that node, what guarantees that you won't exceed the 8G?

As SLURM is currently configured on the SWC cluster, memory limits are enforced through the cgroup plugin, so yes, you are guaranteed not to exceed 8G.
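A quick, partial check of this is to look at the `TaskPlugin` entry printed by `scontrol show config`. The helper name `has_task_cgroup` below is hypothetical, and seeing `task/cgroup` there is necessary but not sufficient: memory constraints also depend on settings in the cluster's cgroup.conf (for example `ConstrainRAMSpace`), which this sketch does not inspect.

```shell
#!/usr/bin/env bash
# Reads `scontrol show config` output on stdin.
# Succeeds if the TaskPlugin line lists task/cgroup.
has_task_cgroup() {
  grep -Eq '^TaskPlugin[[:space:]]*=.*task/cgroup'
}

# Only query SLURM if scontrol is actually available on this machine.
if command -v scontrol >/dev/null 2>&1; then
  if scontrol show config | has_task_cgroup; then
    echo "task/cgroup is enabled"
  else
    echo "task/cgroup is not listed in TaskPlugin"
  fi
fi
```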

niksirbi commented 2 months ago

You may very well be right @pierreglaser, but I don't sufficiently understand the internals of SLURM, cgroup, VScode's remote ssh plugin and their interactions to be confident about it. We may have to do some tests to confirm and consult Alex and John about it. Assuming we can confirm this, I'm happy to update the VSCode instructions according to @cristofer-holobetz 's guide.

pierreglaser commented 2 months ago

Mmmh. Note that regardless of whether I'm correct, Cristofer's solution is an improvement over what is currently officially suggested (use VSCode on the login node). So not sure why we should delay moving forward with this solution.

adamltyson commented 2 months ago

I think it's important we only document things we know to be correct. It's unlikely that users will regularly consult this documentation to change their workflows.

@niksirbi for now, shall we just remove this section?

niksirbi commented 2 months ago

For now I suggest the following:

I can get this done this week if you agree.

adamltyson commented 2 months ago

Sounds good to me. Thanks!

pierreglaser commented 2 months ago

I looked more into this issue: it turns out VSCode's Remote SSH mode does not use SLURM. Cristofer's tutorial requires you to request a compute node via SLURM before connecting to that node with VSCode, which led me to think otherwise. But this step is not required, as VSCode just starts its own ssh connection.

As the link @lauraporta referenced shows, this seems to be a well-documented issue on the VSCode side, with only partial fixes existing. One option uses sshd within a SLURM-allocated compute node (the one Laura mentioned), but the SLURM environment variables are not inherited by the new connections and require additional hacks to be fully functional, so it's not ideal.

Another option is to use code-server (a program which serves VSCode as a web app) on compute nodes, and use VSCode in your local machine's browser. Unlike other alternatives, the steps to get set up are very simple:

  1. install code-server by downloading the binaries (alternatively we could ask IT to set it up globally on all nodes)
export VERSION=4.91.0
mkdir -p ~/.local/lib ~/.local/bin
curl -fL https://github.com/coder/code-server/releases/download/v$VERSION/code-server-$VERSION-linux-amd64.tar.gz \
  | tar -C ~/.local/lib -xz
mv ~/.local/lib/code-server-$VERSION-linux-amd64 ~/.local/lib/code-server-$VERSION
ln -s ~/.local/lib/code-server-$VERSION/bin/code-server ~/.local/bin/code-server
PATH="$HOME/.local/bin:$PATH"
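For illustration, here is a rough sketch of how launching code-server under SLURM and reaching it from a local browser might look. Everything here is an assumption rather than a tested recipe for the SWC cluster: the port, the `--mem 8G` request, the helper names `launch_cmd` and `tunnel_cmd`, and the use of `swc-gateway` as an SSH jump host (the alias from `~/.ssh/config` mentioned in the docs). The script only prints the two commands, since they are run on different machines.

```shell
#!/usr/bin/env bash
# Hypothetical values: adjust to your own setup.
PORT=8080             # assumption: any free port on the compute node
GATEWAY=swc-gateway   # SSH host alias from ~/.ssh/config

# To run on the cluster: start code-server inside a SLURM allocation,
# so that its resource usage is accounted for by SLURM.
launch_cmd() {
  echo "srun --mem 8G code-server --bind-addr 127.0.0.1:${PORT}"
}

# To run on your local machine: forward the port via the gateway.
# -N means no remote command is executed; the connection only forwards ports.
tunnel_cmd() {
  local node="$1"
  echo "ssh -N -J ${GATEWAY} -L ${PORT}:127.0.0.1:${PORT} ${node}"
}

launch_cmd
tunnel_cmd "${NODE:-compute-node-name}"
```

With the tunnel open, code-server would be reachable at http://localhost:8080 in the local browser. By default code-server asks for a password, which it stores in `~/.config/code-server/config.yaml`.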

The in-browser experience is pretty much feature-complete, since VSCode is run under the hood (you can install extensions, start a terminal, etc.).

code-server seems widespread, and the solution is both robust and respects SLURM. WDYT?

niksirbi commented 2 months ago

Thanks a lot for investigating this @pierreglaser!

I gave it a shot and it indeed seems to work just fine (incl. from Firefox). This is definitely an improvement on the previous guide, so I'll write up something and have it tested by a few more people.

If all seems well we can ask IT to centrally install code-server, which will make the instructions even simpler.

adamltyson commented 2 months ago

If we're asking IT to install stuff centrally, is it worth just going straight for VSCode via OOD?

niksirbi commented 2 months ago

Well, the two things are complementary. If people would like to use a VSCode app via OOD, IT would have to centrally install code-server anyway and then link it to OOD. Pierre's instructions provide a way to use code-server directly, without making it an OOD app. In a way, OOD is an abstraction layer that will make this procedure more user-friendly, and can additionally serve as an entry point to other apps like Jupyter Lab.

So installing code-server (+ the how to guide that comes with it) is a stepping stone towards full OOD functionality, not opposed to it.

adamltyson commented 2 months ago

Cool, I assumed that the existing VSCode OOD app worked some other way, so it would be duplicating effort for IT.

niksirbi commented 2 months ago

From what I can find browsing online, using code-server seems to be the most popular choice for creating a VSCode app for OOD, see https://discourse.openondemand.org/t/vscode-showcase/2256