mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

HPC documentation issue #370

Closed pavlis closed 1 year ago

pavlis commented 1 year ago

In the process of trying to adapt the TACC scripts to the Indiana University research cluster, I ran across a few obvious deficiencies in our documentation.

  1. The user manual section titled "Deploy MsPASS on HPC" needs some work. It has a lot of material specific to TACC that needs to be made more generic. I also found it confusing when I was trying to use it to adapt our scripts to the IU system. I can and will commit to fixing these pages when I've finished adapting the scripts to the IU cluster, but I wanted to put that out there for the record as something we need to fix.
  2. There is an ambiguity that I think may just be a documentation problem, but one I can't address because I do not know the answer. It is a fundamental one for all modern HPC systems: the question of requesting nodes versus cores. The pages on Slurm at IU note that there is an option --ntasks-per-node. The system I'm trying to configure has dual 12-core CPUs (24 cores total). What I have no idea about is how a "task", as Slurm defines it, relates to the number of Dask/Spark threads allowed per worker node. I would guess that because we use Singularity there is only one real "task" per node as far as Slurm is concerned, but that is a pure guess. I think it would be helpful to put an answer to this question here on GitHub for the record.
  3. I am a bit befuddled by the ssh tunnel setup in the TACC distributed script. I cannot make it work on our system, although that section of the shell script is a perfect example of one of my favorite sayings about much of the modern IT world: it is a true incantation. What I need here for the record is a statement of why the tunnel is necessary, how to tell whether my local cluster needs it, and what exactly is needed to set it up. Pretty much anyone but a local HPC guru would almost certainly be equally confused by our current example, which has only a minimal comment explaining what it is aiming to do.

So, this issue needs a box or two of explanation. The follow-up is all documentation changes and probably improved comments in the example scripts.

wangyinz commented 1 year ago

Yeah, I can see it will be difficult to run on different HPC systems because the job script is highly dependent on how each HPC system is configured. It is difficult to write a universal script.

With that said, I can answer your questions 2 and 3 here, and we definitely should improve the documentation accordingly.

  1. The word "task" in the context of Slurm really means an MPI task. It is therefore almost completely irrelevant to MsPASS, which is not an MPI program. However, on some HPC systems Slurm is configured to allocate resources based on the number of tasks requested in the sbatch command, which is the case for Big Red 3 (I am not sure about Big Red 200 but would guess so). This Slurm configuration is called node sharing: if you don't request all 24 cores (by setting the task number to 24), Slurm will run your job on a node that shares the leftover cores with other jobs. Since we do want the full node for Dask/Spark to run in parallel, you probably want to set --ntasks-per-node to 24 here (see the sketch after this list). Note that this option is meaningless on TACC's systems because there is no node sharing there.

  2. The SSH tunnel is used to access the port opened by Jupyter on the compute node from the login node. The user reaches Jupyter from an IP address external to the HPC cluster, and the firewall policy is usually set up to block such external access for security reasons. That's why it is always a good idea to use an SSH tunnel through the login node as a hop to access the Jupyter Notebook on the compute node (a sketch also follows below).
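To make the node-sharing point concrete, here is a minimal sketch of an sbatch preamble, assuming a node-sharing Slurm configuration like Big Red 3. The job name, node count, wall time, and partition are placeholders; the key directive is --ntasks-per-node=24, which claims all 24 cores so Slurm does not co-schedule other jobs on the node:

```bash
#!/bin/bash
#SBATCH --job-name=mspass        # placeholder job name
#SBATCH --nodes=2                # placeholder node count
#SBATCH --ntasks-per-node=24     # claim all 24 cores so no other job shares the node
#SBATCH --time=02:00:00          # placeholder wall time
#SBATCH --partition=general      # placeholder partition name

# MsPASS is not an MPI program, so the task count above does not launch
# 24 processes; it only tells a node-sharing Slurm configuration to give
# us whole nodes. Dask/Spark parallelism is configured separately inside
# the container.
```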
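And here is a minimal sketch of the tunnel itself, with hypothetical hostnames and the default Jupyter port; it is run from the user's workstation, not from the cluster:

```bash
# Hypothetical names: login.example.edu is the cluster login node and
# c123 is the compute node where the job script started Jupyter on port 8888.
# -L binds local port 8888 and relays it, via the login node, to c123:8888,
# so the connection crosses the firewall as ordinary SSH traffic.
ssh -L 8888:c123:8888 username@login.example.edu

# Then point a local browser at http://localhost:8888
```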

pavlis commented 1 year ago

Very helpful.

First, on the "task" issue: that is clearly something important we need to add to our documentation. What you say there could not be inferred from any documentation I've gone through; how Slurm relates to Dask and Spark is not at all clear. Good to preserve this.

On the SSH tunnel point, given what you said, I think IU's "RED" system, which uses ThinLinc to run a window manager on the head node, makes the SSH tunnel unnecessary. I already used that functionality to get the single-node version running. I am pretty sure the script will still need to echo something to tell the user the hostname of the node running Jupyter; a sketch follows below. We also need to make sure we clarify the confusion I had about SSH: specifically, that the tunnel is needed only for remote connections to the container running the Jupyter notebook. A corollary, I think, is that if you are running a notebook in batch mode, the tunnel is not needed either. Is that true?
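Something like this one-liner in the job script would do it (a sketch; the exact wording of the message is arbitrary):

```bash
# Report which compute node is running the Jupyter server so the user
# knows where to point a browser or an SSH tunnel.
echo "Jupyter notebook running on node: $(hostname)"
```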

When I get this working on the IU system, I'll update this issue with respect to the need for the tunnel in that context. I think, however, that the IU RED system is fairly novel and not something one is likely to find on other HPC clusters.

wangyinz commented 1 year ago

> A corollary, I think, is that if you are running a notebook in batch mode, the tunnel is not needed either. Is that true?

Yes!

pavlis commented 1 year ago

I think this issue is solved by pull request 385, which should be merged in the next week. I'm closing this issue.

pavlis commented 1 year ago

See previous comment