WIP: Setup instructions for NASA Pleiades

wtbarnes commented 5 years ago

This is a work in progress: install instructions and setup scripts for getting JupyterLab and a Dask cluster up and running on NASA Pleiades.

wtbarnes commented 5 years ago

So there's also this guide, which I'm sure I've read but completely forgot about. It covers most of what we are aiming to include in this repo.

wtbarnes commented 5 years ago

OK, I think I've got at least a minimal set of instructions for the Pleiades setup. @dshean, can you give this a once-over?

The dask-jobqueue config stuff will still need some fiddling (e.g. selecting different configs for different models), but that can probably go in a separate PR.

rabernat commented 5 years ago

I am working on Pleiades again so I thought I'd check in on this. How's it going?

I also wrote some similar stuff here: https://github.com/rabernat/pleiades_llc_recipes

wtbarnes commented 5 years ago

I've neglected this a bit since the meeting, but was waiting on some comments from @dshean.

A few things that could probably be addressed:

- Add the Dask lab extension config step to install-conda.sh
- Add multiple jobqueue configs for different systems (e.g. Merope, Pleiades, Endeavour). The current config is for Merope.
- Add an example notebook for testing the configuration

These don't necessarily need to be addressed in this PR.
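
As a rough illustration of the multiple-config item, here is a minimal sketch of selecting PBS settings by system name. The `SYSTEM_SETTINGS` table and the `jobqueue_config` helper are hypothetical names introduced for this example. `cores`, `memory`, `queue` and `walltime` are dask-jobqueue PBS options, but every value shown is a placeholder, not a real node specification for Merope, Pleiades or Endeavour.

```python
# Placeholder per-system settings. The numbers and queue names below are
# illustrative assumptions, NOT real node specifications for these systems.
SYSTEM_SETTINGS = {
    "merope": {"cores": 4, "memory": "8GB", "queue": "example-queue"},
    "pleiades": {"cores": 8, "memory": "16GB", "queue": "example-queue"},
    "endeavour": {"cores": 16, "memory": "32GB", "queue": "example-queue"},
}


def jobqueue_config(system, walltime="01:00:00"):
    """Return a nested dict shaped like the `jobqueue: pbs:` section of a Dask config."""
    try:
        settings = SYSTEM_SETTINGS[system.lower()]
    except KeyError:
        known = ", ".join(sorted(SYSTEM_SETTINGS))
        raise ValueError(f"unknown system {system!r}; expected one of: {known}") from None
    # Merge the per-system settings with the requested walltime.
    return {"jobqueue": {"pbs": {**settings, "walltime": walltime}}}


config = jobqueue_config("Merope")
print(config["jobqueue"]["pbs"]["queue"])
```

The returned dictionary mirrors the nesting of a jobqueue config file, so it could be passed to `dask.config.set` or written out as one file per system.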

wtbarnes commented 5 years ago

Another, larger issue is how best to make this information visible. The Pangeo webpage already has a similar version of this info, which I linked to above, as well as the more specific guide that @rabernat linked.

Should we remove the stuff currently on the Pangeo page and consolidate it all here? We could publish to GitHub Pages from this repo and then link to it from the main webpage. I'm not sure what the preferred way forward is, but repeated and possibly conflicting information is never ideal.

This really deserves its own issue(s).

dshean commented 5 years ago

Hi all. I'm sorry @wtbarnes - I dropped the ball on this.

I had not seen the "Getting Started with Pangeo on HPC" guide or @rabernat's existing Pleiades guide. I agree, those are excellent resources and contain most of the important information. There are some small tweaks that can be made for Pleiades, but many come down to personal preferences (e.g., using pbs_rfe to reserve a dedicated node for jupyter lab vs. submitting a script like @rabernat's launch-notebook.sh).

@wtbarnes raises an important question about how to organize and package this material moving forward. I like the idea of a centralized, general recipe, with separate pages documenting relevant details for the different HPC options. But as @rabernat mentioned, things are evolving so quickly that maintenance becomes an issue, and I don't have a sense of the actual demand. How many people out there need this information? Maybe stats on http://pangeo.io/setup_guides/hpc.html?

I think the main issue remains for Pleiades: the primary queues are slammed (especially with end-of-fiscal-year allocations), so spinning up a dask-jobqueue PBSCluster for interactive computing can take hours to days. That's a showstopper. We don't have a "premium" queue on Pleiades that allows for rapid scheduling of many jobs with short walltimes, as required to use the dask-jobqueue model effectively. @wtbarnes and I were going back and forth with support/management on this, so hopefully that will pan out.

I have not tested the dask-mpi option with a single PBS job that reserves many nodes. This could work with the rapid scheduling on the devel queue, which has a 1-job/2-hour limit.

wtbarnes commented 5 years ago

Testing this guide out with some work I'm doing and making a few notes here to remind myself to fix a few things:

- Add a note on how to get the IP of the RFE
- The port forwarding instructions for the RFE are wrong. Both IP addresses need to be specified, so that line should actually read: ssh -L 127.0.0.1:8888:${RFE_IP}:8888
- I'm not sure setting the env var for the Dask config works. We should probably recommend a different solution, e.g. just copying it to ~/.config/dask/jobqueue.yml
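
The two fixes above could be sketched roughly as follows. This is a minimal illustration, not the repo's actual setup script: `port_forward_command` and `install_jobqueue_config` are hypothetical helper names, and the login host and the example IP address are placeholders, since the thread gives only the `-L` argument.

```python
import shlex
import shutil
from pathlib import Path


def port_forward_command(rfe_ip, login_host, port=8888):
    """Build an ssh command that forwards a local port to the same port on the RFE."""
    # ssh -L takes bind_address:local_port:remote_host:remote_port
    forward = f"127.0.0.1:{port}:{rfe_ip}:{port}"
    return shlex.join(["ssh", "-L", forward, login_host])


def install_jobqueue_config(source, config_dir=None):
    """Copy a jobqueue config file into the user's Dask config directory."""
    # Default to ~/.config/dask, the location suggested in the note above.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".config" / "dask"
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "jobqueue.yml"
    shutil.copyfile(source, target)
    return target


# Placeholder values: 192.0.2.10 is a documentation-only address and
# user@login.example is a made-up login host.
print(port_forward_command("192.0.2.10", "user@login.example"))
```

Copying the file sidesteps the question of whether the environment variable is picked up, at the cost of overwriting any existing jobqueue.yml in that directory.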

wtbarnes commented 1 year ago

I suspect this effort has gone very stale and, unfortunately, I do not really have the time or interest to finish it. If someone else wants to pick this up, please feel free to reopen!

pangeo-data / pangeo-for-hpc

WIP: Setup instructions for NASA Pleiades #1