Closed wtbarnes closed 1 year ago
So there's also this guide which I'm sure I've read, but completely forgot about. It covers almost everything we are aiming to include in this repo.
Ok I think I've got at least a minimal set of instructions for the Pleiades setup. @dshean can you give this a once over?
The dask-jobqueue config stuff will still need some fiddling (e.g. selecting different configs for different models), but that can probably go in a separate PR.
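For reference, here is a minimal sketch of what per-model config selection could look like in Python. The helper names (`NODE_SPECS`, `pbs_cluster_kwargs`, `make_cluster`) are hypothetical and not part of this PR, and the core counts and memory sizes are assumptions to check against the current NAS hardware documentation. Only the `PBSCluster` keyword arguments (`queue`, `cores`, `memory`, `walltime`, `resource_spec`) come from dask-jobqueue itself.

```python
# Illustrative per-model node specs. These numbers are assumptions; verify
# them against the current NAS hardware documentation before relying on them.
NODE_SPECS = {
    "bro": {"cores": 28, "memory": "128GB"},
    "has": {"cores": 24, "memory": "128GB"},
    "ivy": {"cores": 20, "memory": "64GB"},
}


def pbs_cluster_kwargs(model, queue="normal", walltime="01:00:00"):
    """Build keyword arguments for dask_jobqueue.PBSCluster for one node model."""
    spec = NODE_SPECS[model]
    return {
        "queue": queue,
        "cores": spec["cores"],
        "memory": spec["memory"],
        "walltime": walltime,
        # Request one whole node of the chosen model per dask-jobqueue job.
        "resource_spec": f"select=1:ncpus={spec['cores']}:model={model}",
    }


def make_cluster(model, **overrides):
    """Create a PBSCluster for `model`.

    dask_jobqueue is imported lazily so pbs_cluster_kwargs stays usable
    on machines where it is not installed.
    """
    from dask_jobqueue import PBSCluster

    kwargs = pbs_cluster_kwargs(model)
    kwargs.update(overrides)
    return PBSCluster(**kwargs)
```

Something like `make_cluster("bro", queue="devel")` would then start a cluster on the assumed Broadwell spec. The same values could live in the dask-jobqueue YAML config instead.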
I am working on Pleiades again so I thought I'd check in on this. How's it going?
I also wrote some similar stuff here: https://github.com/rabernat/pleiades_llc_recipes
I've neglected this a bit since the meeting, but was waiting on some comments from @dshean.
A few things that could probably be addressed:
install-conda.sh
These don't necessarily need to be addressed in this PR.
Another larger issue is how best to make this information visible? The Pangeo webpage already has a similar version of this info that I linked to above as well as the more specific guide that @rabernat linked.
Should we remove the stuff currently on the pangeo page and consolidate it all here? We could publish to GitHub pages from this repo and then link to it from the main webpage. I'm not sure what the preferred way forward is, but repeated and possibly conflicting information is never ideal.
This really deserves its own issue(s).
Hi all. I'm sorry @wtbarnes - I dropped the ball on this.
I had not seen the "Getting Started with Pangeo on HPC" guide or @rabernat's existing Pleiades guide. I agree, those are excellent resources and contain most of the important information. There are some small tweaks that can be made for Pleiades, but many come down to personal preferences (e.g., using pbs_rfe to reserve a dedicated node for jupyter lab vs. submitting a script like @rabernat's launch-notebook.sh).
@wtbarnes raises an important question about how to organize and package this material moving forward. I like the idea of a centralized, general recipe, with separate pages documenting relevant details for the different HPC options. But as @rabernat mentioned, things are evolving so quickly that maintenance becomes an issue, and I don't have a sense of the actual demand. How many people out there need this information? Maybe stats on http://pangeo.io/setup_guides/hpc.html?
I think the main issue still remains for Pleiades: the primary queues are slammed (especially with end of fiscal year allocations), so spinning up a dask-jobqueue PBSCluster for interactive computing can take hours to days. That's a showstopper. We don't have a "premium" queue on Pleiades that allows for rapid scheduling of many jobs with short walltimes, as required to effectively use the dask-jobqueue model. @wtbarnes and I were going back and forth with support/management on this, so hopefully that will pan out.
I have not tested the dask-mpi option with a single PBS job that reserves many nodes. This could work with the rapid scheduling for 1-job/2-hour limit on the devel queue.
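To make the dask-mpi idea concrete, here is an untested sketch that renders a single PBS job script for the devel queue. The function name, the default node model, the `ranks_per_node` value, the script name `analysis.py`, and the use of `mpiexec` are all assumptions for illustration. The only dask-mpi calls involved are `dask_mpi.initialize()` followed by `dask.distributed.Client()`, which would go at the top of the analysis script.

```python
# Preamble the analysis script would need (shown as text, not executed here).
# With dask-mpi, rank 0 becomes the scheduler, rank 1 runs the script body,
# and the remaining ranks become workers.
ANALYSIS_PREAMBLE = """\
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()  # connects to the scheduler started by initialize()
"""


def render_dask_mpi_job(nodes, ranks_per_node=4, cores_per_node=28,
                        model="bro", queue="devel", walltime="02:00:00",
                        script="analysis.py"):
    """Return the text of a PBS script that runs `script` under MPI."""
    total_ranks = nodes * ranks_per_node
    select = (f"select={nodes}:ncpus={cores_per_node}"
              f":mpiprocs={ranks_per_node}:model={model}")
    lines = [
        "#!/bin/bash",
        f"#PBS -q {queue}",
        f"#PBS -l {select}",
        f"#PBS -l walltime={walltime}",
        "cd $PBS_O_WORKDIR",
        f"mpiexec -np {total_ranks} python {script}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `print(render_dask_mpi_job(2))` produces a script requesting two nodes and eight MPI ranks, which could be written to a file and submitted with `qsub`. Since this is a single job, it only has to get through the queue once.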
Testing this guide out with some work I'm doing and making a few notes here to remind myself to fix a few things:
ssh -L 127.0.0.1:8888:${RFE_IP}:8888
~/.config/dask/jobqueue.yml
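As a reminder of how the tunnel note above fits together, here is a small sketch that builds the port-forwarding command. The `login_host` argument is a hypothetical placeholder for whatever SSH destination you normally use; it is not specified in the note above, and the function name is made up for illustration.

```python
import shlex


def tunnel_command(rfe_ip, login_host, port=8888):
    """Build an ssh command forwarding a local port to the same port on rfe_ip.

    `login_host` is a placeholder for your usual SSH destination.
    """
    argv = ["ssh", "-L", f"127.0.0.1:{port}:{rfe_ip}:{port}", login_host]
    # shlex.join quotes each argument so the string is safe to paste in a shell.
    return shlex.join(argv)
```

Calling `tunnel_command(rfe_ip, login_host)` with your own values returns a string you can paste into a local terminal, after which Jupyter Lab should be reachable at `127.0.0.1:8888`.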
I suspect this effort has gone very stale and I do not really have the time or interest to finish this unfortunately. If someone else wants to pick this up, please feel free to reopen!
This is a work in progress for some install instructions and setup scripts for getting a Jupyter Lab and Dask cluster up and running on NASA Pleiades.