illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

scripts: move to FS-global / rank-local caches #1017

Closed · matthiasdiener closed this 2 months ago

matthiasdiener commented 4 months ago

Questions for the review:

tulioricci commented 4 months ago

Could we easily check the cache path within the driver?

MTCam commented 4 months ago

Could we easily check the cache path within the driver?

Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.

Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?

matthiasdiener commented 4 months ago

Could we easily check the cache path within the driver?

Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.

What do you think of cfc1af1?

The output looks like:

```
LOOPY_NO_CACHE= POCL_CACHE_DIR=foo python -m mpi4py examples/wave.py --numpy
[...]
Rank 0 disk cache config: loopy: False ; pyopencl: True (default dir); pocl: True (foo);
```
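For reference, a rough sketch of how such a per-rank report can be assembled from the environment variables that loopy, pyopencl, and pocl consult. This is illustrative only, not the code in cfc1af1; the function name and the exact output wording are assumptions.

```python
import os

def disk_cache_summary(rank: int) -> str:
    # loopy and pyopencl disable their disk caches when the corresponding
    # *_NO_CACHE variable is set, even to an empty string (as in the command above).
    loopy_on = "LOOPY_NO_CACHE" not in os.environ
    pyopencl_on = "PYOPENCL_NO_CACHE" not in os.environ
    # pocl disables its kernel cache when POCL_KERNEL_CACHE=0 and honors
    # POCL_CACHE_DIR for the cache location.
    pocl_on = os.environ.get("POCL_KERNEL_CACHE", "1") != "0"
    pocl_dir = os.environ.get("POCL_CACHE_DIR", "default dir")
    return (f"Rank {rank} disk cache config: "
            f"loopy: {loopy_on}; pyopencl: {pyopencl_on}; "
            f"pocl: {pocl_on} ({pocl_dir});")

# Printed once per rank from the MPI entry point, e.g.:
#   print(disk_cache_summary(comm.Get_rank()))
```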

Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?

Yes, that can happen, in particular if you are running exactly the same case with the same runtime config at the same time.
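One way to avoid such collisions (an illustration, not something this PR does) would be to fold a scheduler-provided job id into the cache path so that concurrent jobs never share a directory; the helper below is hypothetical and assumes the standard LSF/Slurm environment variables.

```python
import os

def job_unique_cache_dir(base: str) -> str:
    # LSF (lassen) exports LSB_JOBID; Slurm (quartz) exports SLURM_JOB_ID.
    job_id = os.environ.get("LSB_JOBID") or os.environ.get("SLURM_JOB_ID") or "nojob"
    path = os.path.join(base, f"job-{job_id}")
    os.makedirs(path, exist_ok=True)
    return path

# For example, before pyopencl/pocl are initialized:
#   os.environ.setdefault("POCL_CACHE_DIR", job_unique_cache_dir("/tmp/pocl-cache"))
```

The trade-off is that nothing is reused across jobs, so every job pays the full compile cost at least once.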

matthiasdiener commented 4 months ago

From an email by John Gyllenhaal on Mar 12, 2020:

You can also override the CUDA_CACHE_PATH by setting it before running jsrun/lrun/srun but please do not set it to a NFS mounted directory (do NOT use a home directory, a workspace directory, or a gapps directory)!

Looking at https://hpc.llnl.gov/documentation/tutorials/livermore-computing-resources-and-environment#file-systems and https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#file-systems, perhaps /p/gpfs1/$USER would be a better location for the default disk cache storage?
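As a sketch only (the path choice and fallback are assumptions, not part of this PR), and assuming the CUDA driver picks up CUDA_CACHE_PATH at first use within the process, a default along those lines could be set early in the Python entry point, before any device initialization:

```python
import getpass
import os

def default_cuda_cache_path() -> str:
    # Prefer the parallel filesystem on LC machines; fall back to node-local /tmp.
    gpfs_user = os.path.join("/p/gpfs1", getpass.getuser())
    base = gpfs_user if os.path.isdir(gpfs_user) else os.path.join("/tmp", getpass.getuser())
    return os.path.join(base, "cuda-cache")

# Only applied if the user has not already set it in the job script:
os.environ.setdefault("CUDA_CACHE_PATH", default_cuda_cache_path())
```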

MTCam commented 4 months ago

From an email by John Gyllenhaal on Mar 12, 2020:

You can also override the CUDA_CACHE_PATH by setting it before running jsrun/lrun/srun but please do not set it to a NFS mounted directory (do NOT use a home directory, a workspace directory, or a gapps directory)!

Looking at https://hpc.llnl.gov/documentation/tutorials/livermore-computing-resources-and-environment#file-systems and https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#file-systems, perhaps /p/gpfs1/$USER would be a better location for the default disk cache storage?

It is a little odd to default it opaquely to a platform-specific filesystem path. The previous setting was /tmp, which is reasonable because it is platform-independent, node-local, and job-local. When we were given space in the LC workspace, we were also asked not to install conda-based packages there because of the typical size of their base installs. That includes MIRGE-Com, so we should not be installing MIRGE-Com in those NFS-mounted locations indicated above either.

Where has CUDA_CACHE_PATH defaulted to all this time?

For most of us, and following system guidance, $(pwd) from a job script will already be somewhere down in /p/gpfs1/$USER on lassen, or on Lustre for quartz. Nobody should be running parallel jobs from NFS-mounted file spaces. So I think defaulting to $(pwd) is also reasonable.
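A minimal sketch of that suggestion (the directory layout is illustrative, not existing mirgecom behavior): root the caches under the job's working directory and keep them rank-local, which avoids NFS without hard-coding a machine-specific path.

```python
import os

def cwd_cache_root(rank: int) -> str:
    # $(pwd) of the job script is typically already on /p/gpfs1 (lassen) or
    # Lustre (quartz), so per-rank subdirectories here stay off NFS.
    path = os.path.join(os.getcwd(), ".cache", f"rank{rank}")
    os.makedirs(path, exist_ok=True)
    return path
```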