illinois-ceesd / mirgecom

MIRGE-Com is the workhorse simulation application for the Center for Exascale-Enabled Scramjet Design at the University of Illinois.

scripts: move to FS-global / rank-local caches #1017

Closed · matthiasdiener closed this 2 months ago

matthiasdiener commented 4 months ago

Questions for the review:

tulioricci commented 4 months ago

Could we easily check the cache path within the driver?

MTCam commented 4 months ago

Could we easily check the cache path within the driver?

Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.

Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?

matthiasdiener commented 4 months ago

Could we easily check the cache path within the driver?

Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.

What do you think of cfc1af1?

The output looks like:

```
LOOPY_NO_CACHE= POCL_CACHE_DIR=foo python -m mpi4py examples/wave.py --numpy
[...]
Rank 0 disk cache config: loopy: False ; pyopencl: True (default dir); pocl: True (foo);
```
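For reference, a rough sketch of how such a per-rank report can be assembled from the environment variables that loopy, pyopencl, and pocl consult. This is illustrative only, not the code in cfc1af1; the function name and the exact output wording are assumptions.

```python
import os

def disk_cache_summary(rank: int) -> str:
    # loopy and pyopencl disable their disk caches when the corresponding
    # *_NO_CACHE variable is set, even to an empty string (as in the command above).
    loopy_on = "LOOPY_NO_CACHE" not in os.environ
    pyopencl_on = "PYOPENCL_NO_CACHE" not in os.environ
    # pocl disables its kernel cache when POCL_KERNEL_CACHE=0 and honors
    # POCL_CACHE_DIR for the cache location.
    pocl_on = os.environ.get("POCL_KERNEL_CACHE", "1") != "0"
    pocl_dir = os.environ.get("POCL_CACHE_DIR", "default dir")
    return (f"Rank {rank} disk cache config: "
            f"loopy: {loopy_on}; pyopencl: {pyopencl_on}; "
            f"pocl: {pocl_on} ({pocl_dir});")

# Printed once per rank from the MPI entry point, e.g.:
#   print(disk_cache_summary(comm.Get_rank()))
```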

Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?

Yes, that can happen, in particular if you are running exactly the same case with the same runtime config at the same time.
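One way to avoid such collisions (an illustration, not something this PR does) would be to fold a scheduler-provided job id into the cache path so that concurrent jobs never share a directory; the helper below is hypothetical and assumes the standard LSF/Slurm environment variables.

```python
import os

def job_unique_cache_dir(base: str) -> str:
    # LSF (lassen) exports LSB_JOBID; Slurm (quartz) exports SLURM_JOB_ID.
    job_id = os.environ.get("LSB_JOBID") or os.environ.get("SLURM_JOB_ID") or "nojob"
    path = os.path.join(base, f"job-{job_id}")
    os.makedirs(path, exist_ok=True)
    return path

# For example, before pyopencl/pocl are initialized:
#   os.environ.setdefault("POCL_CACHE_DIR", job_unique_cache_dir("/tmp/pocl-cache"))
```

The trade-off is that nothing is reused across jobs, so every job pays the full compile cost at least once.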

matthiasdiener commented 4 months ago

From an email by John Gyllenhaal on Mar 12, 2020:

You can also override the CUDA_CACHE_PATH by setting it before running jsrun/lrun/srun but please do not set it to a NFS mounted directory (do NOT use a home directory, a workspace directory, or a gapps directory)!

Looking at https://hpc.llnl.gov/documentation/tutorials/livermore-computing-resources-and-environment#file-systems and https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#file-systems, perhaps /p/gpfs1/$USER would be a better location for the default disk cache storage?
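As a sketch only (the path choice and fallback are assumptions, not part of this PR), and assuming the CUDA driver picks up CUDA_CACHE_PATH at first use within the process, a default along those lines could be set early in the Python entry point, before any device initialization:

```python
import getpass
import os

def default_cuda_cache_path() -> str:
    # Prefer the parallel filesystem on LC machines; fall back to node-local /tmp.
    gpfs_user = os.path.join("/p/gpfs1", getpass.getuser())
    base = gpfs_user if os.path.isdir(gpfs_user) else os.path.join("/tmp", getpass.getuser())
    return os.path.join(base, "cuda-cache")

# Only applied if the user has not already set it in the job script:
os.environ.setdefault("CUDA_CACHE_PATH", default_cuda_cache_path())
```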

MTCam commented 4 months ago

From an email by John Gyllenhaal on Mar 12, 2020:

You can also override the CUDA_CACHE_PATH by setting it before running jsrun/lrun/srun but please do not set it to a NFS mounted directory (do NOT use a home directory, a workspace directory, or a gapps directory)!

Looking at https://hpc.llnl.gov/documentation/tutorials/livermore-computing-resources-and-environment#file-systems and https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#file-systems, perhaps /p/gpfs1/$USER would be a better location for the default disk cache storage?

It is a little odd to default it opaquely to a platform-specific filesystem path. The previous setting was /tmp, which is reasonable because it is platform-independent, node-local, and job-local. When we were given space in the LC workspace, we were also asked not to install conda-based packages there because of the typical size of their base installs. That includes MIRGE-Com, so we should not be installing MIRGE-Com in those NFS-mounted locations indicated above either.

Where has CUDA_CACHE_PATH defaulted to all this time?

For most of us, and following system guidance, $(pwd) from a job script will already be somewhere down in /p/gpfs1/$USER on lassen, or on Lustre for quartz. Nobody should be running parallel jobs from NFS-mounted file spaces. So I think defaulting to $(pwd) is also reasonable.
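A minimal sketch of that suggestion (the directory layout is illustrative, not existing mirgecom behavior): root the caches under the job's working directory and keep them rank-local, which avoids NFS without hard-coding a machine-specific path.

```python
import os

def cwd_cache_root(rank: int) -> str:
    # $(pwd) of the job script is typically already on /p/gpfs1 (lassen) or
    # Lustre (quartz), so per-rank subdirectories here stay off NFS.
    path = os.path.join(os.getcwd(), ".cache", f"rank{rank}")
    os.makedirs(path, exist_ok=True)
    return path
```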