matthiasdiener closed this 2 months ago
Could we easily check the cache path within the driver?
Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.
Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?
> Could we easily check the cache path within the driver?
>
> Should a snippet be added to the mpi entry point to indicate the cache directories via stdout? Since the cache directories potentially have an impact on performance, maybe it wouldn't be such a bad idea.
What do you think of cfc1af1?
The output looks like:
```
LOOPY_NO_CACHE= POCL_CACHE_DIR=foo python -m mpi4py examples/wave.py --numpy
[...]
Rank 0 disk cache config: loopy: False ; pyopencl: True (default dir); pocl: True (foo);
```
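For context, a report like that can be assembled purely from environment variables. A minimal sketch: `LOOPY_NO_CACHE`, `PYOPENCL_NO_CACHE`, and `POCL_CACHE_DIR` are the real knobs these packages honor, but the helper name and exact report format here are illustrative, not the actual code in cfc1af1.

```python
import os


def cache_config_summary() -> str:
    """Summarize the disk-cache configuration from environment variables.

    Illustrative helper: setting LOOPY_NO_CACHE / PYOPENCL_NO_CACHE
    (even to an empty string) disables the respective cache, and
    POCL_CACHE_DIR overrides pocl's default cache location.
    """
    loopy_on = "LOOPY_NO_CACHE" not in os.environ
    pyopencl_on = "PYOPENCL_NO_CACHE" not in os.environ
    pocl_dir = os.environ.get("POCL_CACHE_DIR")
    pocl_desc = f"True ({pocl_dir})" if pocl_dir else "True (default dir)"
    return (f"loopy: {loopy_on} ; pyopencl: {pyopencl_on} ; "
            f"pocl: {pocl_desc}")
```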
> Will multiple jobs running at the same time potentially collide if they all indicate the same cache locations?
Yes, that can happen, in particular if you are running exactly the same case with the same runtime config at the same time.
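One way to avoid such collisions would be to key the cache directory on the scheduler's job id, so simultaneous runs of the same case write to distinct caches. A sketch, assuming `SLURM_JOB_ID` (Slurm) or `LSB_JOBID` (LSF) is available; the helper name is made up:

```python
import os


def per_job_cache_dir(base: str) -> str:
    """Return a cache directory unique to the current batch job.

    Illustrative only: appends the scheduler's job id (SLURM_JOB_ID on
    Slurm, LSB_JOBID on LSF) when one is set, so two jobs running the
    same case at the same time do not share one cache directory.
    """
    job_id = os.environ.get("SLURM_JOB_ID") or os.environ.get("LSB_JOBID")
    return os.path.join(base, job_id) if job_id else base
```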
From an email by John Gyllenhaal on Mar 12, 2020:

> You can also override the CUDA_CACHE_PATH by setting it before running jsrun/lrun/srun but please do not set it to a NFS mounted directory (do NOT use a home directory, a workspace directory, or a gapps directory)!

Looking at https://hpc.llnl.gov/documentation/tutorials/livermore-computing-resources-and-environment#file-systems and https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#file-systems, `/p/gpfs1/$USER` may be a better location for the default disk cache storage?
It is a little odd to default it opaquely to a platform-specific filesystem path. The previous setting was `/tmp`, and that's reasonable because it is platform-independent, node-local, and job-local. When we were given space in the LC workspace, they also asked that we not install conda-based packages there because of the typical sizes of the base installs. That includes MIRGE-Com, so we should also not be installing MIRGE-Com in the NFS locations indicated above.
Where has `CUDA_CACHE_PATH` been defaulting to all this time?
For most of us, and following system guidance, `$(pwd)` from a job script will already be somewhere down in `/p/gpfs1/$USER` on lassen, or on Lustre for quartz. Nobody should be running parallel jobs from NFS-mounted file spaces. So I think defaulting to `$(pwd)` is also reasonable.
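A sketch of that `$(pwd)` fallback, assuming an explicitly set cache variable should still win; the helper and the `.mirge-cache` subdirectory name are invented for illustration:

```python
import os


def default_cache_root() -> str:
    """Pick a disk-cache root for this run.

    Illustrative: honor POCL_CACHE_DIR or CUDA_CACHE_PATH if the user
    set one, otherwise fall back to a subdirectory of the current
    working directory, which on lassen/quartz job scripts is already
    on parallel scratch rather than NFS.
    """
    for var in ("POCL_CACHE_DIR", "CUDA_CACHE_PATH"):
        if os.environ.get(var):
            return os.environ[var]
    return os.path.join(os.getcwd(), ".mirge-cache")
```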
Questions for the review: