Open tskisner opened 4 years ago
That is already implemented in so_pysm_models, which has bigger files:
https://github.com/simonsobs/so_pysm_models/blob/master/so_pysm_models/utils/__init__.py#L47
Next I'll be porting this into PySM proper, when I bring all the so_pysm_models to pysm. If you need it urgently, I can give priority to this small fix.
I switched to be notified only if mentioned, so please tag me for relevant issues.
Thanks @zonca, I'll disable PySM temporarily in my branch for #332, so that I can make progress on that. As soon as you have a new release I'll update the packages at NERSC and retest.
ok, PySM 3.2.0 should fix this issue, PyPI updated, conda package is building
Hello @zonca, I confirm that PySM data files now are loaded from disk explicitly. I had to do:
export PYSM_LOCAL_DATA=/global/cfs/cdirs/cmb/www/pysm-data
and also disable UserWarning for PySM:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pysm")
Or else the logs were filled with hundreds of lines warning about the use of the manual data cache. I am now seeing the astropy data cache lock error from healpy:
Proc 55: Traceback (most recent call last):
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/bin/toast_satellite_sim.py", line 524, in <module>
main()
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/bin/toast_satellite_sim.py", line 347, in main
args, comm, data, [focalplane], "signal"
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/timing.py", line 43, in df
return f(*args, **kwargs)
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/pipeline_tools/sky_signal.py", line 360, in simulate_sky_signal
op_sim_pysm.exec(data)
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/timing.py", line 43, in df
return f(*args, **kwargs)
Proc 55: File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/todmap/sim_det_pysm.py", line 242, in exec
full_map_rank0, use_pixel_weights=True
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_aux/lib/python3.6/site-packages/healpy-1.12.10-py3.6-linux-x86_64.egg/healpy/rotator.py", line 417, in rotate_map_alms
m, use_pixel_weights=use_pixel_weights, lmax=lmax, mmax=mmax
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_aux/lib/python3.6/site-packages/healpy-1.12.10-py3.6-linux-x86_64.egg/healpy/sphtfunc.py", line 216, in map2alm
package="healpy",
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 627, in get_pkg_data_filename
conf.dataurl_mirror + data_name])
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1127, in download_file
with _cache(pkgname) as (dldir, url2hash):
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1694, in _cache
with _cache_lock(pkgname), shelve.open(urlmapfn, flag="r") as url2hash:
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
Proc 55: File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1595, in _cache_lock
raise RuntimeError(msg)
Proc 55: RuntimeError: Cache is locked after 5.05 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/k/kisner/.astropy/cache/download/py3/lock. Lock claims to be held by process 156638.
The error comes from healpy downloading the pixel weight files. Remember there are multiple groups of processes where rank zero in each group is attempting to download the same files. I guess this will require some hack to trigger the download on one process before multiple processes try to read the file.
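One possible shape for that hack, as a minimal sketch rather than a tested fix: let only rank 0 of a communicator touch the download cache, then broadcast the resulting local path to every other rank. The helper name `resolve_on_root` and the `SerialComm` stand-in are hypothetical; the only communicator methods assumed are `Get_rank()` and `bcast()`, which mpi4py communicators provide. Note that this only helps if the calling code can then be given an explicit path instead of going through the cache itself.

```python
class SerialComm:
    """Hypothetical stand-in exposing the two mpi4py methods used below."""

    def Get_rank(self):
        return 0

    def bcast(self, obj, root=0):
        # With a single process there is nothing to send.
        return obj


def resolve_on_root(comm, resolve, *args, **kwargs):
    """Call resolve(*args, **kwargs) on rank 0 only and broadcast the result.

    Only one process runs the download or cache lookup; all other
    ranks receive whatever rank 0 got back (for example a local path).
    """
    result = resolve(*args, **kwargs) if comm.Get_rank() == 0 else None
    return comm.bcast(result, root=0)


# Possible use with mpi4py and astropy (both assumed installed, url is a placeholder):
#   from mpi4py import MPI
#   from astropy.utils.data import download_file
#   path = resolve_on_root(MPI.COMM_WORLD, download_file, url, cache=True)
```

Because `bcast` is collective, every rank has to call `resolve_on_root`, including the ranks that are not rank zero of their group. Using the world communicator rather than the group communicator avoids having several group roots hit the cache at once.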
OK, can you open an issue on healpy with a minimal test case?
I can, but this comes about because two processes on different nodes are trying to download the same file to the same directory on a shared disk. Does healpy care about that? Or is the answer to pre-download all possible healpy data files?
It depends on how difficult it is to implement
When running PySM, data files are downloaded using astropy.utils.data, which caches files in ~/.astropy. When multiple MPI groups are used in toast, there are multiple processes trying to download and read files. Even if the maps already exist in the cache (say, from a previous run), multiple processes reading these files at the same time produce an astropy exception saying that the cache was locked for more than 5 seconds (the default lock timeout).
Here is a tarball of a self-contained example on cori.nersc.gov.
job_nersc-cori-knl_satellite_small.tar.gz
A typical traceback from master branch is:
I am trying to fix this in another branch, but even serializing the download and the reading by different groups produces the cache lock error above. Perhaps a larger question is: is there any way to just tell pysm which model files to use? By specifying the path on disk?
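To make the question concrete, here is a rough sketch of what a "local path first" lookup could look like. It is not PySM's implementation: the function name `find_local_data` is hypothetical, and the environment variable name is borrowed from the `PYSM_LOCAL_DATA` setting mentioned elsewhere in this thread. The idea is that a file found under a user-specified folder is opened directly, so the astropy cache and its lock are never involved.

```python
import os
from pathlib import Path


def find_local_data(relative_path, env_var="PYSM_LOCAL_DATA"):
    """Look for relative_path under the folder named by env_var.

    Returns a Path if the file exists there, otherwise None so the
    caller can fall back to its usual download-and-cache route.
    """
    root = os.environ.get(env_var)
    if not root:
        # Variable unset or empty: no local data folder configured.
        return None
    candidate = Path(root) / relative_path
    return candidate if candidate.is_file() else None
```

A caller would try `find_local_data(...)` first and only go to the network when it returns `None`; on a shared filesystem every process can then read the same file without any cache locking.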