hpc4cmb / toast

Time Ordered Astrophysics Scalable Tools

PySM astropy.utils.data cache locking #330

Open tskisner opened 4 years ago

tskisner commented 4 years ago

When running PySM, data files are downloaded with astropy.utils.data, which caches them in ~/.astropy. When toast uses multiple MPI groups, several processes try to download and read the same files. Even if the maps are already in the cache (say, from a previous run), multiple processes reading them at the same time triggers an astropy exception saying the cache was locked for more than 5 seconds (the default lock timeout).
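The failure mode can be reproduced in miniature. The sketch below is not astropy's code; it is a simplified stand-in that assumes the lock is a directory created with `os.mkdir` (the error message in the traceback below refers to removing a `lock` directory) and that a waiter gives up after a timeout. The names `directory_lock` and `demo_contention` are hypothetical.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def directory_lock(lock_dir, timeout=5.0, poll=0.05):
    """Hold a lock represented by a directory; give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # mkdir fails if the directory already exists, so only one
            # caller at a time can create it.
            os.mkdir(lock_dir)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise RuntimeError(
                    f"Cache is locked after {timeout:.2f} s: {lock_dir}"
                )
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the lock so the next waiter can proceed.
        os.rmdir(lock_dir)


def demo_contention():
    """Show that a second acquirer times out while the lock is held."""
    with tempfile.TemporaryDirectory() as cache:
        lock_dir = os.path.join(cache, "lock")
        with directory_lock(lock_dir):
            try:
                with directory_lock(lock_dir, timeout=0.2):
                    return "acquired"
            except RuntimeError as exc:
                return f"timed out: {exc}"


print(demo_contention())
```

With many ranks queued behind one lock, any rank whose wait exceeds the timeout fails in this way, even when every rank only wants to read.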

Here is a tarball of a self-contained example on cori.nersc.gov.

job_nersc-cori-knl_satellite_small.tar.gz

A typical traceback from master branch is:

Proc 45: Traceback (most recent call last):
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/bin/toast_satellite_sim.py", line 477, in <module>
    main()
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/bin/toast_satellite_sim.py", line 341, in main
    args, comm, data, [focalplane], "signal"
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/timing.py", line 43, in df
    return f(*args, **kwargs)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/pipeline_tools/sky_signal.py", line 358, in simulate_sky_signal
    pixels=pixels,
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/timing.py", line 43, in df
    return f(*args, **kwargs)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/todmap/sim_det_pysm.py", line 142, in __init__
    map_dist=map_dist,
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/todmap/pysm.py", line 76, in __init__
    if init_sky
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/timing.py", line 43, in df
    return f(*args, **kwargs)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/toast/todmap/pysm.py", line 124, in init_sky
    output_unit=self._units,
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/pysm-3.1.dev0-py3.6.egg/pysm/sky.py", line 101, in __init__
    component_config, nside=nside, map_dist=map_dist
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/pysm-3.1.dev0-py3.6.egg/pysm/sky.py", line 46, in create_components_from_config
    **remove_class_from_dict(model_config), nside=nside, map_dist=map_dist
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/pysm-3.1.dev0-py3.6.egg/pysm/models/power_law.py", line 52, in __init__
    self.I_ref = self.read_map(map_I, unit=unit_I)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/pysm-3.1.dev0-py3.6.egg/pysm/models/template.py", line 65, in read_map
    dataurl=self.dataurl,
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_aux/lib/python3.6/site-packages/pysm-3.1.dev0-py3.6.egg/pysm/models/template.py", line 200, in read_map
    filename = data.get_pkg_data_filename(path)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 627, in get_pkg_data_filename
    conf.dataurl_mirror + data_name])
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1127, in download_file
    with _cache(pkgname) as (dldir, url2hash):
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1690, in _cache
    with _cache_lock(pkgname), shelve.open(urlmapfn, flag="r") as url2hash:
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
Proc 45:   File "/global/common/software/cmb/cori/cmbenv-intel_20200115/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1591, in _cache_lock
    raise RuntimeError(msg)
Proc 45: RuntimeError: Cache is locked after 5.00 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/cscratch1/sd/kisner/astropy/download/py3/lock. Lock claims to be held by process 158229.

I am trying to fix this in another branch, but even serializing the download and the read across groups still produces the cache lock error above. Perhaps the larger question is: is there a way to tell PySM directly which model files to use, for example by specifying a path on disk?

zonca commented 4 years ago

that is already implemented in so_pysm_models, which has bigger files:

https://github.com/simonsobs/so_pysm_models/blob/master/so_pysm_models/utils/__init__.py#L47

Next I'll be porting this into PySM proper, when I bring all of the so_pysm_models into PySM. If you need it urgently, I can prioritize this small fix.
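The general idea of preferring a local copy over the download cache can be sketched as follows. This is not the so_pysm_models implementation; the function name `resolve_data_file`, the `download` callback, and the environment variable `EXAMPLE_LOCAL_DATA` are all hypothetical and chosen only for illustration.

```python
import os


def resolve_data_file(relative_path, download, env_var="EXAMPLE_LOCAL_DATA"):
    """Return a path for `relative_path`, downloading only as a fallback.

    `download` is any callable that takes the relative path and returns a
    local filename (for example, a wrapper around a caching downloader).
    """
    local_root = os.environ.get(env_var)
    if local_root:
        candidate = os.path.join(local_root, relative_path)
        if os.path.exists(candidate):
            # Found on disk: the download cache is never touched,
            # so its lock is never taken.
            return candidate
    # No local copy available: fall back to the (locking) download path.
    return download(relative_path)
```

Because the local branch never enters the cache, any number of processes can resolve the same file concurrently without contending for a lock.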

#331 is unnecessary; please do not merge it.

zonca commented 4 years ago

I switched to be notified only if mentioned, so please tag me for relevant issues.

tskisner commented 4 years ago

Thanks @zonca, I'll disable PySM temporarily in my branch for #332, so that I can make progress on that. As soon as you have a new release I'll update the packages at NERSC and retest.

zonca commented 4 years ago

OK, PySM 3.2.0 should fix this issue. PyPI is updated and the conda package is building.

tskisner commented 4 years ago

Hello @zonca, I confirm that PySM data files now are loaded from disk explicitly. I had to do:

export PYSM_LOCAL_DATA=/global/cfs/cdirs/cmb/www/pysm-data

and also disable UserWarning for PySM:

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pysm")

Otherwise the logs were filled with hundreds of lines warning about the use of the manual data cache. I am now seeing the astropy data cache lock error from healpy:

Proc 55: Traceback (most recent call last):
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/bin/toast_satellite_sim.py", line 524, in <module>
    main()
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/bin/toast_satellite_sim.py", line 347, in main
    args, comm, data, [focalplane], "signal"
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/timing.py", line 43, in df
    return f(*args, **kwargs)
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/pipeline_tools/sky_signal.py", line 360, in simulate_sky_signal
    op_sim_pysm.exec(data)
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/timing.py", line 43, in df
    return f(*args, **kwargs)
Proc 55:   File "/global/common/software/cmb/kisner/cori/toast/lib/python3.6/site-packages/toast/todmap/sim_det_pysm.py", line 242, in exec
    full_map_rank0, use_pixel_weights=True
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_aux/lib/python3.6/site-packages/healpy-1.12.10-py3.6-linux-x86_64.egg/healpy/rotator.py", line 417, in rotate_map_alms
    m, use_pixel_weights=use_pixel_weights, lmax=lmax, mmax=mmax
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_aux/lib/python3.6/site-packages/healpy-1.12.10-py3.6-linux-x86_64.egg/healpy/sphtfunc.py", line 216, in map2alm
    package="healpy",
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 627, in get_pkg_data_filename
    conf.dataurl_mirror + data_name])
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1127, in download_file
    with _cache(pkgname) as (dldir, url2hash):
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1694, in _cache
    with _cache_lock(pkgname), shelve.open(urlmapfn, flag="r") as url2hash:
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
Proc 55:   File "/global/common/software/cmb/cori/cmbenv-gcc_20200408/cmbenv_python/lib/python3.6/site-packages/astropy/utils/data.py", line 1595, in _cache_lock
    raise RuntimeError(msg)
Proc 55: RuntimeError: Cache is locked after 5.05 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/k/kisner/.astropy/cache/download/py3/lock. Lock claims to be held by process 156638.

The error comes from healpy downloading the pixel weight files. Remember there are multiple groups of processes where rank zero in each group is attempting to download the same files. I guess this will require some hack to trigger the download on one process before multiple processes try to read the file.
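One way to arrange that is to let a single rank of the world communicator perform the fetch and broadcast the resulting path to everyone else. The sketch below is an assumption about how this could look, not code from toast or healpy: `fetch_on_root` is a hypothetical helper, `fetch` is any callable that triggers the download and returns a filename, and `SerialComm` is a single-process stand-in that mimics the `rank` attribute and `bcast` method of an mpi4py communicator so the example runs without MPI.

```python
class SerialComm:
    """Single-process stand-in for the subset of the mpi4py API used below."""

    rank = 0

    def bcast(self, obj, root=0):
        return obj


def fetch_on_root(comm, fetch, root=0):
    """Run `fetch` on one rank only and share its result with all ranks."""
    result, error = None, None
    if comm.rank == root:
        try:
            result = fetch()
        except Exception as exc:
            # Capture the failure so every rank can raise, instead of the
            # other ranks waiting forever in the broadcast.
            error = repr(exc)
    # The broadcast also synchronizes: non-root ranks wait here until
    # the root rank has finished fetching.
    result, error = comm.bcast((result, error), root=root)
    if error is not None:
        raise RuntimeError(f"fetch failed on rank {root}: {error}")
    return result


# Example with the stand-in communicator and a dummy fetch function.
path = fetch_on_root(SerialComm(), lambda: "/shared/cache/weights.fits")
print(path)
```

In a real run the communicator would be `MPI.COMM_WORLD` from mpi4py rather than a per-group communicator, since the rank-zero processes of different groups are the ones colliding here. The broadcast path is only useful if all ranks see the same filesystem, and the other ranks would need to open that path directly rather than going back through the cache, because reads take the lock as well.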

zonca commented 4 years ago

OK, can you open an issue on healpy with a minimal test case?


tskisner commented 4 years ago

I can, but this arises because two processes on different nodes are trying to download the same file to the same directory on a shared disk. Does healpy care about that? Or is the answer to pre-download all possible healpy data files?

zonca commented 4 years ago

It depends on how difficult it is to implement
