Sorry, another MPI-related error! My shear-shear and shear-position runs now work fine, but under MPI only I get this error with position-position.
I'm sure this is ultimately user error again somewhere, but any advice would be helpful. This is an error in an auto-correlation, so the area coverage isn't an issue, and as far as I can tell the patches are all fine. The same lens catalogs work okay in the shear-position correlation.
It looks like something is going wrong when unpickling an object sent via MPI, and then a second error happens when __del__ is called to clean up after the first, because the object isn't fully built.
Output on two processes below. I've stripped the repeated lines which are printed by both processes, for clarity. The exception only appears on the root process.
fname = data/calibrated_shear_catalog.hdf5
nbins = 15, min,max sep = 2.5..100 arcmin, bin_size = 0.245925
Using split_method = mean
Using bin_slop = 0, b = 0
Finished building NNCorr
Reading input file data/calibrated_lens_catalog.hdf5
read ra
read dec
read w
Using w for wpos
Assigned patch numbers according 40 centers
nobj = 28779
[ SNIP LOTS OF DOTS]
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/treecorr/nncorrelation.py", line 123, in __del__
    if self._corr is not None:
AttributeError: 'NNCorrelation' object has no attribute '_corr'
Traceback (most recent call last):
  File "nn_error.py", line 53, in <module>
    nn.process(cat, cat2, comm=comm)
  File "/usr/local/lib/python3.6/dist-packages/treecorr/nncorrelation.py", line 443, in process
    self._process_all_auto(cat1, metric, num_threads, comm, low_mem)
  File "/usr/local/lib/python3.6/dist-packages/treecorr/binnedcorr2.py", line 700, in _process_all_auto
    temp = comm.recv(source=p)
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 302, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 268, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
  File "/usr/local/lib/python3.6/dist-packages/treecorr/binnedcorr2.py", line 567, in __setstate__
    self.logger = setup_logger(get(self.config,'verbose',int,1),
AttributeError: 'NNCorrelation' object has no attribute 'config'
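For illustration, the failure mode that the traceback suggests can be sketched without MPI or TreeCorr. This is only a minimal sketch under assumptions: `FragileCorr`, `roundtrip`, and the omission of `config` from the pickled state are hypothetical and are not taken from TreeCorr's source. It relies only on the fact that mpi4py's lowercase `send`/`recv` pickle Python objects, and that unpickling creates the instance without calling `__init__` before handing the state to `__setstate__`.

```python
import pickle


class FragileCorr:
    """Hypothetical stand-in for a correlation object; not TreeCorr's code."""

    def __init__(self, config, nbins):
        self.config = config
        self.nbins = nbins
        self._corr = None  # placeholder for a handle that cannot be pickled

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_corr', None)   # the handle is rebuilt after unpickling
        state.pop('config', None)  # assumed bug, for illustration only
        return state

    def __setstate__(self, state):
        # pickle created this instance with __new__, so __init__ never ran.
        self.__dict__.update(state)
        self.verbose = self.config.get('verbose', 1)  # raises: no 'config'
        self._corr = None  # never reached when the line above fails

    def __del__(self):
        # Defensive lookup: a bare self._corr would raise a second
        # AttributeError when a half-built object is destroyed.
        if getattr(self, '_corr', None) is not None:
            pass  # release resources here


def roundtrip(obj):
    """Pickle and unpickle obj in-process, with no MPI involved."""
    return pickle.loads(pickle.dumps(obj))


try:
    roundtrip(FragileCorr({'verbose': 2}, nbins=15))
    error = None
except AttributeError as exc:
    error = str(exc)

print(error)
```

If the real problem has this shape, an in-process pickle round trip of the correlation object on a single process might reproduce the first error without needing an MPI launch at all.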
I've put code that reproduces the error in /global/cfs/cdirs/lsst/groups/WL/users/zuntz/treecorr-issue/nn_error.py. It can be run with:
# Get an interactive node
salloc -N 1 -C haswell -t 2:00:00 -q interactive -A m1727
cd /global/cfs/cdirs/lsst/groups/WL/users/zuntz/treecorr-issue
# Run under MPI. This shifter image has 4.2.0 installed
srun -n 2 -c 8 shifter --env OMP_NUM_THREADS=8 --image joezuntz/txpipe python nn_error.py jackknife mpi