rmjarvis / TreeCorr

Code for efficiently computing 2-point and 3-point correlation functions. For documentation, go to
http://rmjarvis.github.io/TreeCorr/

NNCorrelation error under MPI in 4.2.0 #127

Closed joezuntz closed 3 years ago

joezuntz commented 3 years ago

Hi Mike,

Sorry, another MPI-related error! My shear-shear and shear-position runs now work fine, but with position-position I get the error below, and only under MPI.

I'm sure this is ultimately user error again somewhere, but any advice is helpful. This is an error in an auto-correlation, so the area coverage isn't an issue, and as far as I can tell the patches are all fine. The same lens catalogs work okay in the shear-position correlation.

It looks like something is going wrong when unpickling an object sent via MPI, and then a second error happens when __del__ is called to clean up after the first, because the object isn't fully built.

Output on two processes below. I've stripped the repeated lines which are printed by both processes, for clarity. The exception only appears on the root process.

fname =  data/calibrated_shear_catalog.hdf5
nbins = 15, min,max sep = 2.5..100 arcmin, bin_size = 0.245925
Using split_method = mean
Using bin_slop = 0, b = 0
Finished building NNCorr
Reading input file data/calibrated_lens_catalog.hdf5
read ra
read dec
read w
Using w for wpos
Assigned patch numbers according 40 centers
   nobj = 28779
[ SNIP LOTS OF DOTS]
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/treecorr/nncorrelation.py", line 123, in __del__
    if self._corr is not None:
AttributeError: 'NNCorrelation' object has no attribute '_corr'
Traceback (most recent call last):
  File "nn_error.py", line 53, in <module>
    nn.process(cat, cat2, comm=comm)
  File "/usr/local/lib/python3.6/dist-packages/treecorr/nncorrelation.py", line 443, in process
    self._process_all_auto(cat1, metric, num_threads, comm, low_mem)
  File "/usr/local/lib/python3.6/dist-packages/treecorr/binnedcorr2.py", line 700, in _process_all_auto
    temp = comm.recv(source=p)
  File "mpi4py/MPI/Comm.pyx", line 1173, in mpi4py.MPI.Comm.recv
  File "mpi4py/MPI/msgpickle.pxi", line 302, in mpi4py.MPI.PyMPI_recv
  File "mpi4py/MPI/msgpickle.pxi", line 268, in mpi4py.MPI.PyMPI_recv_match
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
  File "/usr/local/lib/python3.6/dist-packages/treecorr/binnedcorr2.py", line 567, in __setstate__
    self.logger = setup_logger(get(self.config,'verbose',int,1),
AttributeError: 'NNCorrelation' object has no attribute 'config'
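
For readers following along, the two-stage failure in the traceback above can be reproduced in isolation with plain pickle. This is only a minimal sketch of the general pattern, not TreeCorr's actual code: the classes `Fragile` and `Robust` and their attributes are hypothetical stand-ins. It assumes the usual pickle behaviour that an object is rebuilt without calling `__init__`, so `__setstate__` starts from an empty instance.

```python
import pickle


class Fragile:
    """Hypothetical stand-in for a correlation object (not TreeCorr code)."""

    def __init__(self, config):
        self.config = config
        self._corr = object()  # stands in for a handle to C-level data

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop('_corr', None)  # the handle itself cannot be pickled
        return state

    def __setstate__(self, state):
        # Bug: pickle rebuilds the object without calling __init__, so
        # self.config does not exist until the saved state is restored.
        verbose = self.config.get('verbose', 1)
        self.__dict__.update(state)
        self._corr = None

    def __del__(self):
        # Second failure: if __setstate__ raised, _corr was never set,
        # so this lookup raises too (reported as "Exception ignored in").
        if self._corr is not None:
            pass  # a real class would release the handle here


class Robust(Fragile):
    """Same class with the two defensive changes applied."""

    def __setstate__(self, state):
        self.__dict__.update(state)  # restore saved attributes first
        self._corr = None
        verbose = self.config.get('verbose', 1)  # safe: config now exists

    def __del__(self):
        # getattr with a default tolerates a half-built object
        if getattr(self, '_corr', None) is not None:
            pass


def roundtrip(obj):
    """Pickle and unpickle obj, as mpi4py's send/recv do internally."""
    return pickle.loads(pickle.dumps(obj))
```

Calling `roundtrip(Fragile({'verbose': 1}))` raises an `AttributeError` about `config`, while `roundtrip(Robust({'verbose': 1}))` succeeds. Because mpi4py's lowercase `send` and `recv` pickle the objects they transfer, a plain round trip like this exercises the same code path without needing an MPI launch.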

I've put code that reproduces the error in /global/cfs/cdirs/lsst/groups/WL/users/zuntz/treecorr-issue/nn_error.py. It can be run with:

# Get an interactive node
salloc -N 1 -C haswell  -t 2:00:00 -q interactive -A m1727

cd /global/cfs/cdirs/lsst/groups/WL/users/zuntz/treecorr-issue

# Run under MPI.  This shifter image has 4.2.0 installed
srun -n 2 -c 8 shifter --env OMP_NUM_THREADS=8  --image joezuntz/txpipe python nn_error.py jackknife mpi

rmjarvis commented 3 years ago

Sorry. This was my error this time. I'll have a fix shortly. Just trying to figure out why my existing MPI tests didn't catch this bug.

rmjarvis commented 3 years ago

v4.2.1 is released with the fix.