gplepage / vegas

Adaptive multidimensional Monte Carlo integration.

Parallel evaluation fails depending on number of cores #5

Closed: daschaich closed this issue 8 years ago

daschaich commented 8 years ago

I've been wrestling with parallelizing a fairly straightforward computation. One strange problem that I have not been able to resolve is that my computation sometimes fails with a ValueError, apparently depending on the number of cores on which I run. I've reproduced this on a couple of different clusters, using mvapich, mvapich2 and openmpi with a couple of different Python 2.7.x builds. (I am only trying to run on multiple cores within single nodes on these clusters.)

Here is what I get from a simplified test (included below) using openmpi-1.6.5 and python-2.7.5. This output is especially interesting because (unlike mvapich[2]) openmpi complains about a fork() before spitting out the ValueError:

$ for i in 1 2 4 6 8 ; do echo -ne "$i " ; mpirun -np $i python par_test.py ; done
1 result = 0.241(43)    Q = 0.99
2 result = 0.241(43)    Q = 0.99
4 result = 0.241(43)    Q = 0.99
6 result = 0.241(43)    Q = 0.99
8 --------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          [redacted] (PID 30330)
  MPI_COMM_WORLD rank: 7

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "par_test.py", line 33, in <module>
    main()
  File "par_test.py", line 26, in main
    integ(fparallel, nitn=10, neval=100)             # Initial adaptation
  File "src/vegas/_vegas.pyx", line 1204, in random_batch (src/vegas/_vegas.c:20590)
  File "src/vegas/_vegas.pyx", line 1931, in vegas._vegas.VegasIntegrand.__init__.eval (src/vegas/_vegas.c:35513)
  File "src/vegas/_vegas.pyx", line 1998, in vegas._vegas._BatchIntegrand_from_Batch.__call__ (src/vegas/_vegas.c:37623)
  File "src/vegas/_vegas.pyx", line 2266, in vegas._vegas.MPIintegrand.__call__ (src/vegas/_vegas.c:41705)
  File "src/vegas/_vegas.pyx", line 2253, in vegas._vegas.MPIintegrand.__call__.fcn (src/vegas/_vegas.c:40768)
ValueError: total size of new array must be unchanged

At this point I'm out of ideas and just submitting this to see if anybody else might have any insight. Here is the test I constructed by simplifying the computation I'm interested in running:

#!/usr/bin/python
from __future__ import print_function
import numpy as np
import vegas

# Test parallel evaluation
# Global variables
twopi = 2.0 * np.pi
piSq = np.pi * np.pi

# Function to integrate
def f(p):
  num = np.zeros(p.shape[0], dtype = np.float)
  denom = np.zeros_like(num)
  for i in range(p.shape[0]):
    num[i] = np.cos(twopi * (p[i].sum()))
    denom[i] = ((np.sin(np.pi * p[i]))**2).sum()
  return piSq * num / denom

def main():
  integ = vegas.Integrator(4 * [[0.0, 1.0]])
  # Convert to MPI integrand
  fparallel = vegas.MPIintegrand(f)
  integ(fparallel, nitn=10, neval=100)             # Initial adaptation
  result = integ(fparallel, nitn=10, neval=1000)   # Actual estimation

  if fparallel.rank == 0:
    print("result = %s    Q = %.2f" % (result, result.Q))

if __name__ == '__main__':
  main()
gplepage commented 8 years ago

This is certainly a bug. I believe the problem is that the number of evaluations per iteration (neval) is too small for the number of processors you are specifying. I can reproduce your problem on my machine, but it goes away when I set neval=1e3 in both calls to integ(...) (not just the second call).

vegas should protect you from this problem and will in the next version. I rarely use neval values smaller than 1000 (usually much larger) and so hadn't noticed the problem before. I don't know exactly why it is failing where it does, but it happens with my test code as well when I use large numbers of processors (e.g., 32). Using larger neval values does seem to fix it (assuming you want and need larger neval). I will try to have a corrected version up within a week or two.
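
For concreteness, a minimal sketch of that workaround applied to the test script above (it reuses the integrand f and the integrator setup from that script unchanged; the only difference is neval=1000 in both calls, not just the second one):

import vegas

# f is the integrand defined in the test script above
integ = vegas.Integrator(4 * [[0.0, 1.0]])
fparallel = vegas.MPIintegrand(f)

# keep neval >= 1000 in *both* calls
integ(fparallel, nitn=10, neval=1000)            # initial adaptation
result = integ(fparallel, nitn=10, neval=1000)   # actual estimation

if fparallel.rank == 0:
  print("result = %s    Q = %.2f" % (result, result.Q))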

Thanks for reporting the problem.

daschaich commented 8 years ago

Thanks for the tip. I am indeed interested in larger neval values, for which serial running becomes painful. I initially encountered the ValueError when testing neval=1e4 (for the initial adaptation) on 32 cores, and have now confirmed that it is avoided when I increase that to neval=2.5e5.

daschaich commented 8 years ago

Quick follow-up: larger neval values haven't completely cured my full computation (as opposed to the simplified script above). Some further 32-core tests hit the same ValueError with neval=8.5e7 and 9.5e7, even while other tests with smaller neval=7.5e7 are running without a problem (so far).

gplepage commented 8 years ago

Thanks for reporting back. I think I know what the bug is and will post a repaired version in the next day or two. The error is such that it is impossible to predict ahead of time when it will arise --- it depends on the shape of the integrand, among other things.

gplepage commented 8 years ago

v3.1 should fix your problem. Let me know how it works out.
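
For anyone finding this issue later: a typical way to pick up the new release, assuming vegas was installed with pip, is something like

$ pip install --upgrade vegas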

daschaich commented 8 years ago

That seems to have done it -- after updating to v3.1 I ran about a dozen jobs overnight with no ValueErrors. Previously most of these computations encountered that problem, so I'm confident the issue has been resolved. I'll go ahead and close it with this comment. Thanks!