radical-cybertools / ExTASY


Module loading issues: Error in initializing MVAPICH2 ptmalloc library #87

Closed vivek-bala closed 9 years ago

vivek-bala commented 10 years ago

Inactive Modules:
  1) gromacs

The following have been reloaded with a version change:
  1) intel/13.0.2.146 => intel/14.0.1.106  2) mvapich2/1.9a2 => mvapich2/2.0b

The following have been reloaded with a version change:
  1) python/2.7.3-epd-7.3.2 => python/2.7.6

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
Traceback (most recent call last):
  File "/home1/02734/vivek91/lsdmap/bin/lsdmap", line 5, in <module>
    lsdm.LSDMap().run()
  File "/home1/02734/vivek91/.local/lib/python2.7/site-packages/lsdmap/lsdm.py", line 370, in run
    self.save_nneighbors(comm, args, DistanceMatrix, epsilon_thread)
  File "/home1/02734/vivek91/.local/lib/python2.7/site-packages/lsdmap/lsdm.py", line 239, in save_nneighbors
    neighbor_matrix = comm.recv(source=idx, tag=1000+idx)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:72032)
  File "pickled.pxi", line 250, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:29545)
  File "pickled.pxi", line 111, in mpi4py.MPI._p_Pickle.load (src/mpi4py.MPI.c:28058)
EOFError
[c463-103.stampede.tacc.utexas.edu:mpirun_rsh][signal_processor] Caught signal 15, killing job
jp43 commented 10 years ago

It seems to be an mpi4py issue that shows up when receiving large NumPy arrays, as described here: https://groups.google.com/forum/#!msg/mpi4py/OJG5eZ2f-Pg/EnhN06Ozg2oJ
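
For context on the `EOFError` at the bottom of the traceback: the lowercase `comm.recv` in mpi4py unpickles whatever bytes arrive (the `_p_Pickle.load` frame above), and Python's pickle raises `EOFError` when it is handed an empty buffer. The sketch below shows only that pickle behavior, using the standard library under Python 3. That the receive buffer actually arrived empty or truncated on Stampede is an assumption, and `describe_unpickle` is a hypothetical helper name, not part of LSDMap or mpi4py.

```python
import pickle


def describe_unpickle(buf):
    """Return "ok" if buf unpickles, else the name of the exception raised."""
    try:
        pickle.loads(buf)
    except Exception as exc:  # report the failure type instead of crashing
        return type(exc).__name__
    return "ok"


# Stand-in for a message that would be sent between ranks.
payload = pickle.dumps(list(range(1000)))

print(describe_unpickle(payload))                      # complete message
print(describe_unpickle(b""))                          # empty buffer -> EOFError
print(describe_unpickle(payload[:len(payload) // 2]))  # truncated message
```

A truncated (rather than empty) buffer may surface as `EOFError` or as `UnpicklingError`, depending on the pickle protocol and Python version.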

vivek-bala commented 10 years ago

Hmm, did you also get the same error on Stampede with the recent version?

jp43 commented 10 years ago

Update from the discussion we had today with @vivek-bala. I have been able to reproduce the same error as @vivek-bala with the new version. However, the GROMACS/LSDMap pattern seems to work fine with 16 cores instead of 64 (64 is the default value specified in the file stampede.rcfg). Vivek reached the same conclusion: the error comes up with 32 and 64 cores, but not with 16.

oleweidner commented 10 years ago

Thanks for the update. What are the next steps to address this issue?

oleweidner commented 10 years ago

Can you please try to reproduce this with a regular PBS script? We need to see whether the problem lies on the ExTASY / Radical-Pilot side, in LSDMap, or possibly on the Stampede end.
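
A standalone reproducer for such a batch script could look like the sketch below. It mirrors only the receive pattern visible in the traceback (`comm.recv(source=idx, tag=1000+idx)` on rank 0) and is not the LSDMap code. The names `gather_blocks`, `main` and `N_ITEMS`, the payload size, and the use of a plain list instead of a NumPy array are all assumptions made for illustration.

```python
N_ITEMS = 1000000  # hypothetical payload size; raise it to stress large messages


def gather_blocks(comm, payload):
    """Workers send their payload to rank 0; rank 0 returns the received list."""
    rank, size = comm.Get_rank(), comm.Get_size()
    if rank == 0:
        # Same pickle-based receive pattern as in the traceback.
        return [comm.recv(source=idx, tag=1000 + idx) for idx in range(1, size)]
    comm.send(payload, dest=0, tag=1000 + rank)
    return None


def main():
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; nothing to run")
        return
    comm = MPI.COMM_WORLD
    # A plain list of floats stands in for the real neighbor matrix.
    payload = [float(comm.Get_rank())] * N_ITEMS
    blocks = gather_blocks(comm, payload)
    if blocks is not None:
        print("rank 0 received %d blocks" % len(blocks))


if __name__ == "__main__":
    main()
```

Launching this under the site's MPI launcher with 16 ranks and then with 64 ranks, outside ExTASY and Radical-Pilot, would help show whether the failure depends on the rank count alone.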

oleweidner commented 10 years ago

Have you opened a ticket with TACC?

jp43 commented 10 years ago

Not that I am aware of. Maybe Vivek has opened one.

oleweidner commented 10 years ago

Could be related to #101

vivek-bala commented 9 years ago

This is fixed now with /opt/apps/intel14/mvapich2_2_0/python/2.7.6/lib/python2.7/site-packages/mpi4py/bin/python-mpi.

samcom12 commented 1 year ago

@vivek-bala Hello,

May I know how you fixed it? We are also getting the same error at our end.

Cheers, Samir Shaikh

andre-merzky commented 1 year ago

hi @samcom12 - that ticket is quite a blast from the past! :-P

If my memory serves correctly, the problem was resolved by TACC staff providing a new mpi4py deployment. I do not know what was changed in that deployment, and I doubt that knowledge can be recovered at this point.