Issue running in parallel

LukeMondy commented 6 years ago

I was just trying to reproduce issue #65, but now I face a different issue.

I take one of these models: https://github.com/EarthByte/UW2-tests-and-benchmarks/blob/master/isostasy/Isostasy%205%20-%20weak%20centre%20with%20sediments%20-%20PressureBC.ipynb and export it to a python file.

When I run it inside underworld2_geodynamics (latest or dev) with this: python isostasy.py, it works fine.

When I run it with: mpirun -np 2 python isostasy.py # or any other number of processes I get:

mpirun -np 2 python pt2.py 
    Global element size: 25x25
    Local offset of rank 0: 0x0
    Local range of rank 0: 25x13
Linear solver (J0OZ9Q5U__system-execute) 

BSSCR -- Block Stokes Schur Compliment Reduction Solver 
AUGMENTED LAGRANGIAN K2 METHOD - Penalty = 0.000000

  Setting schur_pc to "uw" 

SCR Solver Summary:

  RHS V Solve:            = 0.002512 secs / 1 its
  Pressure Solve:         = 0.1921 secs / 81 its
  Final V Solve:          = 0.002317 secs / 1 its

  Total BSSCR Linear solve time: 0.226781 seconds

Linear solver (J0OZ9Q5U__system-execute), solution time 2.283564e-01 (secs)
loaded rc file /opt/UWGeodynamics/UWGeodynamics/uwgeo-data/uwgeodynamicsrc
An uncaught exception was encountered on processor 0.
RuntimeError: Failed to execute the callback function, please check if it's valid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pt2.py", line 184, in <module>
    Model.solve()
  File "/opt/UWGeodynamics/UWGeodynamics/_model.py", line 1470, in solve
    nonLinearTolerance=self._curTolerance)
  File "/opt/underworld2/underworld/timing.py", line 323, in timed
    return routine(*args, **kwargs)
  File "/opt/underworld2/underworld/systems/_bsscr.py", line 451, in solve
    libUnderworld.StgFEM.SystemLinearEquations_UpdateSolutionOntoNodes(self._stokesSLE._cself, None)
SystemError: <built-in function SystemLinearEquations_UpdateSolutionOntoNodes> returned a result with an error set
application called MPI_Abort(comm=0x84000004, 1) - process 0

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 130 RUNNING AT 7ff61ef05b11
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

If I change the model resolution to be higher, like 100x100, or 96x96, I see the same result.

LukeMondy commented 6 years ago

Removing the q2/dq1 elements has the same result. Removing the SLCN has the same result.

rbeucher commented 6 years ago

Weird. I will debug this later this week. I have been having issues with parallelism since moving to python3. There is a python2 image on Docker Hub if you want to try it. The error points to the callback function that is called just after the solver call. It could be related to your "custom" solver, though I would then expect it not to work in serial either.

rbeucher commented 6 years ago

Note that the dQ2/dQ1 combination is largely untested!!!

LukeMondy commented 6 years ago

Yeah, that's fine, I'm not using it in my main models.

LukeMondy commented 6 years ago

I agree that the python2 version seems to work much better in parallel - both in terms of speed-up and proper output. For example, the python3 version would not output the timesteps, or any print statement in my input script, until about 80 timesteps had passed - and then it dumped them all out at once.

It looks like the python2 version hasn't been updated to the latest master version on dockerhub, since it doesn't have the slade solver, for example. Would you be able to bump it?

arijitlaik commented 6 years ago

@LukeMondy I have seen the same problems with py3 in underworld as well: no print output until the model ends or crashes. It's also about 1.6 to 1.8 times slower than the py2 implementation, for underworld as well as UWGeo.

rbeucher commented 6 years ago

I fixed the print statement yesterday. As for the slowness, I still have to look into what's going on. Do you have an example?

rbeucher commented 6 years ago

So @arijitlaik, you said your underworld models are slower. So it's not just UWGeo?

Might be good to flag that on the underworld repo.

arijitlaik commented 6 years ago

Yes, uw models are slow in python3, not just UWGeo. I will do that soon, with a uw.timing example. I'm busy reading stuff for a while.

rbeucher commented 6 years ago

OK, so for the print statements: you need to explicitly flush them to the screen by passing flush=True to print() itself. That's Python 3 specific. You can also run the model with python -u, which disables output buffering.
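
A minimal sketch of the flush=True approach, assuming progress lines are printed from the input script. The helper name report_step and the message format are hypothetical and not part of UWGeodynamics:

```python
import io
import sys


def report_step(step, model_time, stream=sys.stdout):
    """Print one progress line and flush it right away.

    Hypothetical helper for illustration only. flush=True pushes the
    line out immediately instead of leaving it in Python 3's output
    buffer, which matters when stdout is a pipe (for example under mpirun).
    """
    line = "step {:d}, time {:.2f}".format(step, model_time)
    print(line, file=stream, flush=True)
    return line


if __name__ == "__main__":
    # Each line should appear as soon as it is printed.
    for i in range(3):
        report_step(i, 0.5 * i)
```

The python -u route needs no script changes, for example: mpirun -np 2 python -u isostasy.py. Setting the environment variable PYTHONUNBUFFERED=1 has the same effect.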

arijitlaik commented 6 years ago

Ya, figured that out.

arijitlaik commented 6 years ago

I am going to use the v27 and development docker images for timing tests and post the results.

underworldcode / UWGeodynamics

Issue running in parallel #69