thouska / spotpy

A Statistical Parameter Optimization Tool
https://spotpy.readthedocs.io/en/latest/
MIT License

Bug: sceua gets stuck with MPI after burn-in #226

Closed MuellerSeb closed 4 years ago

MuellerSeb commented 5 years ago

Hey there,

from spotpy 1.5.0 on, sceua optimization with MPI gets stuck after the burn-in phase. Here is a minimal example:

from spotpy.algorithms import sceua
from spotpy.examples.spot_setup_rosenbrock import spot_setup
setup = spot_setup("sceua")  # spot_setup() for spotpy 1.4.6
sampler = sceua(setup, parallel="mpi", dbname='db', dbformat="csv")
sampler.sample(repetitions=10000, ngs=4)

Running with

mpiexec -n 4 python3 test.py

Gives the following output:

Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Initializing the  Shuffled Complex Evolution (SCE-UA) algorithm  with  10000  repetitions
The objective function will be minimized
Starting burn-in sampling...
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
* Database file 'db.csv' created.
Burn-in sampling completed...
Starting Complex Evolution...
ComplexEvo loop #1 in progress...

And from there on, nothing more happens. With parallel="seq" it takes about 5 seconds to finish. Do you know what the problem could be?

I've got mpi4py 3.0.2 installed and I am using Python 3.6.8. With spotpy 1.4.6 everything works; from 1.5.0 on, the behavior described above occurs.

Cheers, Sebastian

MuellerSeb commented 5 years ago

After some bug tracking, I think the problem is in this line: https://github.com/thouska/spotpy/blob/269a5a7435f1e45d7ad90bb32d4ed9df89f77943/spotpy/parallel/mpi.py#L200

where self.comm.Iprobe(source=i+1, tag=tag.answer) never evaluates to true. Maybe it is related to this mpi4py mailing list thread: https://groups.google.com/forum/#!topic/mpi4py/RiK8Fhd3LIU

But I've run out of ideas at this point.
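
For reference, here is a minimal mpi4py sketch of the same master/worker polling pattern (not the actual spotpy code; the tag values and payloads are made up for illustration). If a worker never posts a matching send, the Iprobe call stays false and the master loops forever, which is exactly what the hang looks like:

# Minimal master/worker polling sketch with mpi4py (not the spotpy code;
# TAG_JOB/TAG_ANSWER and the payloads are made up for illustration).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
TAG_JOB, TAG_ANSWER = 1, 2

if rank == 0:
    # master: hand one job to every worker, then poll for answers
    for worker in range(1, size):
        comm.send(("job", worker), dest=worker, tag=TAG_JOB)
    pending = set(range(1, size))
    while pending:
        for worker in list(pending):
            # if the worker never posts a matching send, this stays False
            # forever and the master spins in this loop
            if comm.Iprobe(source=worker, tag=TAG_ANSWER):
                print(comm.recv(source=worker, tag=TAG_ANSWER))
                pending.discard(worker)
else:
    # worker: receive the job and answer the master
    job = comm.recv(source=0, tag=TAG_JOB)
    comm.send(("done", rank, job), dest=0, tag=TAG_ANSWER)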

philippkraft commented 5 years ago

Hi Sebastian, sorry for the long silence - vacation period. We "fixed" some SCE-UA bugs in the last version; I have to check the changes together with @thouska, who is still out of office. Could you check whether another sampler (e.g. ROPE or LHS) shows the same problem? Just to make sure the issue is in the SCE-UA implementation (which is tricky) and not a general parallel='mpi' problem. Something along these lines should do (a sketch based on your minimal example above):
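
# Sketch: same minimal example as above, but with the LHS sampler.
# Whether spot_setup() needs the objective-function argument depends on
# the spotpy version (see the note in the first comment).
from spotpy.algorithms import lhs
from spotpy.examples.spot_setup_rosenbrock import spot_setup

setup = spot_setup()
sampler = lhs(setup, parallel="mpi", dbname="db_lhs", dbformat="csv")
sampler.sample(repetitions=1000)

Run it again with mpiexec -n 4 and check whether it completes.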

MuellerSeb commented 5 years ago

@philippkraft: Thanks for the reply. I checked the FAST routine, which worked as expected.

MuellerSeb commented 5 years ago

Is there anything new on this topic? Cheers, Sebastian

thouska commented 5 years ago

Hi Sebastian, unfortunately there is not much news on this topic. At least I can confirm your error description. I am on it and will inform you here as soon as it is fixed. Sorry that it is taking so long... Based on your report, we are also working on testing the MPI implementation on Travis (#231), so that such errors can hopefully be avoided in the future.

thouska commented 5 years ago

Ok, now it should be fixed. The new design of the _RunStatistic class in _algorithm.py, introduced in spotpy version 1.5.0, was somehow not picklable under mpi4py. This caused the hang after the burn-in phase that you described. I removed the use of the _RunStatistic class while spotpy is running on the cpu slaves, which fixes the problem (at least in my MPI environment). The change might result in slightly longer runtimes at the end of the sampling (this will be fixed), but for now it is at least running again.
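
For future reference, a quick generic check (not spotpy-specific) for whether an object survives the pickle-based mpi4py send looks like this:

# Generic picklability check: the lowercase mpi4py send/recv calls pickle
# their payload, so anything that fails here cannot be shipped to the
# cpu slaves either.
import pickle

def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(is_picklable({"a": 1}))     # True
print(is_picklable(lambda x: x))  # False, lambdas cannot be pickled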

thouska commented 5 years ago

PS: If you want to test this, the corresponding new version (1.5.3) of spotpy is available on PyPI.

MuellerSeb commented 5 years ago

I installed spotpy 1.5.4 and now I am getting the following error:

  File "/usr/local/lib/python3.6/dist-packages/spotpy/__init__.py", line 41, in <module>
    from . import unittests
ImportError: cannot import name 'unittests'

The submodule unittests is missing from the package. This is due to this line in setup.py: https://github.com/thouska/spotpy/blob/0d550741d6d5e882e119e1c7ca140b4be8ffa644/setup.py#L16

You should use this instead:

packages=find_packages(exclude=["tests*", "docs*"])

with this on the first line:

from setuptools import setup, find_packages

After commenting out the from . import unittests line, it works for me now.
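
Put together, the relevant part of setup.py would then look roughly like this (just a sketch of the two changes, not the full file):

# Sketch of the two suggested changes in setup.py (not the full file)
from setuptools import setup, find_packages

setup(
    name="spotpy",
    # picks up the spotpy package but leaves top-level tests/ and docs/ out
    packages=find_packages(exclude=["tests*", "docs*"]),
)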

MuellerSeb commented 5 years ago

Maybe you could move the unittests folder to a top-level folder named tests, as mentioned in the exclude pattern above, which is the common approach. Then you would have to adapt the .travis.yml file accordingly. I don't think the unit tests need to be in the package when there is already a separate examples folder.

hpsone commented 5 years ago

I had similar problems, but I just saw that @thouska updated the package. I have not tested the newest version yet. :D I will do it now. :D

thouska commented 5 years ago

Many thanks @MuellerSeb for directly testing everything and reporting in such detail how to fix the new problems. As you recommended, I removed the unittests import, renamed the unittests folder to tests and moved it to the top level. I like the new structure and think it makes total sense. As @hpsone found out faster than I could reply to this issue: there is a new version on PyPI containing the fix.

hpsone commented 5 years ago

Sorry for my rushed comment; I meant to say that I had not tested it yet. Now I have tested it, and it is not working for me. Maybe it is a mistake in my model, but my MPI setup works properly, as I tested it with Telemac2d. What could the possible error be? Anyway, @thouska, thank you very much for your help. Best regards, Htun

MuellerSeb commented 4 years ago

@hpsone: you probably need to give some details about your problem to get an answer.

hpsone commented 4 years ago

@MuellerSeb Thank you so much. I am not quite sure what the error is, but I ran with "mpc" instead of "mpi" and it worked. Anyway, I will try again; it is probably due to my insufficient knowledge.

thouska commented 4 years ago

I guess this issue is solved; if not, feel free to reopen.