thouska / spotpy

A Statistical Parameter Optimization Tool
https://spotpy.readthedocs.io/en/latest/
MIT License

mpi and sce #202

Closed MuellerSeb closed 5 years ago

MuellerSeb commented 5 years ago

Hey there, when I run the sce algorithm in parallel, the burn-in phase works fine, but afterwards something seems to go wrong with the parallel processes. The estimated time goes up, and the run numbers start to repeat and are sometimes not in increasing order:

Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
Starting the SCE-UA algorithm with 10000 repetitions...
burn-in sampling started...
Initialize database...
* Database file 'all_hh/2019-02-21_18-14-50_stat2D_db.csv' created.
13 of 10000 (best like=-23.6045) est. time remaining: 00:24:00
30 of 10000 (best like=-23.6045) est. time remaining: 00:22:26
48 of 10000 (best like=-22.1554) est. time remaining: 00:21:14
66 of 10000 (best like=-21.6666) est. time remaining: 00:20:35
84 of 10000 (best like=-21.6666) est. time remaining: 00:20:06
103 of 10000 (best like=-18.4106) est. time remaining: 00:19:48
115 of 10000 (best like=-18.4106) est. time remaining: 00:20:46
127 of 10000 (best like=-18.4106) est. time remaining: 00:21:31
141 of 10000 (best like=-18.4106) est. time remaining: 00:21:50
154 of 10000 (best like=-18.4106) est. time remaining: 00:22:12
173 of 10000 (best like=-18.4106) est. time remaining: 00:21:49
burn-in sampling completed...
ComplexEvo started...
ComplexEvo loop #1 in progress...
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:29
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:35
183 of 10000 (best like=-18.4106) est. time remaining: 00:22:36
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:16
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:26
187 of 10000 (best like=-18.4106) est. time remaining: 00:24:27
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:31
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:34
189 of 10000 (best like=-18.4106) est. time remaining: 00:26:36
181 of 10000 (best like=-18.4106) est. time remaining: 00:29:21
192 of 10000 (best like=-18.4106) est. time remaining: 00:27:54
192 of 10000 (best like=-18.4106) est. time remaining: 00:29:29
208 of 10000 (best like=-18.4106) est. time remaining: 00:27:18
202 of 10000 (best like=-18.4106) est. time remaining: 00:28:22
195 of 10000 (best like=-18.4106) est. time remaining: 00:30:49
211 of 10000 (best like=-18.4106) est. time remaining: 00:28:30
206 of 10000 (best like=-18.4106) est. time remaining: 00:29:33
215 of 10000 (best like=-18.4106) est. time remaining: 00:29:29
200 of 10000 (best like=-18.4106) est. time remaining: 00:31:58
...

Do you have a guess as to what's going wrong? Thanks in advance!

thouska commented 5 years ago

Hi Sebastian, thanks for your message. This behavior is indeed a bit strange. However, I am pretty sure it is “just” a parallel printing issue. Still, it needs to be solved. As I recently changed some minor things in the parallelization of sce-ua, I need to make sure that we are using the same version. I just uploaded a new version to PyPI (1.4.5). Would you be so kind as to test this one again? If the bug persists in this version, I will look into it more closely.

MuellerSeb commented 5 years ago

I just updated spotpy, but the behavior is the same. Also, the estimated time rises to the same value as with sequential optimization. Again: the burn-in phase works fine. From ComplexEvo loop #1 on, the repetition counter becomes unsorted and the estimated time rises to the sequential time estimate. I don't know what I could have done wrong on my side.

thouska commented 5 years ago

Perfect, thanks for testing this again. I found some lines in the sceua code where the slaves had access to the status of algorithm.py during parallel computing and complex evolution. As the slaves can run at different speeds, this might have mixed up the tracked repetitions and the time shown in the printed message. I think this should be fixed now.
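The idea of letting only the master process own the progress status can be sketched roughly as follows. This is a minimal illustration, not spotpy's actual code: the `Status` class, its `update` method, and the `collect` helper are hypothetical names, and the likelihood values are made up.

```python
class Status:
    """Hypothetical progress tracker; not spotpy's real status object."""

    def __init__(self):
        self.rep = 0                 # number of results counted so far
        self.best = float("-inf")    # best likelihood seen so far

    def update(self, like):
        """Count one finished run and remember the best likelihood."""
        self.rep += 1
        self.best = max(self.best, like)
        return self.rep


def collect(results):
    """Master-side loop: count each returned result exactly once,
    in the order the results arrive from the workers."""
    status = Status()
    return [(status.update(like), status.best) for like in results]


if __name__ == "__main__":
    # Made-up likelihoods, in the order the workers happened to return them.
    for rep, best in collect([-23.6, -22.2, -25.0]):
        print(f"{rep} of 3 (best like={best})")
```

Because the counter is incremented in one place only, it cannot repeat or jump backwards, no matter how fast the individual workers are.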

p-lauer commented 5 years ago

I would assume that the unsorted repetition counter comes from how mpi is implemented. The like values are printed in the mpi-loop, so if a process with a higher repetition count finishes before one with a lower repetition count, it is printed to the screen first.
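The effect described here can be reproduced without MPI. The sketch below uses a thread pool from the Python standard library as a stand-in for the MPI processes. The runtimes in `RUNTIMES` are assumed values chosen for illustration, and `simulate` and `completion_order` are hypothetical helpers, not part of spotpy.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Assumed runtimes in seconds per repetition number (illustrative only).
RUNTIMES = {1: 0.30, 2: 0.05, 3: 0.15}


def simulate(rep):
    """Stand-in for one model run: sleep for the assumed runtime."""
    time.sleep(RUNTIMES[rep])
    return rep


def completion_order():
    """Return the repetition numbers in the order their results arrive."""
    with ThreadPoolExecutor(max_workers=len(RUNTIMES)) as pool:
        # Repetitions are submitted in increasing order ...
        futures = [pool.submit(simulate, rep) for rep in sorted(RUNTIMES)]
        # ... but as_completed yields them in the order they finish.
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    for rep in completion_order():
        print(f"{rep} of {len(RUNTIMES)}")
```

If each result is printed as soon as it arrives, the repetition numbers appear in completion order rather than submission order, which would match the unsorted counter in the log above.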

MuellerSeb commented 5 years ago

I think this is clarified. But a new issue came up after version 1.5.0: https://github.com/thouska/spotpy/issues/226