thouska / spotpy

A Statistical Parameter Optimization Tool
https://spotpy.readthedocs.io/en/latest/
MIT License

correct way to spawn a subprocess when running SCE-UA using multiprocess mpi #222

Closed: iacopoff closed this issue 4 years ago

iacopoff commented 5 years ago

Hi, I am calibrating a hydrological model (the VIC model) using SCE-UA.

Below is a short description of how the model is run from the calibration.py script: within the spotpy_setup class, under the simulation method, I call a model class which then uses the subprocess.run() function to run the VIC executable.
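A minimal sketch of that structure (the class names, executable path, and output reader are illustrative, not my actual code):

import subprocess

class VicModel(object):
    def run_vic(self, config):
        # shell out to the VIC executable; this child process is the fork
        # that the MPI warning below complains about
        subprocess.run(["./vic_executable", "-g", config], check=True)

class spotpy_setup(object):
    def __init__(self):
        self.vicmodel = VicModel()

    def simulation(self, vector):
        self.vicmodel.run_vic("global_param.txt")
        return read_vic_output()  # hypothetical reader of the VIC result files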

This works fine when I am running the script on a single core.

However, when I try to run the sceua algorithm in parallel (argument parallel="mpi"), writing on the terminal mpiexec -np python calibration.py, I get the following error:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the "fork()" system call to create a child process. Open MPI is currently operating in a condition that could result in memory corruption or other system errors; your MPI job may hang, crash, or produce silent data corruption. The use of fork() (or system() or other calls that create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: mgmt01 (PID 23174) MPI_COMM_WORLD rank: 1

If you are absolutely sure that your application will successfully and correctly survive a call to fork(), you may disable this warning by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

I reckon this is because I am using subprocess.run().

Has anyone encountered the same issue and would like to share a solution? Maybe using mpi4py itself? (I am not familiar with multiprocessing, but I'm about to dive into it.)

thanks!

iacopo

iacopoff commented 5 years ago

Update: Using openmpi 3 (I was using an older version) does avoid the error about forking; however, it hangs indefinitely without doing much (CPUs barely used, ~3-4%) and does not run the model.

cheers,

iacopo

philippkraft commented 5 years ago

Hi iacopo,

Obviously we are using os.system, which, according to your error message, should have the same problem:

https://github.com/thouska/spotpy/blob/300d58be0df662dfa8beda16594a95925eebdf35/spotpy/examples/spot_setup_hymod_exe.py#L73

But we have never seen this error before. The error must stem from mpi4py, because it is not programmed by us but knows about Python, so using mpi4py without spotpy will not solve the problem. After your update to openmpi 3, have you recompiled mpi4py?

For debugging, I would suggest putting print (or other logging) statements into the simulation function around the os.fork call, so that you can see what is happening, or at least where it hangs.
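A minimal sketch of that idea (the rank lookup and the VIC command line are placeholders from this thread, not tested code):

import os
import subprocess

rank = os.environ.get("OMPI_COMM_WORLD_RANK", "seq")
print("rank %s: about to launch VIC" % rank, flush=True)
result = subprocess.run(["./vic_executable", "-g", "global_param.txt"])  # placeholder command
print("rank %s: VIC finished with return code %s" % (rank, result.returncode), flush=True)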

philippkraft commented 5 years ago

An alternative (not tested) would be to use the Python driver of VIC. It uses VIC as a CPython extension, which is then executed in the same process. I am sure that approach would make it possible to optimize I/O by loading the driver data at model-wrapper initialization instead of on each single model run (see https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_python.py).

For HBVlight we achieved > 10 times faster execution with this optimization.
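A minimal sketch of that optimization; load_driver_data and run_model are hypothetical placeholders for the driver's actual API:

class spot_setup(object):
    def __init__(self):
        # expensive I/O happens exactly once, at wrapper initialization
        self.driver_data = load_driver_data()

    def simulation(self, vector):
        # every model run reuses the cached driver data instead of re-reading it
        return run_model(self.driver_data, vector)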

iacopoff commented 5 years ago

Philipp, thanks for your answers. Yes, I have recompiled it, but there is still no improvement.

I agree with you that the culprit here is either mpi4py or, more probably, VIC. I might test the VIC Python driver, although it is still in the development phase!

I will post updates here; I think it could be useful for others who are using VIC 5.0.1!

iacopo

philippkraft commented 5 years ago

Hi iacopo,

I think it would be great to find the source of the problem:

a) in mpi4py
b) in the way you are using subprocess.run (my bet)
c) in VIC

After a close look at our parallel hymod example, I doubt we ever tested it with MPI.

@thouska: can you share your code for starting ldndc on the HPC? Do you also use os.system?

Perhaps he can share that part of his model wrapper, since it works in parallel with mpi (OpenMPI 3.0.0 in our case).

iacopoff commented 5 years ago

Regarding point b), I have tried os.system(), but without success. Also, after the subprocess.run() call that runs VIC, I use another subprocess.run() to call the routing model (a separate executable from VIC), and that one runs without problems.

Therefore I am not sure what the issue is here, but it seems VIC is causing the problem. I have also recompiled VIC 5.0.1 with OpenMPI 3.0.0 (even though I am not using VIC's built-in parallelisation), but no improvement.

I was wondering: does the SCE-UA algorithm run with multi-processing modules other than OpenMPI? At the moment I am not planning to run VIC on a cluster...
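For reference, spotpy also ships a multiprocessing backend that needs no MPI at all. A minimal sketch, assuming my spotpy_setup class and that the installed spotpy version supports parallel='mpc' for sceua:

import spotpy

# parallel='mpc' uses Python multiprocessing instead of OpenMPI,
# so the script is started without mpiexec
sampler = spotpy.algorithms.sceua(spotpy_setup(), dbname="vic_sceua",
                                  dbformat="csv", parallel="mpc")
sampler.sample(1000, ngs=10)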

thanks

iacopoff commented 5 years ago

I am thinking that the problem could be multiple instances of VIC reading and writing from/to the same netCDFs!

thouska commented 5 years ago

Hi @iacopoff, ok, that sounds like a very likely source for this error. Of course, under MPI it needs to be guaranteed that every cpu-core uses different files. Is it possible to tell VIC to use different netCDFs? This would be the most efficient way. Otherwise, you might want to use different copies of VIC and access a different folder from each cpu. I provide one example where this copy is made for each run, which does the trick but is a bit hard-drive demanding. A better solution would be to make n copies of your VIC model, where n is the number of CPU cores.

iacopoff commented 5 years ago

Hi, I have tried again following your example:

Under the simulation method I am creating a folder for each CPU (if I interpreted your example correctly), and then I am copying all the input files into the new folder.

import os
from shutil import copytree, ignore_patterns

def simulation(self, vector):
    if self.parallel == 'mpi':
        call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)
        parall_dir = self.DirFolder + "/run_" + call
        copytree(self.DirFolder, parall_dir,
                 ignore=ignore_patterns('*.csv', 'tests', '*.py'))

Then I change the working directory, rename the files (they get a new reference given by the "call" and "parall_dir" arguments), update the config file with the correct references to the new files, and run self.vicmodel.run_vic():

os.chdir(parall_dir)

# ... rename files ...

simulations = self.vicmodel.run_vic(dir_path=parall_dir, global_file=globalFile_new, par_file=paramFile_new,
                                    binfilt=vector[0], Ws=vector[1], Ds=vector[2], c=vector[3],
                                    soil_2d=vector[4], soil_3d=vector[5])

Within self.vicmodel.run_vic() (which should now execute inside the new folder owned by a CPU) I am:

Does that sound correct? I have run the code above from the terminal using 2 CPUs:

mpiexec -n 2 python vic_calibration_multi.py

Apart from the same error about forking, I get only 1 new folder, and within it the input files are correctly renamed and the config file has the correct references. However, the process just hangs there. So why only 1 folder?

if I run with 4 CPUs:

mpiexec -n 4 python vic_calibration_multi.py

I get new folders within other folders.

Do you have an idea of what is going on? I am about to give up :)

thanks!

iacopo

thouska commented 5 years ago

Hi @iacopoff, if you use parallel computing, one cpu will be selected to handle all the result collection, writing of spotpy files, etc. (the so-called master). All other available cpus will be used to run whatever happens inside of "def simulation"; these cpus are the slaves. So in your case with n=2 cpus you will get only one folder; that makes sense so far. That you get "new folders within other folders" indicates that something is probably messed up in the use of os.chdir(...). You need to make sure that you always return to your working directory. This is done in the example by os.chdir(self.hymod_path+call) and os.chdir(self.curdir).
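A minimal sketch of that pattern, built from your snippets above (the overall structure is an assumption, not tested code):

import os
from shutil import copytree, ignore_patterns

def simulation(self, vector):
    curdir = os.getcwd()        # remember the starting directory
    run_dir = self.DirFolder    # default: run in place (sequential mode)
    try:
        if self.parallel == 'mpi':
            call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)
            run_dir = os.path.join(self.DirFolder, "run_" + call)
            if not os.path.isdir(run_dir):   # copy the inputs only once per rank
                copytree(self.DirFolder, run_dir,
                         ignore=ignore_patterns('*.csv', 'tests', '*.py'))
        os.chdir(run_dir)
        simulations = self.vicmodel.run_vic(dir_path=run_dir)
    finally:
        os.chdir(curdir)  # always return, otherwise the next copytree nests run_ folders
    return simulations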

iacopoff commented 5 years ago

Thanks @thouska and @philippkraft for your help. I think I have almost resolved the issue: the IS team had wrongly compiled the various libraries on which VIC depends (netcdf, hdf5, openmpi...). I have recompiled everything, although now I have to switch to other projects for some time; when I come back to this I will post any solution I have found here. Thanks

iacopo

iacopoff commented 5 years ago

Hi, OK VIC is running now.

I have found that OpenMPI has issues when Python calls a fork via os.system or subprocess: https://www.open-mpi.org/faq/?category=tuning#fork-warning https://opus.nci.org.au/pages/viewpage.action?pageId=13141354

and this is the reference to the workaround I have applied:

https://github.com/open-mpi/ompi/issues/3158
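For reference, the workaround discussed there amounts to disabling vader's single-copy mechanism before launching, roughly like this (verify the exact parameter against the linked issue and your OpenMPI version):

export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpiexec -np 3 python vic_cal_spotpy_parallel.py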

iacopoff commented 5 years ago

I have another issue, unfortunately.

It seems that the master is not always aware of whether a process in a worker has finished, so it starts doing its work before it receives a message from a worker that has completed its job.

I have added a printout from a calibration test run with these specs:

Basically, it prints when a worker or the master is doing something (initializing the SPOTPY setup class, generating parameters, reading observations, simulating, and calculating the objective functions), with a reference to the process rank (so 0 = master, > 0 = workers). My comments are marked with #.

mpiexec -np 3 python vic_cal_spotpy_parallel.py

** START CALIBRATION
** START CALIBRATION
** START CALIBRATION
i am INITIALISING spotpy setup in process rank: 2
i am GENERATING PARAMETERS in process rank: 2
i am INITIALISING spotpy setup in process rank: 0
i am GENERATING PARAMETERS in process rank: 0
i am reading OBSERVATION in process rank: 2
i am reading OBSERVATION in process rank: 0
i am INITIALISING spotpy setup in process rank: 1
i am GENERATING PARAMETERS in process rank: 1
i am reading OBSERVATION in process rank: 1
----- finished reading obervation in process rank: 0 observation file read?: True
----- finished reading obervation in process rank: 2 observation file read?: True
----- finished reading obervation in process rank: 1 observation file read?: True
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
i am GENERATING PARAMETERS in process rank: 0
Starting burn-in sampling...
i am RUNNING SIMULATION in process rank: 1   # here it starts the simulation in worker 1
i am RUNNING SIMULATION in process rank: 2   # here it starts the simulation in worker 2
executing VIC model...
executing VIC model...
executing routing...
executing routing...
----- finished simulation in process rank: 1, simulation file produced?: True   # here it finishes the simulation in worker 1
calc OBJECT FUNCTION in process rank: 0, simulation produced?: True observation produced?: True   # and here the master is calculating the obj function from the results of worker 1, everything is fine
obj function result: -0.28935350204454646
1 of 100, minimal objective function=-0.289354, time remaining: 00:16:10
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
Database file '/projects/mas1261/wp3/VIC/ba_bsn_025/run_test1/cal_kge_log_multiprocess.csv' created.
i am RUNNING SIMULATION in process rank: 1
calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True   # here the master is trying to calculate the obj function for worker 2, which has not finished yet

Traceback (most recent call last):
  File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 412, in getfitness
    return self.setup.objectivefunction(evaluation=self.evaluation, simulation=simulation, params = (params,self.parnames))
TypeError: objectivefunction() got an unexpected keyword argument 'params'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 265, in
    startTime,endTime,timeStep,params,weights,cal_output_name,rep=n_of_runs)
  File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 232, in calibrate
    sampler.sample(rep, ngs=10,kstop=100,pcento=0.0001,peps=0.0001)
  File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/sceua.py", line 189, in sample
    like = self.postprocessing(icall, randompar, simulations,chains=0)
  File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 390, in postprocessing
    like = self.getfitness(simulation=simulation, params=params)
  File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 416, in getfitness
    return self.setup.objectivefunction(evaluation=self.evaluation, simulation=simulation)
  File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 213, in objectivefunction
    objectivefunction = -spotpy.objectivefunctions.kge(np.log(evaluation+1), np.log(simulation+1))
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

----- finished simulation in process rank: 2, simulation file produced?: True   # here worker 2 has finished the simulation, too late!

I hope you can read the text and that what I have reproduced here, and the issue I am encountering, make sense.

The master is looking for the simulation output from worker 2 before it has finished. In fact, the master fails because it reads a NoneType.

Do you have any idea of what could cause that?

thanks!

thouska commented 5 years ago

Hi @iacopoff, thank you for your detailed error description. We will have a closer look at the mpi implementation of sce-ua in the upcoming days to check for any new potential errors. Meanwhile, you might want to check two things:

1) Start another sampler (e.g. Monte Carlo) to test whether your problem persists (see the sketch after point 2). In this sampler the mpi implementation is much simpler, i.e. there are fewer potential sources of error and they are easier to solve.

2) On the cluster computer I am using, printed messages do not always appear in the order in which they were produced. Therefore you might get the message:

calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True # here the master is trying to calculate the obj function for worker 2 that has not finished yet.

before

----- finished simulation in process rank: 2, simulation file produced?: True # here the worker 2 has finished the simulation, too late!

You can check this by including e.g. something like this:

import time

# Wait for 5 seconds
time.sleep(5)
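A minimal sketch of point 1), assuming your spotpy_setup class is passed unchanged to the Monte Carlo sampler:

import spotpy

# same setup class as for sceua, only the sampler changes
sampler = spotpy.algorithms.mc(spotpy_setup(), dbname="vic_mc_test",
                               dbformat="csv", parallel="mpi")
sampler.sample(100)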

PS: Issue #226 might or might not be linked to this issue.

iacopoff commented 5 years ago

Hi @thouska, I understand that printing to the screen is not a great diagnostic tool for inferring the temporal execution of the underlying code when using multiple processes. However, when the master process hits the objectivefunction method, which is what the text below represents:

calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True # here the master is trying to calculate the obj function for worker 2 that has not finished yet

I guess it expects an array of simulation results from the simulation method of worker 2, which it does not find, and therefore it crashes.

If it were only a problem of printing delay on the screen, it would not crash. Or am I totally wrong?

I have tried with MC but I got the same error. Also, time.sleep() does not change the sequence of the messages printed on screen.

thanks a lot for your time!

thouska commented 5 years ago

Have you tested our example script on your cluster computer? If it runs, the problem is located in the starting routine of the VIC model; if it does not run, it would be a general MPI problem.

iacopoff commented 5 years ago

Hi, thanks. What kind of executable is HYMODsilent.exe? I cannot execute it... I mean, are you using Windows? Thanks, iacopo

thouska commented 5 years ago

Hi @iacopoff, sorry for the late response; indeed, unfortunately our Hymod executable is only for Windows. We are now working on a unix standalone version in #230. I will keep you updated as soon as it is finished.

iacopoff commented 5 years ago

Hi @thouska, any progress with this? Should I compile it myself?

thanks

thouska commented 5 years ago

Hi @iacopoff, yes, @bees4ever was working on this to provide a hymod version for unix. I just had no time to review it (#231). I have done this now and tested it on my local unix machine, where it worked perfectly using: mpirun -c 4 python3 tutorial_parallel_computing_hymod.py. The corresponding spotpy_setup class can be found in spotpy/examples/spot_setup_hymod_unix.py, which uses a standalone version of hymod under spotpy/examples/hymod_unix. I merged pull request #231 and uploaded the new corresponding spotpy version (1.5.6) to pypi.

Would you like to test whether this works on your machine and whether your reported issue remains? Then we would have a good basis to search for the cause of this issue.

iacopoff commented 5 years ago

Hi @thouska , thanks a lot!

I have tested it today and it did not work with either hymod_3.6 or hymod_3.7:

./hymod_3.6: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

./hymod_3.7: /lib64/libc.so.6: version `GLIBC_2.26' not found (required by ./hymod_3.7)

I have then compiled hymod myself (thanks for the *.sh file!) and this is the terminal output:

Hymod reads in from file called 'Param.in'
NameError: name '__file__' is not defined

I have seen that in hymod.pyx there is a __file__ call... maybe it is related to this: https://stackoverflow.com/questions/51733318/cython-cannot-find-file

The important thing is that I have not gotten OpenMPI errors of any sort so far.

thanks

bees4ever commented 5 years ago

Hi @iacopoff,

nice, the *.sh runs. 😄. However, did the ./hymod call produce a Q.out file?

The other question is: which Linux do you use? Maybe I can try to reproduce the error. Otherwise, I can try to work around the need for the __file__ constant.

Regards

iacopoff commented 5 years ago

Hi @bees4ever, thanks.

It does not actually produce the Q.out file.

Was it working for you with the __file__ constant? If you don't have time, I am happy to try myself; maybe you already have some hints?!

Some info on the Linux OS I am using:

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

uname -a

Linux mgmt01 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Model name: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.005
BogoMIPS: 4788.01
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11

cheers!

bees4ever commented 5 years ago

Hi @iacopoff,

Yes, for me it worked with the __file__ constant. I have now worked around it so it is no longer needed. If you have some time, you could apply the patch https://patch-diff.githubusercontent.com/raw/thouska/spotpy/pull/235.patch and test whether you can now successfully build the hymod exe...

Thanks

bees4ever commented 5 years ago

@iacopoff I just had the idea to provide a zip here, so you can test much more easily: hymod_cython.zip

iacopoff commented 5 years ago

@bees4ever, thanks, everything worked well!

Also, VIC now runs fine on many cores and the results look good, even if I still get the error about forking. I guess at this point I should just ignore it!

There is one point that might be worth noting: in the script https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_unix.py I had to move the line call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2) into the spotpy __init__ method. Otherwise it was not working.

thanks for your patience and your help!

iacopo

bees4ever commented 5 years ago

@iacopoff thanks for your feedback. About this moved line, @thouska may be able to look into it...

thouska commented 5 years ago

Hi @iacopoff, do you need the cpu_id (named call) that is set in the line call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)? If so, you should not move it to __init__, as this is called only once at the beginning of the sampling. The idea was to use this line to get an individual output filename, or to generate individual model folders (one for each cpu core). If this is not done, different cpu cores might/will access/write the same file on the hard drive, which will mess up the results. If you solve this problem somehow differently, you can ignore the call line. However, for the Hymod_exe example it is definitely needed. @bees4ever: how did you solve this in the unix variant; is the model still writing a Q.out file? If so, we still need the call line.

bees4ever commented 5 years ago

@thouska yes, the hymod exe for unix is still writing a Q.out file.

thouska commented 5 years ago

@bees4ever thanks for the quick answer! @iacopoff this means that you need this call line in the def simulation part. What kind of error message do you get there? However, as mentioned above, you can also get individual model folders another way, for example with a very exact time stamp as a string or a random number from a high range (see the sketch below).
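A minimal sketch of that alternative (the helper name is made up for illustration):

import os
import random
import time

def unique_run_dir(base):
    # a millisecond timestamp plus a random suffix makes collisions very unlikely
    tag = "%d_%06d" % (int(time.time() * 1000), random.randrange(10**6))
    run_dir = os.path.join(base, "run_" + tag)
    os.makedirs(run_dir)
    return run_dir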

iacopoff commented 5 years ago

@bees4ever and @thouska, sorry, maybe I was not clear:

In the hymod exe for unix calibration, the call line under the def simulation part works fine.

In the VIC calibration I had to move it into __init__, but then using a self.call class variable. I may share the code I wrote with you as an example sometime in the future, if you like.
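A rough sketch of that variant (simplified, not my actual code):

import os

class spotpy_setup(object):
    def __init__(self, parallel='mpi'):
        self.parallel = parallel
        if self.parallel == 'mpi':
            # read the rank once; every later simulation call reuses it
            self.call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)

    def simulation(self, vector):
        run_dir = "run_" + self.call  # one folder per MPI rank
        # ... copy inputs and run VIC inside run_dir, as described above ...
        return []  # placeholder: the real code returns the simulated series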

thanks again

thouska commented 4 years ago

I hope this issue is mainly solved. @iacopoff if you want to share your implementation, I would be very interessted to see it. If anyting is still not working, feel free to reopen this issue.