stivalaa / culture_cooperation

Culture and cooperation in a spatial public goods game
GNU General Public License v3.0

When I use a smaller lattice and parameter set for testing, there is still no result. #4

Closed: Frostjon closed this issue 5 years ago

Frostjon commented 5 years ago

When I use a smaller lattice and parameter set for testing, the program still seems to get stuck. I used this command: "mpirun --mca mpi_warn_on_fork 0 python ./lattice-python-mpi/src/axelrod/geo/expphysicstimeline/multiruninitmain.py m:4 F:5 strategy_update_rule:fermi culture_update_rule:fermi ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model 16", and the output is as follows:

[zh@localhost ~]$ mpirun --mca mpi_warn_on_fork 0 python ./lattice-python-mpi/src/axelrod/geo/expphysicstimeline/multiruninitmain.py m:4 F:5 strategy_update_rule:fermi culture_update_rule:fermi ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model 16
Psyco not installed or failed execution. Using c++ version with ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model
Psyco not installed or failed execution. Using c++ version with ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model
Psyco not installed or failed execution. Using c++ version with ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model
Psyco not installed or failed execution. Using c++ version with ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model
Clean start
Writing results to results/16/results3.csv
700 of total 700 models to run
175 models per MPI task
time series: writing total 70700 time step records
rank 3: 16,30,10.000000,1.000000,0.000000,None,2,5,0.600000,0.000100,0
Clean start
Writing results to results/16/results2.csv
700 of total 700 models to run
175 models per MPI task
time series: writing total 70700 time step records
rank 2: 16,30,10.000000,1.000000,0.000000,None,2,5,0.600000,0.000010,0
writeNetwork: 0.00170707702637
writeNetwork: 0.00317406654358
writeNetwork: 0.00361394882202
writeNetwork: 0.0104489326477
Clean start
Writing results to results/16/results0.csv
700 of total 700 models to run
175 models per MPI task
time series: writing total 70700 time step records
rank 0: 16,30,10.000000,1.000000,0.000000,None,2,5,0.600000,0.000000,0
writeNetwork: 0.00322198867798
writeNetwork: 0.0029091835022
Clean start
Writing results to results/16/results1.csv
700 of total 700 models to run
175 models per MPI task
time series: writing total 70700 time step records
rank 1: 16,30,10.000000,1.000000,0.000000,None,2,5,0.600000,0.000001,0
writeNetwork: 0.00337886810303
writeNetwork: 0.00154304504395

And what kind of job monitoring system should I use? Is this one, https://github.com/nicolargo/glances, fine?

Thanks a lot!

stivalaa commented 5 years ago

Well, I don't know what is going wrong then, if it doesn't even finish for a very small example like that.

I don't know about that job monitoring system; usually there is one already available for whatever system you are using, accessible on its internal management webpage. Or what I used to do is just ssh to the specific node your job is on and use ps and top.

But perhaps you should test by running just a single task in an interactive session, to check that works, in case there is some Python MPI problem.

Frostjon commented 5 years ago

Thank you for your reply. I followed your advice and ran some tasks to check for a Python MPI problem; after that I do not think there is a problem with it. Now I wonder if there is some software configuration error. I followed all the software versions you specified, except for the python-igraph version, because I cannot get 0.6. The following are my pip package versions:


Package         Version
mpi4py          1.3.1
numpy           1.9.1
pip             9.0.1
python-igraph   0.7.1.post6
scipy           0.14.1
setuptools      0.6rc11

And I run it on Red Hat Enterprise Linux Server release 6.2 (Santiago).

stivalaa commented 5 years ago

So it is probably something to do with the setup of MPI and mpi4py, perhaps? (I doubt the igraph version is a problem if it works without MPI in an interactive session.) I think mpi4py has some test scripts; you could try those and see if they work.

Frostjon commented 5 years ago

I just ran this script to test, and it works well:

import mpi4py.MPI as MPI

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

# rank 0 scatters one integer to each rank; every rank doubles its value,
# then rank 0 gathers the results back
if comm_rank == 0:
    data = range(comm_size)
    print data
else:
    data = None
local_data = comm.scatter(data, root=0)
local_data = local_data * 2
print 'rank %d, got and do:' % comm_rank
print local_data
combine_data = comm.gather(local_data, root=0)
if comm_rank == 0:
    print combine_data

Can that make sure it is not an MPI or mpi4py problem?

stivalaa commented 5 years ago

Well, if the mpi4py tests pass, then I guess it probably isn't that. I guess you will just have to run the program without parallelization, on a single node without using MPI, since it worked that way?

Frostjon commented 5 years ago

Does that mean I could just run this command: "python ./lattice-python-mpi/src/axelrod/geo/expphysicstimeline/multiruninitmain.py m:4 F:5 strategy_update_rule:fermi culture_update_rule:fermi ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model 16"? However, it seems to get stuck too; it shows:


[zh@localhost ~]$ python ./lattice-python-mpi/src/axelrod/geo/expphysicstimeline/multiruninitmain.py m:4 F:5 strategy_update_rule:fermi culture_update_rule:fermi ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model 16
Psyco not installed or failed execution. Using c++ version with ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model
Clean start
Writing results to results/16/results0.csv
700 of total 700 models to run
700 models per MPI task
time series: writing total 70700 time step records
rank 0: 16,30,10.000000,1.000000,0.000000,None,2,5,0.600000,0.000000,0
writeNetwork: 0.00998592376709
writeNetwork: 0.00387692451477


Could you show me the result of this command on your PC? And how long does it probably take? Thanks very much!

stivalaa commented 5 years ago

It will still take a long time (probably weeks), because tmax, where it is set in main, is still at the default value of 1e09. If you change it to e.g. 1000 it takes about one minute on this example on the system I am using. But it has to be large enough to reach stochastic equilibrium, which you can only tell by trying.
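
That change is just a quick-test setting; a minimal sketch, assuming tmax is assigned near the top of main() in multiruninitmain.py (check the actual location in your copy):

# quick-test value only; far too small to reach stochastic equilibrium
# tmax = int(1e09)   # default used for the published results
tmax = 1000          # finishes in about a minute on this small example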

Frostjon commented 5 years ago

Thanks very much. This helps me a lot!

Frostjon commented 5 years ago

What's more, I want to ask if I can save more time by improving the CPU configuration.

stivalaa commented 5 years ago

The Python script just generates a list of all the different sets of parameters to run with, and does 50 repetitions of each parameter set. It then distributes those to as many MPI tasks as you have in your job, so if you use 10 tasks it is pretty much 10 times faster than using just 1 task, for example (although it has to take at least as long as the longest run).
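
A minimal sketch of that kind of work division with mpi4py; the names q_list, noise_list, runs and run_model here are illustrative stand-ins, not the actual identifiers in multiruninitmain.py:

from itertools import product
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

q_list = [2, 3, 5, 10]      # illustrative parameter values only
noise_list = [0.0, 1e-06]
runs = 50                   # repetitions of each parameter set

def run_model(q, noise, rep):
    # placeholder for the call that actually runs the C++ simulation
    print("rank %d running q=%s noise=%s rep=%d" % (rank, q, noise, rep))

# every (q, noise) combination, repeated 'runs' times
jobs = [(q, noise, rep)
        for q, noise in product(q_list, noise_list)
        for rep in range(runs)]

# static round-robin split of the job list across the MPI tasks in this job
for q, noise, rep in jobs[rank::size]:
    run_model(q, noise, rep)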

Frostjon commented 5 years ago

Thanks a lot. When I used tmax=1e08, the project took about 7 days. After that I used the "plotMaxRegionVsQend.R" script and got some figures, for example the figure of "Average number of players per game". I can see my picture has the same trend as yours, but it does not look as good as yours. Here are my results data and some figures: https://github.com/Frostjon/the-culutre. I wonder if it is the tmax=1e08 (not 1e09) that causes this result?

Frostjon commented 5 years ago

And I found that there are two values of q in results.csv, so how do the figures get the different values? Thanks!

stivalaa commented 5 years ago

The different values of q for the simulations are listed in q_list in the multiruninitmain.py main function (around line 600 and following). Similarly for theta_list, noise_list, etc. So you should carefully set these to choose what simulations to run - and it can result in a very large number, since all the different values are done in nested loops: for every value of q in q_list it then uses every value of theta in theta_list, and so on.
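
To get a feel for how quickly those nested lists multiply, here is a small illustration (the values are made up for the example, not the defaults in the file):

q_list = range(1, 101)                 # 100 values, e.g. as for Fig. 1
theta_list = [1.0]
noise_list = [0.0, 1e-06, 1e-05, 1e-04]
runs = 50

total = len(q_list) * len(theta_list) * len(noise_list) * runs
print(total)                           # 20000 separate model runs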

These all end up in results.csv, and then the R scripts plot them using the different values as factors (or continuous values as appropriate).

Probably the differences are due to having a smaller tmax. You can plot with time on the x axis with plotMaxRegionVsTime.R etc. to check that it is reaching stochastic equilibrium. Note that when there is no noise you can run until it reaches a fixed point (absorbing state), but when there is any noise it is stochastic, so there is no absorbing state and changes can always happen.

Frostjon commented 5 years ago

Thanks a lot. Now I realize that if I want to draw Fig. 1 in your paper ("Culture and cooperation in a spatial public goods game"), I should reset q_list=[1,2,3,...,100] in multiruninitmain.py. Is that right?

Frostjon commented 5 years ago

What's more, is "tmax" parameter means iterations? And can I set the "runs" parameter for a low value, such as 20, is it has other influence?

stivalaa commented 5 years ago

Yes, Fig. 1 has different values of q from 1 up to 100, so you would need them all in q_list. tmax is the number of iterations. The runs parameter is the number of times the model is run with the same set of parameters, so that the variance of the resulting statistics can be computed to get the error bars on the plots. It could be reduced to 20, for example, rather than the 50 used in the paper, if you want to use fewer resources.
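
A minimal sketch of what that averaging over repeated runs amounts to; the numbers are randomly generated stand-ins, not real simulation output:

import numpy as np

rng = np.random.RandomState(0)
coop = rng.uniform(0.55, 0.65, size=20)       # stand-in for 20 repeated runs

mean = coop.mean()
sem = coop.std(ddof=1) / np.sqrt(len(coop))   # standard error -> error bar size
print("mean = %.3f +/- %.3f" % (mean, sem))

Fewer runs means a larger standard error, i.e. bigger error bars on the plots.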

Frostjon commented 5 years ago

Thank you very much. I want to ask if there is any other parameter that can be reduced, if I just want to reproduce Fig. 1 (or Figs. 2 and 3). My PC is not very powerful, so I have been trying to build a small cluster recently.

stivalaa commented 5 years ago

You could try setting runs to 1 to just get single data points without any idea of the variance (so no error bars), but of course that is quite different from the published figures. Also, since noise = 0 is qualitatively different from any nonzero noise, if you are only interested in one case you can just set noise_list = [0] for zero noise, or noise_list = [1e-06], for example, for just one case of nonzero noise.
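
Putting those suggestions together, a cut-down configuration in multiruninitmain.py might look roughly like this; the variable names follow the ones mentioned above, but the exact lines in the file may differ:

# reduced settings for a cheap exploratory run, not for publication-quality plots
q_list = [2, 10, 50, 100]   # a few q values instead of the full sweep
noise_list = [1e-06]        # one nonzero noise case (or [0] for the no-noise case)
runs = 1                    # single data point per parameter set, no error bars
tmax = int(1e08)            # shorter than the 1e09 used for the paper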

Frostjon commented 5 years ago

Yes. When I use this command "mpirun -n 20 -hostfile servers --mca mpi_warn_on_fork 0 python ./lattice-python-mpi/src/axelrod/geo/expphysicstimeline/multiruninitmain.py m:100 F:5 strategy_update_rule:fermi culture_update_rule:fermi ./lattice-jointactivity-simcoop-social-noise-constantmpcr-cpp-end/model 10000", I realize the -n parameter is related to the runs parameter. Does this mean I cannot use MPI to accelerate the program if I just want single data points without variance (runs = 1)?

stivalaa commented 5 years ago

Yes, this version was optimized (as per the comments at the top) so that you want the same number of tasks as runs, since then the repetitions of the same parameter set can be run in parallel; they are likely to take (approximately) the same time, so this maximizes the use of the parallel tasks. And it is important for the results (especially for publication) to get the error bars, so I always wanted at least 10 (and preferably more) repetitions of each parameter set. There is another version of the code (main.py instead of multiruninitmain.py) that did not do this, so you could parallelize even a single run, but it doesn't seem to be included in the code I uploaded to git, and I'm not sure I kept it up to date with all the code changes I made in multiruninitmain.py, so it probably no longer works (even if I could find the latest version).
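
As a rough illustration of that one-task-per-repetition idea (again with made-up names, not the actual code in multiruninitmain.py): when the number of MPI tasks equals runs, each rank handles exactly one repetition of every parameter set, so each set finishes as soon as its slowest repetition does:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

runs = 50   # repetitions per parameter set; ideally equal to the MPI task count

def run_model(q, noise, rep):
    # placeholder for one simulation run of the C++ model
    pass

for q in [2, 10, 100]:                 # illustrative parameter sweep
    for noise in [0.0, 1e-06]:
        for rep in range(rank, runs, size):
            run_model(q, noise, rep)   # with size == runs, one rep per rank
        comm.Barrier()                 # wait for the slowest repetition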

Frostjon commented 5 years ago

Well, when I use the parameters "q_list = [2,10,20,30,35,40,45,50,52,55,58,60,65,68,70,73,75,80,85,90,95,100], noise_list = [0, 1e-05, 1e-03, 1e-01], tmax = 1000000000 # 1e09", the project has been running for about a month and still has not stopped. I gave it 8 CPUs (each with 2 cores; the CPU is E5-2620*2). Is that configuration too low? And I want to ask how long it takes on your computer?

stivalaa commented 5 years ago

Yes, I'm afraid that's about how long you would expect. I restored some of the raw results from my archive hard drive and checked: for example, with q_list = 2,5,10,15,30,100 and tmax = 1e09 and m=100 (so n = 10000) and noise_list = 0,1e-06,1e-05,0.0001,0.001,0.01,0.1, it took about 2 weeks elapsed time with 50 cores (one per repeated run on the same parameter set) on a Lenovo x86 cluster (992 Intel Haswell compute cores running at 2.3GHz). And that is just a fraction of the runs needed to make a single plot in the paper (many more values of q are included, so this is repeated several times with different q values to get them all).

Frostjon commented 5 years ago

Thanks for the reply. I do not know whether you used Psyco or not. I searched for some information about Psyco, and it seems that it only supports up to Python 2.6.

stivalaa commented 5 years ago

No, I never actually used it. That was left over from the original author of the code (Jens) that I started with. It wouldn't make much difference, as I redid the Python code to be more efficient, and the time it takes is trivial compared to the actual simulation, which is done in the C++ code; if you are using something like 1e09 iterations, that is what takes all the time.

stivalaa commented 5 years ago

closing