pylada / pylada-light

A physics computational framework for python and ipython
GNU General Public License v3.0

Hostfile does not work with openmpi #25

Closed ftherrien closed 6 years ago

ftherrien commented 6 years ago

https://github.com/pylada/pylada-light/blob/7e78d8f16304b932f792befa513443caef0ecf35/process/mpi.py#L177-L179

The default behavior of mpirun in openmpi is to assume 1 slot (core) per host, which makes any call to mpirun fail when more than one core is used. This can be solved by uncommenting L177 and removing L178-L179.
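To make the slot-counting problem concrete, here is a rough sketch of the rule described above: a host line without an explicit `slots=N` counts as one slot. This is a simplified model for illustration only, not Open MPI's actual parser and not pylada code; `count_slots` and the node names are hypothetical.

```python
def count_slots(hostfile_text):
    """Total slots in a hostfile, assuming 1 slot for lines without slots=N.

    Simplified model of the behaviour described in this issue; it is not
    Open MPI's real hostfile parser.
    """
    total = 0
    for raw in hostfile_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        slots = 1  # assumed default when no slots=N field is present
        for field in line.split()[1:]:
            if field.startswith("slots="):
                slots = int(field[len("slots="):])
        total += slots
    return total


# Two hosts listed by name only: only 2 slots in total under this rule.
bare = "node0\nnode1\n"
# Same hosts with explicit slot counts: 8 slots in total.
explicit = "node0 slots=4\nnode1 slots=4\n"
```

Under this model, asking mpirun for 8 processes would fit the second hostfile but not the first.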

Why was it commented out in the first place? The current version works with intel mpi, but uncommenting L177 works with both intel mpi and openmpi, so why lose the generality by commenting it out?

Also, looking at the bigger picture, what is the advantage of writing the hostfile? The job scheduler writes it automatically if it is not specified. As far as I can tell, not having to check the hosts manually would free pylada from the mpi4py dependency.

mdavezac commented 6 years ago

The advantage is that by specifying the hosts, we can split the same pbs/slurm job between, say, two different VASP runs. This is advantageous when running stuff like genetic algorithms. However, not many people use this as it is quite brittle and machine dependent. There is an option that disables the feature, in which case we really shouldn't be using the hostfile at all, as you point out. I'll give it a look.
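As a rough illustration of the splitting idea (not pylada's actual implementation), the sketch below carves a given number of cores out of a host-to-cores mapping and returns what is left for a second run. The function `split_hosts` and the node names are hypothetical.

```python
def split_hosts(hosts, ncores):
    """Take `ncores` cores from `hosts`; return (taken, remaining).

    `hosts` maps host name to the number of free cores on that host.
    Hypothetical helper for illustration only.
    """
    taken = {}
    remaining = dict(hosts)
    needed = ncores
    for name in sorted(hosts):
        if needed == 0:
            break
        n = min(remaining[name], needed)
        if n > 0:
            taken[name] = n
            remaining[name] -= n
            needed -= n
    if needed > 0:
        raise ValueError("not enough cores in the allocation")
    return taken, remaining


# One scheduler allocation of 2 nodes x 4 cores, shared by two runs.
allocation = {"node0": 4, "node1": 4}
first_run, leftover = split_hosts(allocation, 6)
second_run, leftover = split_hosts(leftover, 2)
```

Each run would then get its own hostfile built from its share of the allocation, which is why the hosts need to be known explicitly.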

mdavezac commented 6 years ago

Okay, now I remember how this is meant to work. Unfortunately, some of this stuff is machine dependent, so it has to be specified by the user.

I'm assuming that you are using a modern cluster where mpirun knows how to do things automatically.

Then in your ~/.pylada, you want to add the following:

do_multiple_mpi_programs=False

mpirun_exe = "mpirun {program}"

def machine_dependent_call_modifier(formatter=None, comm=None, env=None):
    pass

ftherrien commented 6 years ago

This looks like a good solution to me!

  1. Should we make this the default behavior?
  2. Coming back to L177 from my first post: isn't uncommenting that line the most general way then? Doesn't it make the code machine independent?

In my opinion, we should uncomment L177 AND make your solution the default behavior. I can make the changes (and do a pull request this time!) if you agree.

mdavezac commented 6 years ago

I think the format for the hostfile depends on whether you are using intel mpi, openmpi, or mpich. That's why I'm not too enthusiastic about uncommenting line 177. Maybe we should have specialized functions for each of the three hostfile formats. Also, the default behavior should probably be not to write a hostfile, since I don't think anybody uses the ability to run several jobs in the same cluster submission script. As for the mpirun_exe, I'll have to check. It might be that specifying flags that are not always necessary makes it more general.

ftherrien commented 6 years ago

Like you said, the real solution is to have specialized functions for each of the three hostfile formats (or at least two of them, because intel mpi and openmpi have the same format, just not the same default number of slots). Does pylada already check the flavour of mpi? If not, it could just be user defined.
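One possible shape for such specialized functions, as a sketch only: a user-chosen flavour name selects a line formatter. The names `HOSTFILE_FORMATTERS` and `format_hostfile` are hypothetical and not part of pylada. The `host slots=N` form is the Open MPI hostfile syntax and `host:N` is the form accepted by MPICH's Hydra launcher; intel mpi is left out here because this thread does not settle its format.

```python
def openmpi_line(host, ncores):
    # Open MPI hostfile syntax: explicit slot count per host.
    return "{0} slots={1}".format(host, ncores)


def mpich_line(host, ncores):
    # MPICH (Hydra) host file syntax: host name, colon, process count.
    return "{0}:{1}".format(host, ncores)


# Hypothetical registry; the flavour would be set by the user.
HOSTFILE_FORMATTERS = {"openmpi": openmpi_line, "mpich": mpich_line}


def format_hostfile(hosts, flavour):
    """Return hostfile contents for `hosts` (name -> cores) in one flavour."""
    try:
        line = HOSTFILE_FORMATTERS[flavour]
    except KeyError:
        raise ValueError("unknown mpi flavour: {0!r}".format(flavour))
    return "".join(
        line(name, n) + "\n" for name, n in sorted(hosts.items()) if n > 0
    )


example_hosts = {"node0": 4, "node1": 2, "node2": 0}
```

Keeping the flavour as a plain user setting avoids having to detect the MPI implementation at run time.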

> the default behavior should probably be not to write a hostfile

Yes, I agree, and then if a user wants a hostfile they would have to deal with the formatting, and not the other way around.