Closed andxeg closed 5 years ago
Can you try running a non-MPI program and see what happens? That should determine if mpiexec is able to reach and launch binaries on all machines. For example:
mpiexec -np N -hosts master,slave hostname
@raffenet Thank you for the fast reply.
I checked your command. Indeed, mpiexec writes to stdout
master
master
and then hangs indefinitely.
I also ran this command:
# strace mpiexec.mpich -np 4 -hosts master,slave hostname
It printed:
fcntl(1, F_SETFL, O_RDWR|O_APPEND|O_NONBLOCK|O_LARGEFILE) = 0
read(8, "", 65536) = 0
close(8) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=0, events=POLLIN}], 5, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12442, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
restart_syscall(<... resuming interrupted poll ...>) = 1
read(13, "Permission denied, please try ag"..., 65536) = 38
write(2, "Permission denied, please try ag"..., 38Permission denied, please try again.
) = 38
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=0, events=POLLIN}], 5, -1) = 1 ([{fd=13, revents=POLLIN}])
read(13, "Permission denied, please try ag"..., 65536) = 38
write(2, "Permission denied, please try ag"..., 38Permission denied, please try again.
) = 38
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=0, events=POLLIN}], 5, -1) = 1 ([{fd=13, revents=POLLIN}])
read(13, "Permission denied (publickey,pas"..., 65536) = 41
write(2, "Permission denied (publickey,pas"..., 41Permission denied (publickey,password).
) = 41
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=11, events=POLLIN}, {fd=13, events=POLLIN}, {fd=0, events=POLLIN}], 5, -1) = 1 ([{fd=13, revents=POLLHUP}])
read(13, "", 65536) = 0
close(13) = 0
poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=11, events=POLLIN}, {fd=0, events=POLLIN}], 4, -1) = 1 ([{fd=11, revents=POLLHUP}])
read(11, "", 65536) = 0
close(11)
I can ssh from one VM to the other without a password.
Are both hostnames set to master? That might cause issues during MPI_Init.
One VM has the hostname 'master', the other 'slave'. I also added their IP addresses and hostnames to /etc/hosts. In the case above I ran 4 MPI processes, so stdout has 2 lines with master; then execution froze, because there is some problem on slave.
You may need to allow additional TCP communication in the firewall rules. From the sound of your experiments, mpiexec is unable to establish a connection to the agent it launched on slave via SSH. For reference, there is an environment variable you can set to specify the allowed port range. See https://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Environment_Settings
I can ssh from master to slave and from slave to master without a password:
mpiuser@master$ ssh slave
mpiuser@slave$ ssh master
In iptables, the INPUT, OUTPUT and FORWARD chains have the default ACCEPT policy.
I set the environment variable MPIEXEC_PORT_RANGE to 10000:10010 and started mpiexec on master:
mpiuser@master$ mpiexec.mpich -np 2 -hosts master,slave hostname
Output:
master
and then execution froze, waiting for the slave node.
On the slave node, the output of mpiuser@slave$ ps -ef | grep hydra is:
mpiuser 15439 15438 0 07:04 ? 00:00:00 /usr/bin/hydra_pmi_proxy --control-port master:10000 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
I checked whether the port is open:
mpiuser@slave1$ nc -zv master 10000
Connection to master 10000 port [tcp/webmin] succeeded!
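The same check can be repeated for every port in the configured range. The following is a minimal Python sketch, not part of Hydra: the helper names are hypothetical, and treating the range as inclusive on both ends is an assumption. Note that a successful connect only shows that the TCP handshake completes; it does not exercise larger packets.

```python
import socket


def parse_port_range(spec):
    """Turn 'low:high' into a list of ports.

    Both ends are treated as inclusive, which is an assumption
    about how the range is interpreted.
    """
    low_text, high_text = spec.split(":")
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return list(range(low, high + 1))


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, on the slave node while mpiexec is still waiting on master, one could loop over parse_port_range("10000:10010") and call port_accepts_connections("master", port) for each port. Only ports that something is currently listening on will report True.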
mpiuser@master$ mpiexec.mpich -np 2 -hosts master,slave hostname
Can you add -v to this command and paste the output?
@raffenet Thank you very much!
I found the cause of the problem. I had created a VXLAN tunnel between my two servers.
Packets from one server to the other go through the Cisco router, and it dropped them. I hadn't considered that.
After I set the MTU to 1450 instead of 1500 on the interfaces in the virtual machines, the mpirun command finished successfully.
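The value 1450 matches the usual VXLAN arithmetic for an IPv4 underlay with a 1500-byte MTU. The sketch below works that out in Python; the function name is hypothetical, and the header sizes assume no VLAN tags and no IP options.

```python
# Header sizes in bytes that count against the underlay (physical) MTU.
OUTER_IPV4 = 20      # outer IPv4 header without options
OUTER_IPV6 = 40      # outer IPv6 header without extension headers
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header
INNER_ETHERNET = 14  # encapsulated Ethernet header of the VM's frame (no VLAN tag)


def vxlan_inner_mtu(underlay_mtu, ipv6_underlay=False):
    """Largest MTU usable inside the tunnel without fragmenting the outer packet."""
    outer_ip = OUTER_IPV6 if ipv6_underlay else OUTER_IPV4
    overhead = outer_ip + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


print(vxlan_inner_mtu(1500))  # 1450
```

With the VM interfaces left at 1500, full-size frames grow to 1550 bytes after encapsulation, which exceeds a 1500-byte underlay MTU. Short exchanges such as ssh logins or ping with default sizes still fit, which would be consistent with the symptoms described in this thread.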
Hello everyone.
I have a problem creating a virtual MPI cluster. I have 2 physical servers, and on each I start one virtual machine using virt-install. Between the servers I made a VXLAN tunnel. The virtual machines can communicate with each other: ssh, ping and TCP (checked with simple Python client/server scripts). But
mpirun -np N -hosts master,slave ./test_mpi_prog
didn't work: it freezes indefinitely. I checked ebtables and iptables, and removed the rules from INPUT, OUTPUT and FORWARD.
I added master and slave1 to /etc/hosts on each virtual machine and set up key-based ssh (without a password). The setup itself works, because I checked it on one server: I started two virtual machines attached to the virbr0 bridge and assigned IP addresses. This one-server cluster works: the command mpirun -np N -hosts master,slave ./test_mpi_prog finished. The test program test_mpi_prog prints the processor name (using the MPI_Get_processor_name function).
The virtual machines have the same OS: Ubuntu 16.04.
I configured the virtual machines by following the instructions at https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/
MPICH version (mpirun --version).
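The Python client/server check mentioned in this post may have exchanged only short messages, each fitting in one small packet (an assumption, since the scripts are not shown). Here is a minimal sketch of a check that pushes a payload much larger than one packet through the connection; the function names are hypothetical.

```python
import socket


def count_bytes_once(server_sock):
    """Accept one connection, read until the peer stops sending, reply with the byte count."""
    conn, _ = server_sock.accept()
    with conn:
        total = 0
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            total += len(chunk)
        conn.sendall(str(total).encode())


def send_payload(host, port, size, timeout=10.0):
    """Send `size` bytes to host:port and return the byte count the server reports.

    Raises a timeout error (an OSError subclass) if the transfer stalls,
    for example when full-size packets are dropped on the path.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"x" * size)
        sock.shutdown(socket.SHUT_WR)  # tell the server we are done sending
        reply = b""
        while True:
            chunk = sock.recv(64)
            if not chunk:
                break
            reply += chunk
    return int(reply.decode())
```

One VM would bind and listen on a socket and call count_bytes_once on it, while the other calls send_payload with a size of, say, 100000 bytes. If short messages pass but this call times out, the path is likely dropping large packets.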