pmodels / mpich

Official MPICH Repository
http://www.mpich.org

Need help: Raspberry Pi Cluster #5050

Closed titolagarto closed 3 years ago

titolagarto commented 3 years ago

I have Ubuntu Desktop on the master node and I configured everything, but when I tried to run a C program these errors appeared.

Master node: Raspberry Pi 4 Model B+
Nodes: Raspberry Pi 3 Model B+

I'm following this tutorial: https://help.ubuntu.com/community/MpichCluster

master@master-desktop:~$ mpiexec -n 5 -f machinefile 
[proxy:0:1@node1] version_fn (pm/pmiserv/pmip_utils.c:449): UI version string does not match proxy version
[proxy:0:1@node1] match_arg (utils/args/args.c:156): match handler returned error
[proxy:0:1@node1] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[proxy:0:1@node1] parse_exec_params (pm/pmiserv/pmip_cb.c:769): error parsing input array
[proxy:0:1@node1] procinfo (pm/pmiserv/pmip_cb.c:849): unable to parse argument list
[proxy:0:1@node1] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:882): error parsing process info
[proxy:0:1@node1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@node1] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[mpiexec@master-desktop] control_cb (pm/pmiserv/pmiserv_cb.c:208): assert (!closed) failed
[mpiexec@master-desktop] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@master-desktop] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@master-desktop] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
hzhou commented 3 years ago

Are the master and the nodes running the same version of MPICH?

titolagarto commented 3 years ago

Now that I've checked, no: the master is on version 3.3.2 and the nodes are on version 3.2. How can I update the nodes to version 3.3.2?

hzhou commented 3 years ago

It will be much easier if you mount the shared folder on all nodes (including the master) at the same path. For example, you may mount the shared folder at /mirror on the master as well. Then you can install MPICH to a path inside /mirror. That way you will always be running the exact same binary on all nodes and the master.
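For reference, a minimal sketch of what that shared-folder setup might look like with NFS, roughly following the Ubuntu MpichCluster tutorial linked above (package names, hostnames, and export options here are illustrative, not from this thread):

# On the master: share /mirror over NFS
sudo apt install nfs-kernel-server
echo '/mirror *(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -a

# On each node: mount it at the same path
sudo apt install nfs-common
sudo mkdir -p /mirror
sudo mount master:/mirror /mirror    # or add a matching entry to /etc/fstab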

titolagarto commented 3 years ago

Well, now it's giving me another error:


===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 4197 RUNNING AT master
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
hzhou commented 3 years ago

One of your processes had a segfault. Does your program run without mpiexec, i.e. as a single process?

titolagarto commented 3 years ago

The output is:


pi@master:~/cluster_files $ ./mpi_hello 
Segmentation fault
pi@master:~/cluster_files $

The code is the file "hello.c" from the examples folder:


/*
 * Copyright (C) by Argonne National Laboratory
 *     See COPYRIGHT in top-level directory
 */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    int size;

    MPI_Init(0, 0);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
hzhou commented 3 years ago

The program looks OK. Could you try the latest MPICH release?

titolagarto commented 3 years ago

How can I update MPICH?

hzhou commented 3 years ago

It appears the stock MPICH that came with the Raspberry Pi is not good. You can download the latest version from https://www.mpich.org/downloads/. Once unpacked, you may build it as:

./configure --prefix=/mirror --with-device=ch3 --disable-fortran
make install
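Spelled out, fetching and unpacking a release before the configure step above might look like this (the 3.4 version number and tarball path are illustrative; check https://www.mpich.org/downloads/ for the current one):

cd /mirror
wget https://www.mpich.org/static/downloads/3.4/mpich-3.4.tar.gz
tar xzf mpich-3.4.tar.gz
cd mpich-3.4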

I suggest disabling Fortran to make it build a bit faster on the Raspberry Pi -- well, it will still be pretty slow.

Make sure you add /mirror/bin to your PATH and /mirror/lib to your LD_LIBRARY_PATH, and you should be good to go.
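As a sketch, assuming the /mirror prefix, those settings could be appended to ~/.bashrc on the master and on each node:

export PATH=/mirror/bin:$PATH
export LD_LIBRARY_PATH=/mirror/lib:$LD_LIBRARY_PATH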

I just tested it, it should work on your raspberry pi.

Oh, make sure to apt remove mpich to get rid of the non-working one.

titolagarto commented 3 years ago

OK, just one more question: if I want to install to /usr/bin, is it just a matter of changing --prefix=?

titolagarto commented 3 years ago

I have installed to /usr for testing, and running locally with 4 cores gives me this error:


pi@master:~/cluster_files $ mpiexec -n 4 -f machinefile ./mpi_hello

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 706 RUNNING AT master
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
pi@master:~/cluster_files $ mpiexec --version
HYDRA build details:
    Version:                                 3.4
    Release Date:                            Tue Jan  5 09:27:10 CST 2021
    CC:                              gcc    
    Configure options:                       '--disable-option-checking' '--prefix=NONE' '--with-device=ch3' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -I/home/pi/cluster_files/mpich-3.4/src/mpl/include -I/home/pi/cluster_files/mpich-3.4/src/mpl/include -I/home/pi/cluster_files/mpich-3.4/modules/yaksa/src/frontend/include -I/home/pi/cluster_files/mpich-3.4/modules/yaksa/src/frontend/include -I/home/pi/cluster_files/mpich-3.4/modules/json-c -I/home/pi/cluster_files/mpich-3.4/modules/json-c -D_REENTRANT -I/home/pi/cluster_files/mpich-3.4/src/mpi/romio/include' 'MPLLIBNAME=mpl'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select
pi@master:~/cluster_files $ 

Now the version is 3.4.

And when I try to run with the other nodes, it returns this:


pi@master:~/cluster_files $ mpiexec -n 8 -f machinefile ./mpi_hello

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 18179 RUNNING AT master
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
bash: /usr/local/bin/hydra_pmi_proxy: No such file or directory
^C[mpiexec@master] Sending Ctrl-C to processes as requested
[mpiexec@master] Press Ctrl-C again to force abort
[mpiexec@master] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)
[mpiexec@master] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec@master] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream
[mpiexec@master] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@master] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec@master] main (ui/mpich/mpiexec.c:326): process manager error waiting for completion
titolagarto commented 3 years ago

Would a clean install of Raspbian help?

hzhou commented 3 years ago

You should first confirm that you can run a single process, i.e. mpirun -n 1 ./mpi_hello.

Your mpi_hello may still be linked against the old library; you should recompile it with your newly installed mpicc.

It is advisable to install MPI to a shared drive so that the same binary at the same path automatically works on all nodes.
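A quick way to check both points might be something like the following, assuming the new install is first in PATH (these commands are a sketch, not from the thread):

which mpicc mpiexec           # both should resolve to the newly installed MPICH
mpicc -o mpi_hello hello.c    # recompile against the new library
ldd ./mpi_hello | grep mpi    # confirm it links against the new libmpi
mpirun -n 1 ./mpi_hello       # single-process sanity check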

titolagarto commented 3 years ago

I compiled it again, and running locally with 4 cores it works:

pi@master:~/cluster_files $ mpiexec -n 4 -f machinefile ./mpi_hello
Hello world from process 2 of 4
Hello world from process 0 of 4
Hello world from process 3 of 4
Hello world from process 1 of 4
pi@master:~/cluster_files $ 

But when I run on multiple nodes, it returns this error:

pi@master:~/cluster_files $ mpiexec -n 6 -f machinefile ./mpi_hello
bash: /usr/local/bin/hydra_pmi_proxy: No such file or directory
^C[mpiexec@master] Sending Ctrl-C to processes as requested
[mpiexec@master] Press Ctrl-C again to force abort
[mpiexec@master] HYDU_sock_write (utils/sock/sock.c:254): write error (Bad file descriptor)
[mpiexec@master] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:176): unable to write data to proxy
[mpiexec@master] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:42): unable to send signal downstream
[mpiexec@master] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@master] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:160): error waiting for event
[mpiexec@master] main (ui/mpich/mpiexec.c:326): process manager error waiting for completion

Is the error from the master or from the nodes?

hzhou commented 3 years ago

But when I run on multiple nodes, it returns this error:

pi@master:~/cluster_files $ mpiexec -n 6 -f machinefile ./mpi_hello
bash: /usr/local/bin/hydra_pmi_proxy: No such file or directory

You have an inconsistent path between the master and the nodes.
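One way to confirm the mismatch, assuming passwordless ssh to a node named node1 (hostname illustrative), is to compare where the Hydra proxy resolves on each machine:

which hydra_pmi_proxy                 # path on the master
ssh node1 'which hydra_pmi_proxy'     # path on a node -- the two must match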

titolagarto commented 3 years ago

So it's the connection between the nodes and the master?

hzhou commented 3 years ago

The current design requires MPICH to be installed at exactly the same path across all nodes. Not only that, it requires the application it launches to be at exactly the same path on every node.

That's why I commented that it is much easier to just install to the shared drive.

hzhou commented 3 years ago

@titolagarto I am closing this ticket. If you still need help, please feel free to re-open it or start a new ticket.