pmodels / mpich

Official MPICH Repository
http://www.mpich.org

MPICH with NVIDIA Compilers #7178

Open aruhela opened 1 month ago

aruhela commented 1 month ago

Hi MPICH Team,

I have built MPICH with the NVIDIA compilers (nvc, nvc++, nvfortran) on the TACC Vista machine. Launching with srun works, but the mpiexec job launcher fails with the following errors. Any suggestions?

i615-001gg$ mpiexec -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd
[proxy:3@i615-004.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_QmhOmh
[proxy:5@i615-012.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_kYI4Ja
[proxy:2@i615-003.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_7fPRik
[proxy:1@i615-002.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_bjz7BQ
[proxy:0@i615-001.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_LGXVSr
[proxy:6@i615-013.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_4GtuuA
[proxy:4@i615-011.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_ud3CVC
[proxy:7@i615-014.vista.tacc.utexas.edu] created hwloc xml file /tmp/hydra_hwloc_xmlfile_uKHjRx
[proxy:2@i615-003.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:4@i615-011.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:1@i615-002.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:6@i615-013.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:3@i615-004.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:5@i615-012.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:0@i615-001.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
[proxy:7@i615-014.vista.tacc.utexas.edu] cache_put_flush (proxy/pmip_pmi.c:183): assert (s) failed
Abort(878831119) on node 2: Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(49255)...: MPI_Init_thread(argc=0xfffff342b99c, argv=0xfffff342b990, required=1, provided=0xfffff342b988) failed
MPII_Init_thread(265).........:
MPIR_init_comm_world(34)......:
MPIR_Comm_commit(800).........:
MPIR_Comm_commit_internal(585):
MPID_Comm_commit_pre_hook(151):
MPIDI_world_pre_init(640).....:
MPIDI_UCX_init_world(263).....:
initial_address_exchange(79)..:
MPIDU_bc_table_create(153)....:
MPIR_pmi_allgather_shm(690)...:
get_ex_segs(431)..............:
(unknown)(): Other MPI error

hzhou commented 1 month ago

Which version of MPICH is this? Could you try the latest release?

aruhela commented 1 month ago

It's the latest version, 4.2.3.

hzhou commented 1 month ago

Could you add the -v and -l options to mpiexec and upload the console log?
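
One way to capture such a console log is sketched below, assuming a POSIX shell with `tee` available. `run_logged` is a hypothetical helper name used only for illustration; it is not part of MPICH or Hydra. It merges stderr into stdout so that the launcher's diagnostics and the application output end up in the same file.

```shell
# Sketch: run a command, show its output, and save a copy to a log file.
# run_logged is a hypothetical helper, not an MPICH or Hydra tool.
run_logged() {
    # $1: log file to write; remaining arguments: the command to run
    log_file=$1
    shift
    # Merge stderr into stdout so diagnostics land in the same file.
    "$@" 2>&1 | tee "$log_file"
}
```

For the job in this thread, the call would look like `run_logged run.log mpiexec -v -l -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd`, where `run.log` is the file to attach.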

aruhela commented 1 month ago

Here is the log file, run.log

The main error is:

[mpiexec@i615-001.vista.tacc.utexas.edu] Launch arguments: /usr/bin/srun -N 8 -n 8 --input none --external-launcher /scratch/projects/compilers/nvidia24/mpich/4.2.3_cpu/bin/hydra_pmi_proxy --control-port i615-001.vista.tacc.utexas.edu:45341 --debug --rmk slurm --launcher slurm --demux poll --pgid 0 --retries 10 --usize -2 --pmi-port 0 --gpus-per-proc -2 --gpu-subdevs-per-proc -2 --proxy-id -1
[proxy:1@i615-002.vista.tacc.utexas.edu] HYDU_create_process (lib/utils/launch.c:73): execvp error on file 1 (No such file or directory)
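
For anyone triaging a similar log, the failure lines can be pulled out with a small filter. This is a rough sketch, assuming a POSIX shell with `grep`, `sed`, and `sort`. The function names are hypothetical, and the patterns cover only the messages quoted in this thread (`execvp error`, `assert (...) failed`, `Fatal error`), not every error Hydra can print.

```shell
# Sketch: list the failure lines in a saved mpiexec console log.
# The patterns come from the messages quoted in this thread only.
hydra_errors() {
    # $1: path to the console log (for example, run.log)
    grep -n -E 'execvp error|assert \(.*\) failed|Fatal error' "$1"
}

# List each host whose proxy reported one of those failures, once per host.
hydra_failed_hosts() {
    hydra_errors "$1" |
        sed -n 's/^[0-9]*:.*\[proxy:[0-9]*@\([^]]*\)].*/\1/p' |
        sort -u
}
```

Running `hydra_errors run.log` prints the matching lines with their line numbers, and `hydra_failed_hosts run.log` shows whether the failure is confined to some nodes or occurs on all of them.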

hzhou commented 1 month ago

Could you try the following?

mpiexec -v -np 16 -ppn 2 ./namd3_mpi_smp_fftw3 +ppn 71 +pemap 1-71,73-143 +commap 0,72 stmv.namd

aruhela commented 1 month ago

Hui, here is the log.

run2.log