pmodels / mpich

Official MPICH Repository
http://www.mpich.org

UCX fails to initialize on ppc64le #7213

Open dalcinl opened 1 week ago

dalcinl commented 1 week ago

I'm having issues building MPICH 4.2.3 and 4.1.3 with external UCX 1.17.0 (+ fix from https://github.com/openucx/ucx/pull/9973) on ppc64le under emulation using podman. Builds on aarch64 and x86_64 are fine.

One of the build logs is here: https://github.com/mpi4py/mpi-publish/actions/runs/11803264738/job/32880846533. I can also reproduce the problem locally.

I'm configuring with --with-device=ch4:ofi,ucx and running the basic MPI helloworld example with MPICH_CH4_NETMOD=ucx set. I get the following failure:

Abort(135914895): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(49162).....: MPI_Init(argc=0x100000800470, argv=0x100000800478) failed
MPII_Init_thread(242)....: 
MPID_Init(552)...........: 
MPIDI_UCX_init_local(227):  ucx function returned with failed status(ucx_init.c 227 MPIDI_UCX_init_local Invalid parameter)

IIRC, our attempts to build MPICH with UCX on conda-forge also faced runtime issues in ppc64le. Any tips on how to further debug this issue?

raffenet commented 1 week ago

Is there anything useful in the output if you set UCX_LOG_LEVEL=info? Unfortunately, I'm unable to launch a ppc64le container on my M1 MacBook to debug interactively.

dalcinl commented 6 days ago

No, UCX_LOG_LEVEL=info produced no additional output. I'm building UCX with the configure-release script; I'll try again with a debug build.

dalcinl commented 6 days ago

Once again, a debug build did not produce any additional output 😞.