Open thomasgillis opened 7 months ago
@thomasgillis Thanks for the fix. I am able to build libfabric with cxi using your branch, but my application fails at runtime with the following error (I am using Sandia OpenSHMEM with libfabric and cxi as the provider):
[0000] WARN: transport_ofi.c:1420: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider==cxi)
[0000] ERROR: init.c:466: shmem_internal_heap_postinit
[0000] Transport init failed (-61)
Can you suggest a solution?
It seems to be a provider selection issue in OpenSHMEM, so I am afraid I cannot help you here :) I would reach out to them directly.
Copy/pasting my comment from https://github.com/ofiwg/libfabric/pull/9793#issuecomment-2075894886. We would really like to be able to build the cxi provider on our production Slingshot systems. I'm not totally sure how we get there from here, but we may be able to utilize ALCF resources for CI.
FWIW, I've reached out to folks at ALCF to see if there's anything that can be done to support, at minimum, build testing of cxi on the Polaris machine here at Argonne. Ideally, once cxi is able to build on a production system, CI could prevent further breaking changes from going in. @jswaro is that something that would be of interest?
On Perlmutter, configure gets further than on systems with an older Slingshot release (Perlmutter has 2.1.2), but it still fails with complaints about __user in a cxi-related header file:
configure:35099: WARNING: cxi_prov_hw.h: present but cannot be compiled
configure:35099: WARNING: cxi_prov_hw.h: check for missing prerequisite headers?
configure:35099: WARNING: cxi_prov_hw.h: see the Autoconf documentation
configure:35099: WARNING: cxi_prov_hw.h: section "Present But Cannot Be Compiled"
configure:35099: WARNING: cxi_prov_hw.h: proceeding with the compiler's result
configure:35099: checking for cxi_prov_hw.h
configure:35099: result: no
configure:35108: checking uapi/misc/cxi.h usability
configure:35108: gcc -c -O2 -DNDEBUG -pipe -fvisibility=hidden -Wall -Wundef -Wpointer-arith conftest.c >&5
In file included from conftest.c:147:
/usr/include/uapi/misc/cxi.h:76:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
76 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:82:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
82 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:96:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
96 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:110:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
110 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:130:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
130 | void __user *resp;
| ^
/usr/include/uapi/misc/cxi.h:144:38: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
is this what you also see @raffenet
oh I'm on main at 717ebc5dcba5a41221374bdfd64b75efe20b5b05
I think @thomasgillis ran into this and ended up just adding
#define __user
somewhere to make that issue go away because it's just a hint anyway.
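A minimal sketch of that workaround, assuming the only problem is that `__user` is undefined when the header is compiled in user space. In kernel sources `__user` is an annotation for static checkers and expands to nothing in a normal build, so an empty definition should be harmless. The struct and function names below are hypothetical stand-ins; only the `void __user *resp;` declaration comes from the compiler output above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: user-space code does not need the kernel annotation, so an
 * empty definition placed before the header is included lets it parse. */
#ifndef __user
#define __user
#endif

/* Hypothetical stand-in for a struct in uapi/misc/cxi.h. Only the
 * `void __user *resp;` member mirrors the line flagged by gcc. */
struct example_cmd {
    int op;             /* hypothetical field, for illustration only */
    void __user *resp;  /* same declaration shape as cxi.h:76 */
};

/* Returns 1 if the annotated member behaves like a plain void pointer. */
int resp_is_plain_pointer(void)
{
    struct example_cmd cmd = { 0, NULL };
    int value = 42;

    cmd.resp = &value;
    assert(cmd.resp != NULL);
    return sizeof(cmd.resp) == sizeof(void *) && *(int *)cmd.resp == 42;
}
```

In a real build the same effect could probably be had by passing `-D__user=` in `CPPFLAGS` at configure time, though whether that is enough for the cxi headers on a given system is untested here.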
Hi Thomas,
I tried using your branch, but I see some weird behavior. It seems that CXI is properly linked:
🔥 [caubet_m@login001:~]# ldd $(which fi_info) | grep cxi
libcxi.so.1 => /usr/lib64/libcxi.so.1 (0x00007f61079c4000)
However, the provider is not listed. Here is a shorter example that uses only the CXI provider:
🔥 [caubet_m@login001:~]# export FI_PROVIDER=cxi
🔥 [caubet_m@login001:~]# fi_info
fi_getinfo: -61 (No data available)
We run cray-libcxi-0.9-SSHOT2.1.3_20240529150829_3d1dc9246116.x86_64
and I was wondering whether you see (or saw) a similar issue.
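For readers decoding the `-61` above: it looks like a negated errno value, and on Linux errno 61 is `ENODATA`, whose message is "No data available". The helper below is a small sketch for translating such return codes; its name is hypothetical and it is not part of libfabric, which ships its own `fi_strerror` for this purpose.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: translate a negative errno-style return value
 * (such as the -61 printed by fi_info) into its system message.
 * Assumption: the value is a negated errno, as the output suggests. */
const char *describe_negative_errno(int ret)
{
    assert(ret <= 0);       /* expects 0 or a negated errno */
    return strerror(-ret);  /* strerror takes the positive errno */
}
```

Calling `describe_negative_errno(-ENODATA)` yields the platform's text for `ENODATA`. In this thread the code only says that no matching provider was returned, which fits the cxi provider not being listed.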
Hi all,
I am trying to build the cxi provider on LUMI. The update merged in #9791 breaks the build process because
lib-cxi
is too old. I am using the main branch with the patch suggested in #9789. Here are the commands used:
and the versions of the relevant libs:
I understand that the effort of open-sourcing
cxi
is tedious and that the versioning problem might not be resolved easily or quickly. This specific issue is intended to track the problems we currently face. In the meantime, I have reverted the changes in this branch: https://github.com/thomasgillis/libfabric/tree/dev-cxi (with the PR reverted, the code compiles correctly on LUMI).