openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Compile error with ucg master. #5833

Open xkszltl opened 3 years ago

xkszltl commented 3 years ago

Describe the bug

Version: ucx 1.9.0 + ucg master

  CC       plan/libucg_builtin_la-builtin_pairwise.lo
ops/builtin_control.c: In function 'ucg_builtin_step_zcopy_prep':
ops/builtin_control.c:245:34: error: assignment to 'uct_completion_callback_t' {aka 'void (*)(struct uct_completion *, enum <anonymous>)'} from incompatible pointer type 'void (*)(uct_completion_t *)' {aka 'void (*)(struct uct_completion *)'} [-Werror=incompatible-pointer-types]
      step->zcopy.zcomp.comp.func  = ucg_builtin_step_am_zcopy_comp_step_check_cb;

cc1: all warnings being treated as errors
make[3]: *** [ops/libucg_builtin_la-builtin_control.lo] Error 1
Makefile:618: recipe for target 'ops/libucg_builtin_la-builtin_control.lo' failed

Setup and versions

xkszltl commented 3 years ago

@alex--m FYI

xkszltl commented 3 years ago

The log says uct_completion_callback_t has no status argument, so there may be a mismatch somewhere:

[image]

alex--m commented 3 years ago

I'm on it. Some details on the problem: Change-ID 8da6a5be2e changed UCT's API:

 After:  typedef void (*uct_completion_callback_t)(uct_completion_t *self);
 Before: typedef void (*uct_completion_callback_t)(uct_completion_t *self,
                                                   ucs_status_t status);

For this reason, I added a check which means the code now looks like this:

static void ucg_builtin_step_am_zcopy_comp_step_check_cb(uct_completion_t *self
#ifdef HAVE_UCT_COMP_CB_STATUS_ARG
                                                       , ucs_status_t status)
#else
                                                        )
#endif
{
...

Looks like the check went wrong. I'm investigating why this happened and will fix it ASAP.
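For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of how one callback body can be compiled against either signature. It is not the UCG source: the `mock_*` types, `step_check_common`, `step_check_cb` and `run_demo` are hypothetical stand-ins, and only the `HAVE_UCT_COMP_CB_STATUS_ARG` macro name comes from the snippet above. Putting the shared logic in one helper keeps the `#ifdef` limited to the thin wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the UCT types; not the real definitions. */
typedef int mock_status_t;
#define MOCK_OK 0
typedef struct mock_completion mock_completion_t;

/* The callback type changes shape depending on what configure detected. */
#ifdef HAVE_UCT_COMP_CB_STATUS_ARG
typedef void (*mock_completion_callback_t)(mock_completion_t *self,
                                           mock_status_t status);
#else
typedef void (*mock_completion_callback_t)(mock_completion_t *self);
#endif

struct mock_completion {
    mock_completion_callback_t func;  /* invoked when the operation completes */
    int                        count; /* outstanding operations */
};

/* Shared body: identical for both API variants. */
static void step_check_common(mock_completion_t *self)
{
    self->count--;
}

/* Thin wrapper whose signature follows the detected API. */
#ifdef HAVE_UCT_COMP_CB_STATUS_ARG
static void step_check_cb(mock_completion_t *self, mock_status_t status)
{
    (void)status; /* unused in this sketch */
    step_check_common(self);
}
#else
static void step_check_cb(mock_completion_t *self)
{
    step_check_common(self);
}
#endif

/* Emulates the library completing one operation; returns the remaining count. */
static int run_demo(void)
{
    mock_completion_t comp;

    comp.func  = step_check_cb;
    comp.count = 1;
    assert(comp.func != NULL);
#ifdef HAVE_UCT_COMP_CB_STATUS_ARG
    comp.func(&comp, MOCK_OK);
#else
    comp.func(&comp);
#endif
    return comp.count;
}
```

If the macro does not match the header actually in use, the assignment to `comp.func` is where the compiler reports the incompatible pointer type, which is the same kind of failure as the one in the original report.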

alex--m commented 3 years ago

I checked the v1.9.0 tag (cd9efd3d80ec3e5df4eadd005b9cc38f6324919e) and it has the "Before" version, so I get "uct_completion_callback_t has a status argument" and UCG's master build succeeds. The config.log there shows:

configure:18654: result: uct_completion_callback_t has a status argument
configure:18705: Building with Groups and collective operations support (UCG)

I also tested with UCX master, which has the "After" version, and the config.log shows:

conftest.c: In function 'main':
conftest.c:40:36: error: too many arguments to function 'func'
                                    func(NULL, UCS_OK);
                                    ^~~~
configure:18660: $? = 1
configure: failed program was:
| /* confdefs.h */
| #define PACKAGE_NAME "ucx"
| #define PACKAGE_TARNAME "ucx"
| #define PACKAGE_VERSION "1.10"
| #define PACKAGE_STRING "ucx 1.10"
| #define PACKAGE_BUGREPORT ""
| #define PACKAGE_URL ""
| #define STDC_HEADERS 1
| #define HAVE_SYS_TYPES_H 1
| #define HAVE_SYS_STAT_H 1
| #define HAVE_STDLIB_H 1
| #define HAVE_STRING_H 1
| #define HAVE_MEMORY_H 1
| #define HAVE_STRINGS_H 1
| #define HAVE_INTTYPES_H 1
| #define HAVE_STDINT_H 1
| #define HAVE_UNISTD_H 1
| #define __EXTENSIONS__ 1
| #define _ALL_SOURCE 1
| #define _GNU_SOURCE 1
| #define _POSIX_PTHREAD_SEMANTICS 1
| #define _TANDEM_SOURCE 1
| #define PACKAGE "ucx"
| #define VERSION "1.10"
| #define HAVE_DLFCN_H 1
| #define LT_OBJDIR ".libs/"
| #define STDC_HEADERS 1
| #define restrict __restrict
| #define HAVE_DECL_STRERROR_R 1
| #define HAVE_STRERROR_R 1
| #define STRERROR_R_CHAR_P 1
| #define UCX_CONFIGURE_FLAGS "--enable-gtest --enable-examples --with-valgrind --enable-profiling --enable-frame-pointer --enable-stats --enable-memtrack --enable-fault-injection --enable-debug-data --enable-mt --prefix=/home/alex/workspace/ucx/build --enable-ucg"
| #define UCX_MODULE_SUBDIR "ucx"
| /* end confdefs.h.  */
| #include "uct/api/uct.h"
| int
| main ()
| {
| uct_completion_callback_t func = NULL;
|                                    func(NULL, UCS_OK);
|   ;
|   return 0;
| }
configure:18667: result: uct_completion_callback_t has no status argument
configure:18712: Building with Groups and collective operations support (UCG)

@xkszltl - if I understood correctly, you're using the v1.9.0 tag. Could you please post the line from your config.log that refers to uct_completion_callback_t?

xkszltl commented 3 years ago

[image]

xkszltl commented 3 years ago

The error was found in a docker build (so it's clean and reproducible). Interestingly, I didn't hit the issue when I tried the script locally on my dev machine:

configure:17486: checking if ln -s supports --relative
configure:17489: result: yes
configure:17908: checking for dot
configure:17924: found /usr/bin/dot
configure:17935: result: yes
configure:18007: ccache /opt/rh/devtoolset-9/root/usr/bin/gcc -c      -O3 -fdebug-prefix-map='/media/Scratch/tmp.QR7Pb7ZPGh'='/usr/local/src' -g -Isrc/  conftest.c >&5
configure:18007: $? = 0
configure:18008: result: uct_completion_callback_t has a status argument
configure:18059: Building with Groups and collective operations support (UCG)
configure:20156: checking for size_t
configure:20156: ccache /opt/rh/devtoolset-9/root/usr/bin/gcc -c      -O3 -fdebug-prefix-map='/media/Scratch/tmp.QR7Pb7ZPGh'='/usr/local/src' -g  conftest.c >&5
configure:20156: $? = 0
configure:20156: ccache /opt/rh/devtoolset-9/root/usr/bin/gcc -c      -O3 -fdebug-prefix-map='/media/Scratch/tmp.QR7Pb7ZPGh'='/usr/local/src' -g  conftest.c >&5
conftest.c: In function 'main':
conftest.c:74:21: error: expected expression before ')' token
   74 | if (sizeof ((size_t)))
      |                     ^
configure:20156: $? = 1

In the bad docker build I found:

configure:17847: ccache /opt/rh/devtoolset-9/root/usr/bin/gcc -c      -O3 -fdebug-prefix-map='/tmp/scratch'='/usr/local/src' -g -Isrc/  conftest.c >&5
In file included from conftest.c:35:
src/uct/api/uct.h:16:10: fatal error: uct/api/version.h: No such file or directory
   16 | #include <uct/api/version.h>
      |          ^~~~~~~~~~~~~~~~~~~
compilation terminated.

I also found a line saying config.status:1614: creating src/uct/api/version.h, so I guess the header is generated. In the local build, because I had ucg installed previously, the check probably found the header through an -I/usr/local/include supplied elsewhere.

xkszltl commented 3 years ago

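To me this sounds like an ordering problem caused by a missing dependency: the configure check includes uct/api/uct.h before config.status has generated src/uct/api/version.h, so the test program fails to compile for a reason unrelated to the callback signature.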
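One way to avoid depending on the configure probe at all is to let the compiler pick the matching callback from the typedef itself. This was not proposed in this thread; the sketch below is only an illustration using C11 `_Generic`. Every name in it (`mock_*`, `cb_with_status_t`, `cb_without_status_t`, `PICK_COMPLETION_CB`, `MOCK_LIB_HAS_STATUS_ARG`, `run_pick_demo`) is hypothetical, and the mock typedef stands in for whatever the installed header declares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the library types; not the real UCT definitions. */
typedef int mock_status_t;
typedef struct mock_completion mock_completion_t;

typedef void (*cb_with_status_t)(mock_completion_t *self, mock_status_t status);
typedef void (*cb_without_status_t)(mock_completion_t *self);

/* Pretend this typedef comes from the library header. Defining
 * MOCK_LIB_HAS_STATUS_ARG emulates the older two-argument API. */
#ifdef MOCK_LIB_HAS_STATUS_ARG
typedef cb_with_status_t mock_completion_callback_t;
#else
typedef cb_without_status_t mock_completion_callback_t;
#endif

struct mock_completion {
    mock_completion_callback_t func;
    int                        count; /* outstanding operations */
};

/* Both variants are always compiled; only the matching one is selected. */
static void on_done_with_status(mock_completion_t *self, mock_status_t status)
{
    (void)status;
    self->count--;
}

static void on_done_without_status(mock_completion_t *self)
{
    self->count--;
}

/* Select a callback based on the type the header declares, at compile time. */
#define PICK_COMPLETION_CB(with_status, without_status) \
    _Generic((mock_completion_callback_t)0, \
             cb_with_status_t:    (with_status), \
             cb_without_status_t: (without_status))

/* Emulates the library completing one operation; returns the remaining count. */
static int run_pick_demo(void)
{
    mock_completion_t comp;

    comp.count = 1;
    comp.func  = PICK_COMPLETION_CB(on_done_with_status, on_done_without_status);
    assert(comp.func != NULL);
#ifdef MOCK_LIB_HAS_STATUS_ARG
    comp.func(&comp, 0); /* the "library" calls with its own signature */
#else
    comp.func(&comp);
#endif
    return comp.count;
}
```

The trade-off is that both callback variants must exist in the source and a C11 compiler is required, but a missing or late-generated header then shows up as an ordinary include error instead of a silently wrong feature macro.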