Question about invalidate_submit in an MPI environment

Muxas commented 1 year ago

Dear StarPU team,

Under certain circumstances (described below), when processing large amounts of data on a single node with several GPUs using StarPU-MPI, I encounter the following error:

[starpu][_starpu_redux_init_data_replicate][assert failure] There is no initialisation codelet for the reduction of the handle 0x8489200. Maybe you forget to call starpu_data_set_reduction_methods() ?

python3: datawizard/reduction.c:84: _starpu_redux_init_data_replicate: Assertion `0 && "init_cl"' failed.

I am not using reductions in any of my low-level StarPU codelets; for now I rely on the plain STARPU_R, STARPU_W and STARPU_RW access modes. I do use StarPU-MPI, but with a single MPI node. I am NOT using the starpu_mpi_task_insert utility; instead I use the node-level starpu_task_insert together with the corresponding StarPU-MPI transfer functions that support caching.

After I added several "starpu_data_invalidate_submit" calls, I started getting the above-mentioned error, but only at a certain problem size, so I cannot provide a backtrace (not immediately, at least). Therefore I would like to ask: do I need to "flush" the MPI cache manually if I invalidate data while using the StarPU-MPI library? I could not find the answer in the StarPU documentation, but this is the first possible cause that came to mind.
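
To make the question concrete, a manual flush before invalidation might look like the sketch below. It is only an illustration of the call order being asked about: whether the flush is actually required, or related to the assertion at all, is not established here. `starpu_mpi_cache_flush` and `starpu_data_invalidate_submit` are existing StarPU calls; the helper `invalidate_with_cache_flush`, the `USE_STARPU` macro, and the stand-in definitions in the `#else` branch are assumptions added so the sketch compiles without StarPU installed.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_STARPU
/* Build against the real library. */
#include <starpu.h>
#include <starpu_mpi.h>
#else
/* Stand-ins (NOT the real implementation): they only record the call
 * order so the sketch can be compiled and checked without StarPU. */
typedef void *starpu_data_handle_t;
typedef int MPI_Comm;
#define MPI_COMM_WORLD 0

static char call_log[8];
static int n_calls;

static void starpu_mpi_cache_flush(MPI_Comm comm, starpu_data_handle_t handle)
{
    (void)comm;
    (void)handle;
    call_log[n_calls++] = 'F'; /* F = cache flush */
}

static void starpu_data_invalidate_submit(starpu_data_handle_t handle)
{
    (void)handle;
    call_log[n_calls++] = 'I'; /* I = invalidate submitted */
}
#endif

/* Hypothetical helper, not part of StarPU: clear the StarPU-MPI
 * communication cache for the handle first, then submit the
 * asynchronous invalidation of its contents. */
static void invalidate_with_cache_flush(starpu_data_handle_t handle)
{
    assert(handle != NULL);
    starpu_mpi_cache_flush(MPI_COMM_WORLD, handle);
    starpu_data_invalidate_submit(handle);
}
```

In an application the helper would replace a bare `starpu_data_invalidate_submit(handle)` call; compiling with `-DUSE_STARPU` and linking against StarPU and StarPU-MPI switches from the stand-ins to the real functions.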

P.S. I am using the newly released StarPU-1.4.0.

Thank you!

Muxas commented 1 year ago

This is the backtrace I have obtained so far:

/usr/local/lib/libstarpu-1.4.so.1(+0xda320)[0x7fec5858a320]
/usr/local/lib/libstarpu-1.4.so.1(+0xbe412)[0x7fec5856e412]
/usr/local/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x33e)[0x7fec585f364e]
/usr/local/lib/libstarpu-1.4.so.1(+0x14431d)[0x7fec585f431d]
/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fec590d6609]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fec59210133]

[starpu][_starpu_redux_init_data_replicate][assert failure] There is no initialisation codelet for the reduction of the handle 0x59db010. Maybe you forget to call starpu_data_set_reduction_methods() ?

python3: datawizard/reduction.c:84: _starpu_redux_init_data_replicate: Assertion `0 && "init_cl"' failed.

Thread 4 "CUDA 1" received signal SIGABRT, Aborted.
[Switching to Thread 0x7febe253e700 (LWP 131)]
0x00007fec5913400b in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6

(gdb) thread apply all bt

Thread 25 (Thread 0x7feb7e7fc700 (LWP 156)):
#0  0x00007fec5921046e in epoll_wait () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fec1be635e9 in ?? () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.7
#2  0x00007fec1be59625 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.7
#3  0x00007febe04d1d56 in ?? () from /usr/lib/x86_64-linux-gnu/libpmix.so.2
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 24 (Thread 0x7feb7effd700 (LWP 155)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fec1be62981 in ?? () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.7
#2  0x00007fec1be59625 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.1.so.7
#3  0x00007fec52329706 in ?? () from /usr/lib/x86_64-linux-gnu/libopen-pal.so.40
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 23 (Thread 0x7feb7f7fe700 (LWP 150)):
#0  0x00007fec590dd376 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007fec589b2ecf in _starpu_mpi_progress_thread_func (arg=0x540c250) at mpi/starpu_mpi_mpi.c:1347
#2  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#3  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 22 (Thread 0x7feb7ffff700 (LWP 149)):
#0  0x00007fec585743e7 in ____starpu_datawizard_progress (memory_node=memory_node@entry=0, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:52
#1  0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=0, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#2  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:149
#3  0x00007fec585ed4af in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7fec588ce500 <_starpu_config+26944>) at drivers/cpu/driver_cpu.c:603
#4  0x00007fec585edb5d in _starpu_cpu_worker (arg=0x7fec588ce500 <_starpu_config+26944>) at drivers/cpu/driver_cpu.c:712
#5  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 21 (Thread 0x7feba8ff9700 (LWP 148)):
#0  _starpu_exponential_backoff (worker=0x7fec588cddc0 <_starpu_config+25088>, worker=0x7fec588cddc0 <_starpu_config+25088>) at drivers/driver_common/driver_common.c:392
#1  _starpu_get_worker_task (worker=worker@entry=0x7fec588cddc0 <_starpu_config+25088>, workerid=workerid@entry=10, memnode=memnode@entry=0) at drivers/driver_common/driver_common.c:503
#2  0x00007fec585ed4bc in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7fec588cddc0 <_starpu_config+25088>) at drivers/cpu/driver_cpu.c:606
#3  0x00007fec585edb5d in _starpu_cpu_worker (arg=0x7fec588cddc0 <_starpu_config+25088>) at drivers/cpu/driver_cpu.c:712
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 20 (Thread 0x7feba97fa700 (LWP 147)):
#0  0x00007fec58572b64 in __starpu_handle_node_data_requests (reqlist=0x7fec5890ab98 <_starpu_config+274392>, handling_node=handling_node@entry=0, peer_node=peer_node@entry=5, inout=inout@entry=_STARPU_DATA_REQUEST_OUT, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7feba97f9b64, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:674
#1  0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=0, peer_node=peer_node@entry=5, inout=inout@entry=_STARPU_DATA_REQUEST_OUT, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7feba97f9b64) at ./core/workers.h:677
#2  0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=0, peer_start=peer_start@entry=5, peer_end=peer_end@entry=6, inout=inout@entry=_STARPU_DATA_REQUEST_OUT, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#3  0x00007fec58574572 in ___starpu_datawizard_progress (memory_node=0, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:105
#4  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:149
#5  0x00007fec585ed4af in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7fec588cd680 <_starpu_config+23232>) at drivers/cpu/driver_cpu.c:603
#6  0x00007fec585edb5d in _starpu_cpu_worker (arg=0x7fec588cd680 <_starpu_config+23232>) at drivers/cpu/driver_cpu.c:712
#7  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 19 (Thread 0x7feba9ffb700 (LWP 146)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 18 (Thread 0x7febaa7fc700 (LWP 145)):
#0  0x00007fec585741d2 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=0, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_OUT, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febaa7fbb64) at ./core/workers.h:677
#1  0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=0, peer_start=peer_start@entry=8, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_OUT, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#2  0x00007fec58574572 in ___starpu_datawizard_progress (memory_node=0, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:105
#3  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:149
#4  0x00007fec585ed4af in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7fec588ccf40 <_starpu_config+21376>) at drivers/cpu/driver_cpu.c:603
#5  0x00007fec585edb5d in _starpu_cpu_worker (arg=0x7fec588ccf40 <_starpu_config+21376>) at drivers/cpu/driver_cpu.c:712
#6  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 17 (Thread 0x7febaaffd700 (LWP 144)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 16 (Thread 0x7febab7fe700 (LWP 143)):
#0  0x00007fec590dbe8b in pthread_rwlock_wrlock () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007febefbb592a in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefb8c7de in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefb95025 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007febefb95ba4 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007febefb9615d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007febefa815ad in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007febefa82482 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007febefc2d390 in cuMemAlloc_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fec5204affe in ?? () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#10 0x00007fec5202832b in ?? () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#11 0x00007fec52055b13 in cudaMalloc () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#12 0x00007fec585f13cd in _starpu_cuda_malloc_on_node (dst_node=<optimized out>, size=100000000, flags=<optimized out>) at drivers/cuda/driver_cuda.c:1045
#13 0x00007fec5857c841 in _starpu_malloc_on_node (dst_node=dst_node@entry=8, size=size@entry=100000000, flags=2, flags@entry=6) at datawizard/malloc.c:779
#14 0x00007fec5857da6e in starpu_malloc_on_node_flags (dst_node=8, size=size@entry=100000000, flags=6) at datawizard/malloc.c:924
#15 0x00007fec5857e470 in starpu_malloc_on_node (dst_node=<optimized out>, size=size@entry=100000000) at ./core/workers.h:677
#16 0x00007fec5859b03b in allocate_variable_buffer_on_node (data_interface_=0x7febab7fd440, dst_node=<optimized out>) at datawizard/interfaces/variable_interface.c:237
#17 0x00007fec58584564 in _starpu_allocate_interface (only_fast_alloc=0, is_prefetch=STARPU_TASK_PREFETCH, dst_node=8, replicate=0x7fec522a9d30 <fut_active>, handle=0x5a7b330) at datawizard/memalloc.c:1542
#18 _starpu_allocate_memory_on_node (handle=handle@entry=0x5a7b330, replicate=replicate@entry=0x5a7be00, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH, only_fast_alloc=only_fast_alloc@entry=0) at datawizard/memalloc.c:1673
#19 0x00007fec585753dd in _starpu_driver_copy_data_1_to_1 (handle=handle@entry=0x5a7b330, src_replicate=0x5a7be00, dst_replicate=dst_replicate@entry=0x5a7be00, donotread=1, req=req@entry=0x7fea9dce5830, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, prefetch=STARPU_TASK_PREFETCH) at datawizard/copy_driver.c:295
#20 0x00007fec585731e9 in starpu_handle_data_request (may_alloc=_STARPU_DATAWIZARD_DO_ALLOC, r=0x7fea9dce5830) at datawizard/data_request.c:626
#21 __starpu_handle_node_data_requests (reqlist=<optimized out>, handling_node=handling_node@entry=8, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7febab7fdaf4, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:744
#22 0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=8, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febab7fdaf4) at ./core/workers.h:677
#23 0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=8, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#24 0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=8, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#25 0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=1) at datawizard/datawizard.c:149
#26 0x00007fec585f3d93 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588cc800 <_starpu_config+19520>) at drivers/cuda/driver_cuda.c:2215
#27 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588cc800 <_starpu_config+19520>) at drivers/cuda/driver_cuda.c:2292
#28 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#29 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 15 (Thread 0x7febabfff700 (LWP 142)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 14 (Thread 0x7febc4ff9700 (LWP 141)):
#0  0x00007febefced8b1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007febefd37a70 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefd37b81 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefaa2941 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007febefb4e600 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007febefb4e79f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007febefcab342 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007febefb6319e in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007febefa7a4b1 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007febefc3a759 in cuEventQuery () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007fec52026ade in ?? () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#11 0x00007fec52053038 in cudaEventQuery () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#12 0x00007fec585f34ee in _starpu_cuda_driver_run_once (worker=worker@entry=0x7fec588cc0c0 <_starpu_config+17664>) at drivers/cuda/driver_cuda.c:2141
#13 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588cc0c0 <_starpu_config+17664>) at drivers/cuda/driver_cuda.c:2292
#14 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 13 (Thread 0x7febc57fa700 (LWP 140)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 12 (Thread 0x7febc5ffb700 (LWP 139)):
#0  0x00007fec5859ad58 in variable_compare (data_interface_a=0x57ed5a0, data_interface_b=0x9167ec0) at datawizard/interfaces/variable_interface.c:152
#1  0x00007fec5857f9c3 in _starpu_data_interface_compare (data_interface_a=<optimized out>, ops_a=0x7fec58664960 <starpu_interface_variable_ops>, data_interface_b=<optimized out>, ops_b=<optimized out>, ops_b=<optimized out>) at datawizard/memalloc.c:729
#2  0x00007fec58584835 in try_to_reuse_potentially_in_use_mc (is_prefetch=STARPU_TASK_PREFETCH, footprint=3353864211, replicate=<optimized out>, handle=0x57ebe70, node=6) at datawizard/memalloc.c:886
#3  _starpu_allocate_interface (only_fast_alloc=0, is_prefetch=STARPU_TASK_PREFETCH, dst_node=6, replicate=0x750, handle=0x57ebe70) at datawizard/memalloc.c:1575
#4  _starpu_allocate_memory_on_node (handle=handle@entry=0x57ebe70, replicate=replicate@entry=0x57ec6d0, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH, only_fast_alloc=only_fast_alloc@entry=0) at datawizard/memalloc.c:1673
#5  0x00007fec585753dd in _starpu_driver_copy_data_1_to_1 (handle=handle@entry=0x57ebe70, src_replicate=0x57ec940, dst_replicate=dst_replicate@entry=0x57ec6d0, donotread=0, req=req@entry=0x7fea9d9645b0, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, prefetch=STARPU_TASK_PREFETCH) at datawizard/copy_driver.c:295
#6  0x00007fec585731e9 in starpu_handle_data_request (may_alloc=_STARPU_DATAWIZARD_DO_ALLOC, r=0x7fea9d9645b0) at datawizard/data_request.c:626
#7  __starpu_handle_node_data_requests (reqlist=<optimized out>, handling_node=handling_node@entry=6, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7febc5ffaaf4, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:744
#8  0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=6, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febc5ffaaf4) at ./core/workers.h:677
#9  0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=6, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#10 0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=6, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#11 0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=1) at datawizard/datawizard.c:149
#12 0x00007fec585f3d93 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588cb980 <_starpu_config+15808>) at drivers/cuda/driver_cuda.c:2215
#13 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588cb980 <_starpu_config+15808>) at drivers/cuda/driver_cuda.c:2292
#14 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 11 (Thread 0x7febc67fc700 (LWP 138)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 10 (Thread 0x7febc6ffd700 (LWP 137)):
#0  0x00007febefcaaa9d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007febefcab297 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefcabb5f in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefb988c7 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007febefccd9ab in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007febefa84674 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007febefa85825 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007febefc55215 in cuMemcpyPeerAsync () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007fec52029397 in ?? () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#9  0x00007fec5205923a in cudaMemcpyPeerAsync () from /usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudart.so.11.0
#10 0x00007fec585f15eb in starpu_cuda_copy_async_sync (src_ptr=src_ptr@entry=0x7fdbfa000000, src_node=src_node@entry=4, dst_ptr=dst_ptr@entry=0x7fa81c000000, dst_node=dst_node@entry=5, ssize=ssize@entry=100000000, stream=0x7febb0776080, kind=cudaMemcpyDeviceToDevice) at drivers/cuda/driver_cuda.c:1132
#11 0x00007fec585f1841 in _starpu_cuda_copy_data_from_cuda_to_cuda (src=140582768869376, src_offset=0, src_node=4, dst=140360000995328, dst_offset=0, dst_node=5, size=100000000, async_channel=0x7fea49b4e228) at drivers/cuda/driver_cuda.c:1602
#12 0x00007fec58575c3f in starpu_interface_copy (src=<optimized out>, src_offset=src_offset@entry=0, src_node=src_node@entry=4, dst=<optimized out>, dst_offset=dst_offset@entry=0, dst_node=dst_node@entry=5, size=100000000, async_data=0x7fea49b4e228) at datawizard/copy_driver.c:430
#13 0x00007fec5859aeb3 in copy_any_to_any (src_interface=<optimized out>, src_node=4, dst_interface=<optimized out>, dst_node=5, async_data=<optimized out>) at datawizard/interfaces/variable_interface.c:310
#14 0x00007fec585f2372 in _starpu_cuda_copy_interface_from_cuda_to_cuda (handle=<optimized out>, src_interface=0x5778930, src_node=<optimized out>, dst_interface=0x5778960, dst_node=5, req=0x7fea49b4e1d0) at drivers/cuda/driver_cuda.c:1467
#15 0x00007fec585755b1 in copy_data_1_to_1_generic (req=0x5777260, dst_replicate=0xbe929ab07c4eca00, src_replicate=<optimized out>, handle=0x5777260) at datawizard/copy_driver.c:205
#16 _starpu_driver_copy_data_1_to_1 (handle=handle@entry=0x5777260, src_replicate=<optimized out>, dst_replicate=dst_replicate@entry=0x5777988, donotread=<optimized out>, req=req@entry=0x7fea49b4e1d0, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, prefetch=STARPU_TASK_PREFETCH) at datawizard/copy_driver.c:363
#17 0x00007fec585731e9 in starpu_handle_data_request (may_alloc=_STARPU_DATAWIZARD_DO_ALLOC, r=0x7fea49b4e1d0) at datawizard/data_request.c:626
#18 __starpu_handle_node_data_requests (reqlist=<optimized out>, handling_node=handling_node@entry=5, peer_node=peer_node@entry=4, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7febc6ffcaf4, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:744
#19 0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=5, peer_node=peer_node@entry=4, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febc6ffcaf4) at ./core/workers.h:677
#20 0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=5, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#21 0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=5, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#22 0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=1) at datawizard/datawizard.c:149
#23 0x00007fec585f3d93 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588cb240 <_starpu_config+13952>) at drivers/cuda/driver_cuda.c:2215
#24 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588cb240 <_starpu_config+13952>) at drivers/cuda/driver_cuda.c:2292
#25 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#26 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 9 (Thread 0x7febc77fe700 (LWP 136)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 8 (Thread 0x7febc7fff700 (LWP 135)):
#0  starpu_data_can_evict (handle=0x5abece0, node=4, is_prefetch=STARPU_TASK_PREFETCH) at datawizard/memalloc.c:542
#1  0x00007fec58581869 in try_to_throw_mem_chunk (mc=<optimized out>, node=node@entry=4, replicate=replicate@entry=0x0, is_already_in_mc_list=is_already_in_mc_list@entry=0, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH) at datawizard/memalloc.c:575
#2  0x00007fec58581f79 in free_potentially_in_use_mc (node=node@entry=4, force=force@entry=0, reclaim=reclaim@entry=200000000, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH) at datawizard/memalloc.c:1010
#3  0x00007fec5858217a in free_potentially_in_use_mc (is_prefetch=STARPU_TASK_PREFETCH, reclaim=200000000, force=0, node=4) at ./core/workers.h:676
#4  _starpu_memory_reclaim_generic (node=node@entry=4, force=force@entry=0, reclaim=reclaim@entry=200000000, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH) at datawizard/memalloc.c:1079
#5  0x00007fec585848d1 in _starpu_allocate_interface (only_fast_alloc=0, is_prefetch=STARPU_TASK_PREFETCH, dst_node=4, replicate=0x7fec522a9d30 <fut_active>, handle=0x59878f0) at datawizard/memalloc.c:1590
#6  _starpu_allocate_memory_on_node (handle=handle@entry=0x59878f0, replicate=replicate@entry=0x5987ee0, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH, only_fast_alloc=only_fast_alloc@entry=0) at datawizard/memalloc.c:1673
#7  0x00007fec585753dd in _starpu_driver_copy_data_1_to_1 (handle=handle@entry=0x59878f0, src_replicate=0x5987a00, dst_replicate=dst_replicate@entry=0x5987ee0, donotread=0, req=req@entry=0x7fea9db602b0, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, prefetch=STARPU_TASK_PREFETCH) at datawizard/copy_driver.c:295
#8  0x00007fec585731e9 in starpu_handle_data_request (may_alloc=_STARPU_DATAWIZARD_DO_ALLOC, r=0x7fea9db602b0) at datawizard/data_request.c:626
#9  __starpu_handle_node_data_requests (reqlist=<optimized out>, handling_node=handling_node@entry=4, peer_node=peer_node@entry=0, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7febc7ffeaf4, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:744
#10 0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=4, peer_node=peer_node@entry=0, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febc7ffeaf4) at ./core/workers.h:677
#11 0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=4, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#12 0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=4, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#13 0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=1) at datawizard/datawizard.c:149
#14 0x00007fec585f3d93 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588cab00 <_starpu_config+12096>) at drivers/cuda/driver_cuda.c:2215
#15 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588cab00 <_starpu_config+12096>) at drivers/cuda/driver_cuda.c:2292
#16 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#17 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 7 (Thread 0x7febe0d3b700 (LWP 134)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 6 (Thread 0x7febe1d3d700 (LWP 133)):
#0  0x00007fec5920399f in poll () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb190b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefc7d8da in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7febe153c700 (LWP 132)):
#0  0x00007fec58584830 in try_to_reuse_potentially_in_use_mc (is_prefetch=STARPU_TASK_PREFETCH, footprint=3353864211, replicate=<optimized out>, handle=0x5aa6770, node=3) at datawizard/memalloc.c:886
#1  _starpu_allocate_interface (only_fast_alloc=0, is_prefetch=STARPU_TASK_PREFETCH, dst_node=3, replicate=0x3a8, handle=0x5aa6770) at datawizard/memalloc.c:1575
#2  _starpu_allocate_memory_on_node (handle=handle@entry=0x5aa6770, replicate=replicate@entry=0x5aa6c28, is_prefetch=is_prefetch@entry=STARPU_TASK_PREFETCH, only_fast_alloc=only_fast_alloc@entry=0) at datawizard/memalloc.c:1673
#3  0x00007fec585753dd in _starpu_driver_copy_data_1_to_1 (handle=handle@entry=0x5aa6770, src_replicate=0x5aa6af0, dst_replicate=dst_replicate@entry=0x5aa6c28, donotread=0, req=req@entry=0x7fea65b528f0, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, prefetch=STARPU_TASK_PREFETCH) at datawizard/copy_driver.c:295
#4  0x00007fec585731e9 in starpu_handle_data_request (may_alloc=_STARPU_DATAWIZARD_DO_ALLOC, r=0x7fea65b528f0) at datawizard/data_request.c:626
#5  __starpu_handle_node_data_requests (reqlist=<optimized out>, handling_node=handling_node@entry=3, peer_node=peer_node@entry=2, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=2, pushed=0x7febe153baf4, prefetch=STARPU_PREFETCH) at datawizard/data_request.c:744
#6  0x00007fec585741f8 in _starpu_handle_node_prefetch_requests (handling_node=handling_node@entry=3, peer_node=peer_node@entry=2, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febe153baf4) at ./core/workers.h:677
#7  0x00007fec5857443f in ____starpu_datawizard_progress (memory_node=memory_node@entry=3, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:66
#8  0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=3, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#9  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=1) at datawizard/datawizard.c:149
#10 0x00007fec585f3d93 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588ca3c0 <_starpu_config+10240>) at drivers/cuda/driver_cuda.c:2215
#11 0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588ca3c0 <_starpu_config+10240>) at drivers/cuda/driver_cuda.c:2292
#12 0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7febe253e700 (LWP 131)):
#0  0x00007fec5913400b in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fec59113859 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fec59113729 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fec59124fd6 in __assert_fail () from /usr/lib/x86_64-linux-gnu/libc.so.6
#4  0x00007fec5858a375 in _starpu_redux_init_data_replicate (handle=handle@entry=0x59db010, replicate=0x59db390, workerid=workerid@entry=1) at datawizard/reduction.c:84
#5  0x00007fec5856e412 in _starpu_fetch_task_input_tail (task=task@entry=0x6bb8d70, j=j@entry=0x6bb8fa0, worker=worker@entry=0x7fec588c9c80 <_starpu_config+8384>) at datawizard/coherency.c:1312
#6  0x00007fec585f364e in _starpu_cuda_driver_run_once (worker=worker@entry=0x7fec588c9c80 <_starpu_config+8384>) at drivers/cuda/driver_cuda.c:2102
#7  0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588c9c80 <_starpu_config+8384>) at drivers/cuda/driver_cuda.c:2292
#8  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7febe8a27700 (LWP 130)):
#0  0x00007fec58572b6c in __starpu_handle_node_data_requests (reqlist=0x7fec5890c168 <_starpu_config+279976>, handling_node=handling_node@entry=1, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, n=n@entry=1, pushed=0x7febe8a26af4, prefetch=STARPU_IDLEFETCH) at datawizard/data_request.c:674
#1  0x00007fec585742c8 in _starpu_handle_node_idle_requests (handling_node=handling_node@entry=1, peer_node=peer_node@entry=8, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, pushed=pushed@entry=0x7febe8a26af4) at ./core/workers.h:677
#2  0x00007fec585744b1 in ____starpu_datawizard_progress (memory_node=memory_node@entry=1, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:78
#3  0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=1, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#4  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:149
#5  0x00007fec585f3a95 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x7fec588c9540 <_starpu_config+6528>) at drivers/cuda/driver_cuda.c:2221
#6  0x00007fec585f431d in _starpu_cuda_worker (_arg=0x7fec588c9540 <_starpu_config+6528>) at drivers/cuda/driver_cuda.c:2292
#7  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7febe9967700 (LWP 129)):
#0  0x00007fec59211b30 in accept4 () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007febefbb29c6 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007febefba3f6d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007febefbb4a18 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#5  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Thread 1 (Thread 0x7fec58f25740 (LWP 124)):
#0  0x00007fec591ce23f in clock_nanosleep () from /usr/lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fec591d3ec7 in nanosleep () from /usr/lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fec58c77cc5 in std::this_thread::sleep_for<long, std::ratio<1l, 1000l> > (__rtime=...) at /usr/include/c++/9/thread:378
#3  <lambda()>::operator() (__closure=<optimized out>) at /nntile/wrappers/python/nntile/nntile_core.cc:53
#4  pybind11::detail::argument_loader<>::call_impl<void, def_mod_starpu(pybind11::module_&)::<lambda()>&, pybind11::detail::void_type> (this=<synthetic pointer>, f=...) at /nntile/external/pybind11/include/pybind11/cast.h:1443
#5  pybind11::detail::argument_loader<>::call<void, pybind11::detail::void_type, def_mod_starpu(pybind11::module_&)::<lambda()>&> (this=<synthetic pointer>, f=...) at /nntile/external/pybind11/include/pybind11/cast.h:1417
#6  pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (this=0x0, call=...) at /nntile/external/pybind11/include/pybind11/pybind11.h:248
#7  pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &) () at /nntile/external/pybind11/include/pybind11/pybind11.h:223
#8  0x00007fec58c9f3c8 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fec58eb3040, kwargs_in=0x0) at /nntile/external/pybind11/include/pybind11/pybind11.h:939
#9  0x00000000005f6489 in PyCFunction_Call ()
#10 0x00000000005f7056 in _PyObject_MakeTpCall ()
#11 0x000000000057107e in _PyEval_EvalFrameDefault ()
#12 0x0000000000569cea in _PyEval_EvalCodeWithName ()
#13 0x000000000068e7b7 in PyEval_EvalCode ()
#14 0x0000000000680001 in ?? ()
#15 0x000000000068007f in ?? ()
#16 0x0000000000680121 in ?? ()
#17 0x0000000000680db7 in PyRun_SimpleFileExFlags ()
#18 0x00000000006b8122 in Py_RunMain ()
#19 0x00000000006b84ad in Py_BytesMain ()
#20 0x00007fec59115083 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6
#21 0x00000000005fb39e in _start ()

Muxas commented 1 year ago

Now I am sure it is not my code that ruins large runs: Chameleon-based code fails in the same way. An example of a so-called deep linear neural network is available at https://github.com/Muxas/deep_linear_network/blob/main/chameleon/test.cc

I compiled and installed StarPU-1.4 and the latest Chameleon (commit 5355b9a3b51dbbb6df550617cca0bd86d0772975 from https://gitlab.inria.fr/solverstack/chameleon.git ). The provided test.cc (from the link above) is linked against Chameleon.

I run the example on a DGX-1 server with 8 Nvidia V100 GPUs using the following command:

STARPU_NCPU=5 STARPU_SCHED_BETA=0 ./a.out 50000 20000 4 5000

And then (after some time) I get the following output:

 B is the number of input samples
 N is the number of inputs/outputs of each linear layer
 D is the number of linear layers
 NB is the tile size (parameter of the CHAMELEON library)

 1. Generate random BxN matrix X_0 and D random NxN matrices W_0, ..., W_{D-1}
 2. Compute BxN matrices X_1 = X_0 W_0, ..., X_D = X_{D-1} W_{D-1}
 3. Set BxN matrix G_D = X_D
 4. Multiply G_{D-1} = G_D W_{D-1}', ..., G_1 = G_2 W_1'
 5. Compute NxN matrices Y_D = X_D' G_D, ..., Y_1 = X_1' G_1
 6. Update W_i += 1e-16 Y_{i+1}
B=50000 N=20000 D=4 NB=5000
Allocating memory (50.6639 GBytes)...
Total memory allocated on RAM is 50.6639 GBytes

Initializing random data...
Initialization Complete
Starting the process
/usr/local/lib/libstarpu-1.4.so.1(+0xda320)[0x7f5bcb7bf320]
/usr/local/lib/libstarpu-1.4.so.1(+0xbe412)[0x7f5bcb7a3412]
/usr/local/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x33e)[0x7f5bcb82864e]
/usr/local/lib/libstarpu-1.4.so.1(+0x14431d)[0x7f5bcb82931d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f5bbfd7d609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f5bcb046133]

[starpu][_starpu_redux_init_data_replicate][assert failure] There is no initialisation codelet for the reduction of the handle 0x55cf68ae5e40. Maybe you forget to call starpu_data_set_reduction_methods() ?

a.out: datawizard/reduction.c:84: _starpu_redux_init_data_replicate: Assertion `0 && "init_cl"' failed.
Aborted (core dumped)

Zoragna commented 1 year ago

Hello,

reduction patterns are used in GEMM by Chameleon if it switches to the A-stat variant of the algorithm (see https://gitlab.inria.fr/pswartva/chameleon/-/blob/master/compute/zgemm.c#L162). If your input matrices have sizes that match this decision, then it's possible that the failure is caused by these reduction patterns.

One question: do you enable CPU workers on your machine?

Chameleon lacks a GPU codelet for init_cl (search for zgersum_init in the source files). Therefore I'd suggest the following:

1) Try to set up the algorithm with the proper environment variable: export CHAMELEON_GEMM_ALGO="summa_c"
2) Try adding CPU workers (at least one).

(It's been 2 weeks, so maybe you've been able to sort things out yourself; if you've added a GPU initialization codelet, maybe the people working on Chameleon would be happy to integrate it.)

Muxas commented 1 year ago

Hi!

I think this is a problem on the StarPU side. I did find a workaround, however: downgrading from version 1.4.0 to version 1.3.10 resolved it. Both my private application and the provided Chameleon application hit the same assertion failure with StarPU 1.4.0 and work correctly with StarPU 1.3.10. The backtrace (copied from the long message above) shows that at least one CPU worker is enabled:

Thread 22 (Thread 0x7feb7ffff700 (LWP 149)):
#0  0x00007fec585743e7 in ____starpu_datawizard_progress (memory_node=memory_node@entry=0, peer_start=peer_start@entry=0, peer_end=peer_end@entry=9, inout=inout@entry=_STARPU_DATA_REQUEST_IN, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=<optimized out>, push_requests@entry=1) at datawizard/datawizard.c:52
#1  0x00007fec5857454c in ___starpu_datawizard_progress (memory_node=0, nnodes=nnodes@entry=9, may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:101
#2  0x00007fec5857465a in __starpu_datawizard_progress (may_alloc=may_alloc@entry=_STARPU_DATAWIZARD_DO_ALLOC, push_requests=push_requests@entry=1) at datawizard/datawizard.c:149
#3  0x00007fec585ed4af in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7fec588ce500 <_starpu_config+26944>) at drivers/cpu/driver_cpu.c:603
#4  0x00007fec585edb5d in _starpu_cpu_worker (arg=0x7fec588ce500 <_starpu_config+26944>) at drivers/cpu/driver_cpu.c:712
#5  0x00007fec590d6609 in start_thread () from /usr/lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007fec59210133 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6

Apart from starpu_data_cpy, none of the codelets in my own application (not provided here) use the STARPU_REDUX or STARPU_COMMUTE access modes, so there should be no reduction. Yet StarPU 1.4.0 raises an error related to the reduction codelet. It might be a starpu_data_cpy issue or something else.

nfurmento commented 1 year ago

Hello @Muxas, have you tried the solution proposed by @Zoragna? Could you please try with the latest 1.4 branch? We have fixed many bugs. If all of that fails, please send us information on the topology of the machine you are using, along with the contents of the StarPU config.log file. Thanks, Nathalie

Muxas commented 1 year ago

Hello!

I did try the latest commit of the starpu-1.4 branch from https://gitlab.inria.fr/starpu/starpu.git. It did not help.

Adding CPU workers did not help (they were enabled in any case)
Switching the summation algorithm in Chameleon did not help

The only option that works is switching back to the StarPU-1.3.10 tag

nfurmento commented 1 year ago

OK, could you please send us the StarPU config.log file, along with the output of starpu_machine_display?

Muxas commented 1 year ago

I did not find config.log. Do you mean the output of the configure script?

    CPUs     enabled: yes
    CUDA     enabled: yes
    HIP      enabled: no
    OpenCL   enabled: no
    Max FPGA enabled: no

    Compile-time limits
    (change these with --enable-maxcpus, --enable-maxcudadev,
    --enable-maxopencldev, --enable-maxmaxfpgadev, --enable-maxnodes, --enable-maxbuffers)
        (Note these numbers do not represent the number of detected
    devices, but the maximum number of devices StarPU can manage)

    Maximum number of CPUs:                        128
    Maximum number of CUDA devices:                8
    Maximum number of HIP devices:                 8
    Maximum number of OpenCL devices:              0
    Maximum number of Maxeler FPGA devices:        0
    Maximum number of MPI master-slave devices:    1
    Maximum number of TCP/IP master-slave devices: 1
    Maximum number of memory nodes:                16
    Maximum number of task buffers:                8

    CUDA GPU-GPU transfers: yes
    CUDA Map:               yes
    HIP GPU-GPU transfers:  no
    Allocation cache:       yes

    Magma enabled:     no
    BLAS library:      none
    hwloc:             yes
    FxT trace enabled: yes

        Documentation HTML:  no
        Documentation PDF:   no
        Examples:            no

    StarPU Extensions:
           StarPU MPI enabled:                            yes
           StarPU MPI failure tolerance:                  no
           StarPU MPI failure tolerance stats:            no
           StarPU MPI(nmad) enabled:                      no
           MPI test suite:                                yes
           Master-Slave MPI enabled:                      no
           Master-Slave TCP/IP enabled:                   no
           FFT Support:                                   no
           Resource Management enabled:                   no
           Python Interface enabled:                      no
           OpenMP runtime support enabled:                yes
           OpenMP LLVM runtime support enabled:           no
           Parallel Worker support enabled:               no
           SOCL enabled:                                  no
           SOCL test suite:                               no
           Scheduler Hypervisor:                          no
           simgrid enabled:                               no
           ayudame enabled:                               no
           HDF5 enabled:                                  no
           Native fortran support:                        no
           Native MPI fortran support:                    no
           Support for multiple linear regression models: no
           Hierarchical dags support:                     no
           JULIA enabled:                                 no

And here is the output for the command STARPU_NCPU=5 STARPU_NCUDA=3 STARPU_SCHED=dmda CUDA_VISIBLE_DEVICES=5,6,7 starpu_machine_display

Environment variables
    STARPU_NCPU=5
    STARPU_NCUDA=3
    STARPU_HOSTNAME=hachiko1

StarPU has found :
5 CPU workers:
    CPU 0
    CPU 1
    CPU 2
    CPU 3
    CPU 4
3 CUDA workers:
    CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 86:00.0)
    CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 89:00.0)
    CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 8a:00.0)
No OpenCL worker
No FPGA worker
No MPI_MS worker
No TCPIP_MS worker
No HIP worker

topology ... (hwloc logical indexes)
numa  0 pack  0 core 0  PU 0    CPU 0   
            PU 1    
        core 1  PU 2    CPU 1   
            PU 3    
        core 2  PU 4    CPU 2   
            PU 5    
        core 3  PU 6    CPU 3   
            PU 7    
        core 4  PU 8    CPU 4   
            PU 9    
        core 5  PU 10   
            PU 11   
        core 6  PU 12   
            PU 13   
        core 7  PU 14   
            PU 15   
        core 8  PU 16   
            PU 17   
        core 9  PU 18   
            PU 19   
        core 10 PU 20   
            PU 21   
        core 11 PU 22   
            PU 23   
        core 12 PU 24   
            PU 25   
        core 13 PU 26   
            PU 27   
        core 14 PU 28   
            PU 29   
        core 15 PU 30   
            PU 31   
        core 16 PU 32   
            PU 33   
        core 17 PU 34   
            PU 35   
        core 18 PU 36   
            PU 37   
        core 19 PU 38   
            PU 39   
numa  1 pack  1 core 20 PU 40   CUDA 0.0 (Tesla V100-SXM2-16GB 14.2 GiB 86:00.0)    
            PU 41   
        core 21 PU 42   CUDA 1.0 (Tesla V100-SXM2-16GB 14.2 GiB 89:00.0)    
            PU 43   
        core 22 PU 44   CUDA 2.0 (Tesla V100-SXM2-16GB 14.2 GiB 8a:00.0)    
            PU 45   
        core 23 PU 46   
            PU 47   
        core 24 PU 48   
            PU 49   
        core 25 PU 50   
            PU 51   
        core 26 PU 52   
            PU 53   
        core 27 PU 54   
            PU 55   
        core 28 PU 56   
            PU 57   
        core 29 PU 58   
            PU 59   
        core 30 PU 60   
            PU 61   
        core 31 PU 62   
            PU 63   
        core 32 PU 64   
            PU 65   
        core 33 PU 66   
            PU 67   
        core 34 PU 68   
            PU 69   
        core 35 PU 70   
            PU 71   
        core 36 PU 72   
            PU 73   
        core 37 PU 74   
            PU 75   
        core 38 PU 76   
            PU 77   
        core 39 PU 78   
            PU 79   

bandwidth (MB/s) and latency (us)...
from/to  NUMA 0  CUDA 0  CUDA 1  CUDA 2
NUMA 0        0    6787    6987    7066
CUDA 0     6882       0    6827    7267
CUDA 1     6273    9480       0   47148
CUDA 2     6316   11163   47355       0

NUMA 0        0       0       9       9
CUDA 0        0       0       9       9
CUDA 1       12      16       0      13
CUDA 2       13      11      12       0

GPU NUMA in preference order (logical index), host-to-device, device-to-host
CUDA_0   0 6987 6273     1 6827 9480    
CUDA_1   0 7066 6316     1 7267 11163   
CUDA_2   0 6504 6288     1 6747 10848

nfurmento commented 1 year ago

In the directory from which you started configure, there is a file named config.log.

nfurmento commented 1 year ago

Hello, could you please also send the command you used to configure Chameleon? I just tried on one of our machines, and the example you sent fails with an out-of-memory error:

./a.out 50000 20000 4 5000
test B N D NB
 B is the number of input samples
 N is the number of inputs/outputs of each linear layer
 D is the number of linear layers
 NB is the tile size (parameter of the CHAMELEON library)

 1. Generate random BxN matrix X_0 and D random NxN matrices W_0, ..., W_{D-1}
 2. Compute BxN matrices X_1 = X_0 W_0, ..., X_D = X_{D-1} W_{D-1}
 3. Set BxN matrix G_D = X_D
 4. Multiply G_{D-1} = G_D W_{D-1}', ..., G_1 = G_2 W_1'
 5. Compute NxN matrices Y_D = X_D' G_D, ..., Y_1 = X_1' G_1
 6. Update W_i += 1e-16 Y_{i+1}
B=50000 N=20000 D=4 NB=5000
Allocating memory (50.6639 GBytes)...
CHAMELEON ERROR: chameleon_desc_mat_alloc(): malloc() failed

A smaller problem works fine:

./a.out 500 200 4 50
test B N D NB
 B is the number of input samples
 N is the number of inputs/outputs of each linear layer
 D is the number of linear layers
 NB is the tile size (parameter of the CHAMELEON library)

 1. Generate random BxN matrix X_0 and D random NxN matrices W_0, ..., W_{D-1}
 2. Compute BxN matrices X_1 = X_0 W_0, ..., X_D = X_{D-1} W_{D-1}
 3. Set BxN matrix G_D = X_D
 4. Multiply G_{D-1} = G_D W_{D-1}', ..., G_1 = G_2 W_1'
 5. Compute NxN matrices Y_D = X_D' G_D, ..., Y_1 = X_1' G_1
 6. Update W_i += 1e-16 Y_{i+1}
B=500 N=200 D=4 NB=50
Allocating memory (0.00506639 GBytes)...
Total memory allocated on RAM is 0.00506639 GBytes

Initializing random data...
Initialization Complete
Starting the process
Time, s: 0.363597
GFLOPS : 4.4
GFLOP/s: 12.1013

Muxas commented 1 year ago

Hi!

So, here is all the info:

Dockerfile to reproduce results: https://gist.github.com/Muxas/96067b796087a49b7a91af3b89fa7064#file-dockerfile

StarPU repo: https://gitlab.inria.fr/starpu/starpu.git
StarPU commit: cc18609f561a9e67026c527f96df95ab34c5105f
StarPU config.log: https://gist.github.com/Muxas/96067b796087a49b7a91af3b89fa7064#file-config-log

Chameleon repo: https://gitlab.inria.fr/solverstack/chameleon.git
Chameleon commit: 5355b9a3b51dbbb6df550617cca0bd86d0772975
Chameleon configuration command: cmake .. -DCMAKE_CUDA_ARCHITECTURES=native -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCHAMELEON_USE_CUDA=ON

The error appears only in multi-GPU execution. A single GPU is OK.

The problem sizes that raise the [starpu][_starpu_redux_init_data_replicate][assert failure] are listed below. Since I am running on a shared server, I specify which Nvidia GPUs to use. Unfortunately, smaller problem sizes did not raise the issue.

Allocates 28.3122 GBytes: STARPU_NCPU=5 STARPU_NCUDA=3 STARPU_SCHED=dmda CUDA_VISIBLE_DEVICES=5,6,7 ./a.out 20000 20000 4 5000
Allocates 18.1198 GBytes: STARPU_NCPU=5 STARPU_NCUDA=3 STARPU_SCHED=dmda CUDA_VISIBLE_DEVICES=5,6,7 ./a.out 16000 16000 4 4000
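
For reference, the allocation figures above can be reproduced with a short calculation. This is only a sketch, not code taken from test.cc: the function names are hypothetical, and it assumes D+1 matrices each for X (BxN), G (BxN) and Y (NxN) plus D matrices W (NxN), all stored as 4-byte floats, with "GBytes" meaning 2^30 bytes.

```cpp
#include <cstdint>

// Total bytes for the X, G, Y and W matrices of the deep linear network test.
// Assumption: D+1 matrices each of X (BxN), G (BxN) and Y (NxN), plus D
// matrices W (NxN), all stored as single-precision floats.
std::uint64_t dln_total_bytes(std::uint64_t B, std::uint64_t N, std::uint64_t D)
{
    const std::uint64_t elem_size = sizeof(float);
    const std::uint64_t x_and_g = 2 * (D + 1) * B * N; // X_0..X_D and G_0..G_D
    const std::uint64_t y = (D + 1) * N * N;           // Y_0..Y_D
    const std::uint64_t w = D * N * N;                 // W_0..W_{D-1}
    return (x_and_g + y + w) * elem_size;
}

// Same quantity in units of 2^30 bytes (assumed to be what the test
// prints as "GBytes", inferred from the reported numbers).
double dln_total_gbytes(std::uint64_t B, std::uint64_t N, std::uint64_t D)
{
    return static_cast<double>(dln_total_bytes(B, N, D)) / (1024.0 * 1024.0 * 1024.0);
}
```

Under these assumptions, B=20000, N=20000, D=4 gives about 28.3122 and B=16000, N=16000, D=4 gives about 18.1198, which matches the two figures listed above.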

nfurmento commented 1 year ago

Could you please try using a tile allocation mechanism? When calling CHAMELEON_Desc_Create, you need to replace the 2nd parameter nullptr with CHAMELEON_MAT_ALLOC_TILE. I cannot try on my machine as the allocation fails.

nfurmento commented 1 year ago

I am running the application on my machine with CHAMELEON_MAT_ALLOC_TILE. It takes a very long time, but so far it has not crashed.

Muxas commented 1 year ago

I switched the following lines in the Desc_Create phase:

    for(int i = 0; i <= D; ++i)
    {
        //CHAMELEON_Desc_Create(&desc_X[i], nullptr, ChamRealFloat, NB, NB,
        //        NB * NB, ldX, N, 0, 0, B, N, P, Q);
        //CHAMELEON_Desc_Create(&desc_G[i], nullptr, ChamRealFloat, NB, NB,
        //        NB * NB, ldG, N, 0, 0, B, N, P, Q);
        //CHAMELEON_Desc_Create(&desc_Y[i], nullptr, ChamRealFloat, NB, NB,
        //        NB * NB, ldY, N, 0, 0, N, N, P, Q);
        CHAMELEON_Desc_Create(&desc_X[i], CHAMELEON_MAT_ALLOC_TILE, ChamRealFloat, NB, NB,
                NB * NB, ldX, N, 0, 0, B, N, P, Q);
        CHAMELEON_Desc_Create(&desc_G[i], CHAMELEON_MAT_ALLOC_TILE, ChamRealFloat, NB, NB,
                NB * NB, ldG, N, 0, 0, B, N, P, Q);
        CHAMELEON_Desc_Create(&desc_Y[i], CHAMELEON_MAT_ALLOC_TILE, ChamRealFloat, NB, NB,
                NB * NB, ldY, N, 0, 0, N, N, P, Q);
    }
    for(int i = 0; i < D; ++i)
    {
        //CHAMELEON_Desc_Create(&desc_W[i], nullptr, ChamRealFloat, NB, NB,
        //        NB * NB, ldW, N, 0, 0, N, N, P, Q);
        CHAMELEON_Desc_Create(&desc_W[i], CHAMELEON_MAT_ALLOC_TILE, ChamRealFloat, NB, NB,
                NB * NB, ldW, N, 0, 0, N, N, P, Q);
    }

However, the result is still the same:

root@26382ef69eca:~/deep_linear_network/chameleon# STARPU_NCPU=5 STARPU_NCUDA=3 STARPU_SCHED=dmda CUDA_VISIBLE_DEVICES=4,6,7 ./a.out 20000 20000 4 5000
test B N D NB
 B is the number of input samples
 N is the number of inputs/outputs of each linear layer
 D is the number of linear layers
 NB is the tile size (parameter of the CHAMELEON library)

 1. Generate random BxN matrix X_0 and D random NxN matrices W_0, ..., W_{D-1}
 2. Compute BxN matrices X_1 = X_0 W_0, ..., X_D = X_{D-1} W_{D-1}
 3. Set BxN matrix G_D = X_D
 4. Multiply G_{D-1} = G_D W_{D-1}', ..., G_1 = G_2 W_1'
 5. Compute NxN matrices Y_D = X_D' G_D, ..., Y_1 = X_1' G_1
 6. Update W_i += 1e-16 Y_{i+1}
B=20000 N=20000 D=4 NB=5000
Allocating memory (28.3122 GBytes)...
Total memory allocated on RAM is 28.3122 GBytes

Initializing random data...
Initialization Complete
Starting the process
/usr/local/lib/libstarpu-1.4.so.1(+0xda530)[0x7fe27e93f530]
/usr/local/lib/libstarpu-1.4.so.1(+0xbe612)[0x7fe27e923612]
/usr/local/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x33e)[0x7fe27e9a8a4e]
/usr/local/lib/libstarpu-1.4.so.1(+0x14471d)[0x7fe27e9a971d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fe272efd609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fe27e1c6133]

[starpu][_starpu_redux_init_data_replicate][assert failure] There is no initialisation codelet for the reduction of the handle 0x558ee7e118b0. Maybe you forget to call starpu_data_set_reduction_methods() ?

a.out: datawizard/reduction.c:84: _starpu_redux_init_data_replicate: Assertion `0 && "init_cl"' failed.
Aborted

I must emphasize that switching the scheduling policy to eager or lws solves the problem. At least, it does not appear in my tests.
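
If the application itself should fall back to one of these policies, one option is to set STARPU_SCHED from the program before StarPU is initialized. A minimal sketch, assuming a POSIX system (setenv); select_scheduler_fallback is a hypothetical helper name, not part of StarPU or Chameleon. It only sidesteps the failure seen with dmda and does not fix it.

```cpp
#include <cstdlib>
#include <string>

// Set STARPU_SCHED unless the user already chose a scheduler.
// Assumption: this is called before StarPU (or Chameleon, which starts
// StarPU) is initialized, since the variable is read at initialization.
// Returns the policy name that is in effect afterwards.
std::string select_scheduler_fallback(const char *policy)
{
    // overwrite = 0 keeps an existing value, so a setting given on the
    // command line (e.g. STARPU_SCHED=dmda ./a.out ...) still wins.
    setenv("STARPU_SCHED", policy, 0);
    const char *value = std::getenv("STARPU_SCHED");
    return value ? std::string(value) : std::string();
}
```

Calling select_scheduler_fallback("lws") at the top of main would then select lws only when STARPU_SCHED is not already set in the environment.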

nfurmento commented 1 year ago

Could you please send the latest backtrace you get with the failure?

Muxas commented 1 year ago

Sure, it is here: https://gist.github.com/Muxas/96067b796087a49b7a91af3b89fa7064#file-backtrace-of-all-threads

nfurmento commented 1 year ago

I was able to reproduce the bug, and @sthibaul fixed it. It is working for me now. Could you please try it and let me know whether it also works for you?

nfurmento commented 1 year ago

You need to pull the latest commits from either the StarPU master branch or the 1.4 branch.

Muxas commented 1 year ago

I tried your new commit 449dc7d9ef187bb5da8fab17914891fec6f6cda8 from https://gitlab.inria.fr/starpu/starpu.git and so far it works correctly! Thank you!
