starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1
63 stars 12 forks source link

Assertion failed: (0 && "!worker->state_sched_op_pending") #6

Closed Muxas closed 1 year ago

Muxas commented 1 year ago

Dear StarPU team,

I experience some internal StarPU assertion failure. The long error message is the following: Assertion failed: (0 && "!worker->state_sched_op_pending"), function __starpu_datawizard_progress, file datawizard.c, line 130. Unfortunately, I cannot put here any minimal working example to repeat the problem easily. However, I will describe as much as possible.

From the last point of the above mentioned list, it is naive to suspect that adding tasks to the DAG in the main thread is not safe for worker threads. Further I show some timings to understand how large my DAG is:

The first block shows normal task execution (kernels are enabled). The main thread submits all tasks in between pausing and resuming worker threads:

STARPU_NCPU=3 python3.11 wrappers/python/tests/model/test_deep_linear.py
[starpu][mbpro-skoltech-muxas.local][starpu_initialize] Warning: StarPU was configured with --enable-debug (-O0), and is thus not optimized
[starpu][mbpro-skoltech-muxas.local][starpu_initialize] Warning: StarPU was configured with --enable-spinlock-check, which slows down a bit
[starpu][mbpro-skoltech-muxas.local][_starpu_init_topology] Warning: there are several kinds of CPU on this system. For now StarPU assumes all CPU are equal
[starpu][mbpro-skoltech-muxas.local][_starpu_init_topology] Warning: could not get current CPU binding: Function not implemented
[starpu][mbpro-skoltech-muxas.local][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
Initialized NCPU=2 NCUDA=0
Finish generating in 0.0550537109375 seconds
Finish random weights init in 0.0007817745208740234 seconds
Finish adding tasks in 14.299952030181885 seconds
Done in 34.322351932525635 seconds

The second block shows fake task execution (kernels are disabled). The main thread submits all tasks in between pausing and resuming worker threads:

STARPU_NCPU=3 STARPU_DISABLE_KERNELS=1 python3.11 wrappers/python/tests/model/test_deep_linear.py
[starpu][mbpro-skoltech-muxas.local][starpu_initialize] Warning: StarPU was configured with --enable-debug (-O0), and is thus not optimized
[starpu][mbpro-skoltech-muxas.local][starpu_initialize] Warning: StarPU was configured with --enable-spinlock-check, which slows down a bit
[starpu][mbpro-skoltech-muxas.local][_starpu_init_topology] Warning: there are several kinds of CPU on this system. For now StarPU assumes all CPU are equal
[starpu][mbpro-skoltech-muxas.local][_starpu_init_topology] Warning: could not get current CPU binding: Function not implemented
[starpu][mbpro-skoltech-muxas.local][_starpu_mpi_print_thread_level_support] MPI_Init_thread level = MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
Initialized NCPU=2 NCUDA=0
Finish generating in 0.060636281967163086 seconds
Finish random weights init in 0.0018548965454101562 seconds
Finish adding tasks in 14.241771936416626 seconds
Done in 33.04393792152405 seconds

As can be seen, real computations within kernels are very fast, while all the work is with push and pop operations of the DAG. Adding all the task to the DAG requires around 14 seconds of the main thread! I hope this info helps you find the root of the problem. But if you need more info, I would be happy to share with you.

sthibaul commented 1 year ago

Assertion failed: (0 && "!worker->state_sched_op_pending"), function __starpu_datawizard_progress, file datawizard.c, line 130.

Could you get a backtrace? So we know the context of this assertion failure

Muxas commented 1 year ago

Sure, here it is:

(lldb) thread backtrace all
  thread #2, queue = 'com.apple.main-thread'
    frame #0: 0x0000000102ac6c7c libstarpu-1.4.1.dylib`_starpu_fxt_get_job_id at fxt.h:284:2
    frame #1: 0x0000000102ac8b00 libstarpu-1.4.1.dylib`_starpu_job_create(task=0x000000010012d400) at jobs.c:113:17
    frame #2: 0x0000000102ac8e34 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task_slow(task=0x000000010012d400, job=0x0000000000000000) at jobs.c:159:10
    frame #3: 0x0000000102b931f4 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task(task=0x000000010012d400) at task.h:67:9
    frame #4: 0x0000000102b947dc libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback_acquired=0x0000000000000000, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1, pre_sync_jobid=0x0000000000000000, post_sync_jobid=0x0000000000000000, prio=0) at user_interactions.c:234:18
    frame #5: 0x0000000102b94c4c libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_quick(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1) at user_interactions.c:290:9
    frame #6: 0x0000000102b96b30 libstarpu-1.4.1.dylib`starpu_data_wont_use(handle=0x000000012d80d400) at user_interactions.c:733:2
    frame #7: 0x0000000100dca9d0 nntile_core.so`___lldb_unnamed_symbol2620 + 48
    frame #8: 0x0000000100dd7010 nntile_core.so`___lldb_unnamed_symbol3161 + 120
    frame #9: 0x0000000100dff4ac nntile_core.so`___lldb_unnamed_symbol4970 + 68
    frame #10: 0x0000000100df4870 nntile_core.so`___lldb_unnamed_symbol4548 + 68
    frame #11: 0x0000000100de8d24 nntile_core.so`___lldb_unnamed_symbol3965 + 160
    frame #12: 0x0000000100de8d90 nntile_core.so`___lldb_unnamed_symbol3967 + 24
    frame #13: 0x0000000100db0ef4 nntile_core.so`___lldb_unnamed_symbol1837 + 3580
    frame #14: 0x00000001006b8abc Python`cfunction_call + 60
    frame #15: 0x0000000100664f3c Python`_PyObject_MakeTpCall + 128
    frame #16: 0x000000010075e1c8 Python`_PyEval_EvalFrameDefault + 48236
    frame #17: 0x0000000100751868 Python`PyEval_EvalCode + 280
    frame #18: 0x00000001007b3dec Python`run_eval_code_obj + 84
    frame #19: 0x00000001007b3d50 Python`run_mod + 112
    frame #20: 0x00000001007b39f4 Python`pyrun_file + 148
    frame #21: 0x00000001007b3328 Python`_PyRun_SimpleFileObject + 268
    frame #22: 0x00000001007b29bc Python`_PyRun_AnyFileObject + 232
    frame #23: 0x00000001007d677c Python`pymain_run_file_obj + 220
    frame #24: 0x00000001007d5f04 Python`pymain_run_file + 72
    frame #25: 0x00000001007d579c Python`Py_RunMain + 932
    frame #26: 0x00000001007d6b64 Python`Py_BytesMain + 40
    frame #27: 0x000000019713be50 dyld`start + 2544
  thread #3
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #4
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #5
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #6
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #7
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #8
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #9
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #10
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #11
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
* thread #12, name = 'CPU 0', stop reason = hit program assert
    frame #0: 0x000000019742e868 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000197465cec libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000019739e2c8 libsystem_c.dylib`abort + 180
    frame #3: 0x000000019739d620 libsystem_c.dylib`__assert_rtn + 272
  * frame #4: 0x0000000102b79338 libstarpu-1.4.1.dylib`__starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC, push_requests=1) at datawizard.c:130:2
    frame #5: 0x0000000102b794a8 libstarpu-1.4.1.dylib`_starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC) at datawizard.c:159:2
    frame #6: 0x0000000102b90b70 libstarpu-1.4.1.dylib`_starpu_allocate_interface(handle=0x000000012d814200, replicate=0x000000012d814358, dst_node=0, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1604:3
    frame #7: 0x0000000102b910d4 libstarpu-1.4.1.dylib`_starpu_allocate_memory_on_node(handle=0x000000012d814200, replicate=0x000000012d814358, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1665:21
    frame #8: 0x0000000102b70044 libstarpu-1.4.1.dylib`_starpu_create_request_to_fetch_data(handle=0x000000012d814200, dst_replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:660:8
    frame #9: 0x0000000102b70bc0 libstarpu-1.4.1.dylib`_starpu_fetch_data_on_node(handle=0x000000012d814200, node=0, dst_replicate=0x000000012d814358, mode=STARPU_W, detached=1, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:874:6
    frame #10: 0x0000000102b70d7c libstarpu-1.4.1.dylib`task_prefetch_data_on_node(handle=0x000000012d814200, node=0, replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, prio=0) at coherency.c:897:9
    frame #11: 0x0000000102b7173c libstarpu-1.4.1.dylib`_starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0, prefetch=STARPU_PREFETCH) at coherency.c:1020:4
    frame #12: 0x0000000102b7181c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0) at coherency.c:1033:9
    frame #13: 0x0000000102b71b08 libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for_prio(task=0x00000001687af9d0, worker=0, prio=0) at coherency.c:1070:9
    frame #14: 0x0000000102b71b9c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for(task=0x00000001687af9d0, worker=0) at coherency.c:1078:9
    frame #15: 0x0000000102b48548 libstarpu-1.4.1.dylib`push_task_on_best_worker(task=0x00000001687af9d0, best_workerid=0, predicted=2.0157669999999999, predicted_transfer=0, prio=0, sched_ctx_id=0) at deque_modeling_policy_data_aware.c:373:3
    frame #16: 0x0000000102b49a0c libstarpu-1.4.1.dylib`_dmda_push_task(task=0x00000001687af9d0, prio=0, sched_ctx_id=0, da=1, simulate=0, sorted_decision=0) at deque_modeling_policy_data_aware.c:755:10
    frame #17: 0x0000000102b49cc8 libstarpu-1.4.1.dylib`dmda_push_task(task=0x00000001687af9d0) at deque_modeling_policy_data_aware.c:789:9
    frame #18: 0x0000000102b2d2b8 libstarpu-1.4.1.dylib`_starpu_push_task_to_workers(task=0x00000001687af9d0) at sched_policy.c:778:11
    frame #19: 0x0000000102b2ca58 libstarpu-1.4.1.dylib`_starpu_repush_task(j=0x0000000169c63c00) at sched_policy.c:650:8
    frame #20: 0x0000000102b2c010 libstarpu-1.4.1.dylib`_starpu_push_task(j=0x0000000169c63c00) at sched_policy.c:544:9
    frame #21: 0x0000000102acbb10 libstarpu-1.4.1.dylib`_starpu_enforce_deps_starting_from_task(j=0x0000000169c63c00) at jobs.c:991:8
    frame #22: 0x0000000102af7bd0 libstarpu-1.4.1.dylib`_starpu_notify_cg(pred=0x0000000169c61e00, cg=0x000060001739a800) at cg.c:277:6
    frame #23: 0x0000000102af8084 libstarpu-1.4.1.dylib`_starpu_notify_cg_list(pred=0x0000000169c61e00, successors=0x0000000169c62020) at cg.c:377:3
    frame #24: 0x0000000102b021d0 libstarpu-1.4.1.dylib`_starpu_notify_task_dependencies(j=0x0000000169c61e00) at task_deps.c:66:2
    frame #25: 0x0000000102af8510 libstarpu-1.4.1.dylib`_starpu_notify_dependencies(j=0x0000000169c61e00) at dependencies.c:32:2
    frame #26: 0x0000000102acaa58 libstarpu-1.4.1.dylib`_starpu_handle_job_termination(j=0x0000000169c61e00) at jobs.c:542:3
    frame #27: 0x0000000102c1ff7c libstarpu-1.4.1.dylib`_starpu_cpu_driver_execute_task(cpu_worker=0x0000000102cdc748, task=0x00000001687aeee0, j=0x0000000169c61e00) at driver_cpu.c:576:3
    frame #28: 0x0000000102c20138 libstarpu-1.4.1.dylib`_starpu_cpu_driver_run_once(cpu_worker=0x0000000102cdc748) at driver_cpu.c:614:9
    frame #29: 0x0000000102c20890 libstarpu-1.4.1.dylib`_starpu_cpu_worker(arg=0x0000000102cdc748) at driver_cpu.c:732:3
    frame #30: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #13, name = 'MPI'
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000100ed8c34 libstarpumpi-1.4.1.dylib`_starpu_mpi_progress_thread_func(arg=0x0000600000c00060) at starpu_mpi_mpi.c:1347:4
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #14
    frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
  thread #15
    frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
  thread #16
    frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
    frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
    frame #2: 0x000000011a881824 libpmpi.12.dylib`sock_conn_listener_thread + 100
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #17
    frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
    frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
    frame #2: 0x000000011a875328 libpmpi.12.dylib`sock_ep_cm_thread + 484
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
Muxas commented 1 year ago

Update: I have experimented with the same code, but different schedulers. Normally, I rely on dmda scheduler to take into account performance model. I must admit, that using STARPU_SCHED=eager or STARPU_SCHED=lws does not raise the issue. Whereas schedulers dmda, dmdas and dmdap lead to the problem.

Muxas commented 1 year ago

Another update: setting STARPU_SCHED=modular-heft printed these debug info just before the assertion error:

[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?
[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?

Hope this will help you somehow

oaumage commented 1 year ago

Sure, here it is:

(lldb) thread backtrace all
  thread #2, queue = 'com.apple.main-thread'
    frame #0: 0x0000000102ac6c7c libstarpu-1.4.1.dylib`_starpu_fxt_get_job_id at fxt.h:284:2
    frame #1: 0x0000000102ac8b00 libstarpu-1.4.1.dylib`_starpu_job_create(task=0x000000010012d400) at jobs.c:113:17
    frame #2: 0x0000000102ac8e34 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task_slow(task=0x000000010012d400, job=0x0000000000000000) at jobs.c:159:10
    frame #3: 0x0000000102b931f4 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task(task=0x000000010012d400) at task.h:67:9
    frame #4: 0x0000000102b947dc libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback_acquired=0x0000000000000000, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1, pre_sync_jobid=0x0000000000000000, post_sync_jobid=0x0000000000000000, prio=0) at user_interactions.c:234:18
    frame #5: 0x0000000102b94c4c libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_quick(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1) at user_interactions.c:290:9
    frame #6: 0x0000000102b96b30 libstarpu-1.4.1.dylib`starpu_data_wont_use(handle=0x000000012d80d400) at user_interactions.c:733:2
    frame #7: 0x0000000100dca9d0 nntile_core.so`___lldb_unnamed_symbol2620 + 48
    frame #8: 0x0000000100dd7010 nntile_core.so`___lldb_unnamed_symbol3161 + 120
    frame #9: 0x0000000100dff4ac nntile_core.so`___lldb_unnamed_symbol4970 + 68
    frame #10: 0x0000000100df4870 nntile_core.so`___lldb_unnamed_symbol4548 + 68
    frame #11: 0x0000000100de8d24 nntile_core.so`___lldb_unnamed_symbol3965 + 160
    frame #12: 0x0000000100de8d90 nntile_core.so`___lldb_unnamed_symbol3967 + 24
    frame #13: 0x0000000100db0ef4 nntile_core.so`___lldb_unnamed_symbol1837 + 3580
    frame #14: 0x00000001006b8abc Python`cfunction_call + 60
    frame #15: 0x0000000100664f3c Python`_PyObject_MakeTpCall + 128
    frame #16: 0x000000010075e1c8 Python`_PyEval_EvalFrameDefault + 48236
    frame #17: 0x0000000100751868 Python`PyEval_EvalCode + 280
    frame #18: 0x00000001007b3dec Python`run_eval_code_obj + 84
    frame #19: 0x00000001007b3d50 Python`run_mod + 112
    frame #20: 0x00000001007b39f4 Python`pyrun_file + 148
    frame #21: 0x00000001007b3328 Python`_PyRun_SimpleFileObject + 268
    frame #22: 0x00000001007b29bc Python`_PyRun_AnyFileObject + 232
    frame #23: 0x00000001007d677c Python`pymain_run_file_obj + 220
    frame #24: 0x00000001007d5f04 Python`pymain_run_file + 72
    frame #25: 0x00000001007d579c Python`Py_RunMain + 932
    frame #26: 0x00000001007d6b64 Python`Py_BytesMain + 40
    frame #27: 0x000000019713be50 dyld`start + 2544
  thread #3
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #4
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #5
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #6
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #7
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #8
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #9
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #10
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #11
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
* thread #12, name = 'CPU 0', stop reason = hit program assert
    frame #0: 0x000000019742e868 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000197465cec libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000019739e2c8 libsystem_c.dylib`abort + 180
    frame #3: 0x000000019739d620 libsystem_c.dylib`__assert_rtn + 272
  * frame #4: 0x0000000102b79338 libstarpu-1.4.1.dylib`__starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC, push_requests=1) at datawizard.c:130:2
    frame #5: 0x0000000102b794a8 libstarpu-1.4.1.dylib`_starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC) at datawizard.c:159:2
    frame #6: 0x0000000102b90b70 libstarpu-1.4.1.dylib`_starpu_allocate_interface(handle=0x000000012d814200, replicate=0x000000012d814358, dst_node=0, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1604:3
    frame #7: 0x0000000102b910d4 libstarpu-1.4.1.dylib`_starpu_allocate_memory_on_node(handle=0x000000012d814200, replicate=0x000000012d814358, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1665:21
    frame #8: 0x0000000102b70044 libstarpu-1.4.1.dylib`_starpu_create_request_to_fetch_data(handle=0x000000012d814200, dst_replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:660:8
    frame #9: 0x0000000102b70bc0 libstarpu-1.4.1.dylib`_starpu_fetch_data_on_node(handle=0x000000012d814200, node=0, dst_replicate=0x000000012d814358, mode=STARPU_W, detached=1, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:874:6
    frame #10: 0x0000000102b70d7c libstarpu-1.4.1.dylib`task_prefetch_data_on_node(handle=0x000000012d814200, node=0, replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, prio=0) at coherency.c:897:9
    frame #11: 0x0000000102b7173c libstarpu-1.4.1.dylib`_starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0, prefetch=STARPU_PREFETCH) at coherency.c:1020:4
    frame #12: 0x0000000102b7181c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0) at coherency.c:1033:9
    frame #13: 0x0000000102b71b08 libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for_prio(task=0x00000001687af9d0, worker=0, prio=0) at coherency.c:1070:9
    frame #14: 0x0000000102b71b9c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for(task=0x00000001687af9d0, worker=0) at coherency.c:1078:9
    frame #15: 0x0000000102b48548 libstarpu-1.4.1.dylib`push_task_on_best_worker(task=0x00000001687af9d0, best_workerid=0, predicted=2.0157669999999999, predicted_transfer=0, prio=0, sched_ctx_id=0) at deque_modeling_policy_data_aware.c:373:3
    frame #16: 0x0000000102b49a0c libstarpu-1.4.1.dylib`_dmda_push_task(task=0x00000001687af9d0, prio=0, sched_ctx_id=0, da=1, simulate=0, sorted_decision=0) at deque_modeling_policy_data_aware.c:755:10
    frame #17: 0x0000000102b49cc8 libstarpu-1.4.1.dylib`dmda_push_task(task=0x00000001687af9d0) at deque_modeling_policy_data_aware.c:789:9
    frame #18: 0x0000000102b2d2b8 libstarpu-1.4.1.dylib`_starpu_push_task_to_workers(task=0x00000001687af9d0) at sched_policy.c:778:11
    frame #19: 0x0000000102b2ca58 libstarpu-1.4.1.dylib`_starpu_repush_task(j=0x0000000169c63c00) at sched_policy.c:650:8
    frame #20: 0x0000000102b2c010 libstarpu-1.4.1.dylib`_starpu_push_task(j=0x0000000169c63c00) at sched_policy.c:544:9
    frame #21: 0x0000000102acbb10 libstarpu-1.4.1.dylib`_starpu_enforce_deps_starting_from_task(j=0x0000000169c63c00) at jobs.c:991:8
    frame #22: 0x0000000102af7bd0 libstarpu-1.4.1.dylib`_starpu_notify_cg(pred=0x0000000169c61e00, cg=0x000060001739a800) at cg.c:277:6
    frame #23: 0x0000000102af8084 libstarpu-1.4.1.dylib`_starpu_notify_cg_list(pred=0x0000000169c61e00, successors=0x0000000169c62020) at cg.c:377:3
    frame #24: 0x0000000102b021d0 libstarpu-1.4.1.dylib`_starpu_notify_task_dependencies(j=0x0000000169c61e00) at task_deps.c:66:2
    frame #25: 0x0000000102af8510 libstarpu-1.4.1.dylib`_starpu_notify_dependencies(j=0x0000000169c61e00) at dependencies.c:32:2
    frame #26: 0x0000000102acaa58 libstarpu-1.4.1.dylib`_starpu_handle_job_termination(j=0x0000000169c61e00) at jobs.c:542:3
    frame #27: 0x0000000102c1ff7c libstarpu-1.4.1.dylib`_starpu_cpu_driver_execute_task(cpu_worker=0x0000000102cdc748, task=0x00000001687aeee0, j=0x0000000169c61e00) at driver_cpu.c:576:3
    frame #28: 0x0000000102c20138 libstarpu-1.4.1.dylib`_starpu_cpu_driver_run_once(cpu_worker=0x0000000102cdc748) at driver_cpu.c:614:9
    frame #29: 0x0000000102c20890 libstarpu-1.4.1.dylib`_starpu_cpu_worker(arg=0x0000000102cdc748) at driver_cpu.c:732:3
    frame #30: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #13, name = 'MPI'
    frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x0000000100ed8c34 libstarpumpi-1.4.1.dylib`_starpu_mpi_progress_thread_func(arg=0x0000600000c00060) at starpu_mpi_mpi.c:1347:4
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #14
    frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
  thread #15
    frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
  thread #16
    frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
    frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
    frame #2: 0x000000011a881824 libpmpi.12.dylib`sock_conn_listener_thread + 100
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
  thread #17
    frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
    frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
    frame #2: 0x000000011a875328 libpmpi.12.dylib`sock_ep_cm_thread + 484
    frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148

Hi,

Thanks for the backtrace. I have pushed a fix through commit 11cec178a3e48fc821fa6935f2538c387e45774f to the master branch. It turns out the assert was excessive in this situation.

Best regards,

Olivier Aumage

Muxas commented 1 year ago

I have tested commit 11cec178a3e48fc821fa6935f2538c387e45774f and can confirm that it now does not stop program execution and results are stable. I am closing the issue, as it is resolved. Thank you!

sthibaul commented 1 year ago

Another update: setting STARPU_SCHED=modular-heft printed these debug info just before the assertion error:

[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?
[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?

Hope this will help you somehow

It was unrelated, but I have fixed it now, thanks!