Closed Muxas closed 1 year ago
Assertion failed: (0 && "!worker->state_sched_op_pending"), function __starpu_datawizard_progress, file datawizard.c, line 130.
Could you get a backtrace? So we know the context of this assertion failure
Sure, here it is:
(lldb) thread backtrace all
thread #2, queue = 'com.apple.main-thread'
frame #0: 0x0000000102ac6c7c libstarpu-1.4.1.dylib`_starpu_fxt_get_job_id at fxt.h:284:2
frame #1: 0x0000000102ac8b00 libstarpu-1.4.1.dylib`_starpu_job_create(task=0x000000010012d400) at jobs.c:113:17
frame #2: 0x0000000102ac8e34 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task_slow(task=0x000000010012d400, job=0x0000000000000000) at jobs.c:159:10
frame #3: 0x0000000102b931f4 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task(task=0x000000010012d400) at task.h:67:9
frame #4: 0x0000000102b947dc libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback_acquired=0x0000000000000000, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1, pre_sync_jobid=0x0000000000000000, post_sync_jobid=0x0000000000000000, prio=0) at user_interactions.c:234:18
frame #5: 0x0000000102b94c4c libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_quick(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1) at user_interactions.c:290:9
frame #6: 0x0000000102b96b30 libstarpu-1.4.1.dylib`starpu_data_wont_use(handle=0x000000012d80d400) at user_interactions.c:733:2
frame #7: 0x0000000100dca9d0 nntile_core.so`___lldb_unnamed_symbol2620 + 48
frame #8: 0x0000000100dd7010 nntile_core.so`___lldb_unnamed_symbol3161 + 120
frame #9: 0x0000000100dff4ac nntile_core.so`___lldb_unnamed_symbol4970 + 68
frame #10: 0x0000000100df4870 nntile_core.so`___lldb_unnamed_symbol4548 + 68
frame #11: 0x0000000100de8d24 nntile_core.so`___lldb_unnamed_symbol3965 + 160
frame #12: 0x0000000100de8d90 nntile_core.so`___lldb_unnamed_symbol3967 + 24
frame #13: 0x0000000100db0ef4 nntile_core.so`___lldb_unnamed_symbol1837 + 3580
frame #14: 0x00000001006b8abc Python`cfunction_call + 60
frame #15: 0x0000000100664f3c Python`_PyObject_MakeTpCall + 128
frame #16: 0x000000010075e1c8 Python`_PyEval_EvalFrameDefault + 48236
frame #17: 0x0000000100751868 Python`PyEval_EvalCode + 280
frame #18: 0x00000001007b3dec Python`run_eval_code_obj + 84
frame #19: 0x00000001007b3d50 Python`run_mod + 112
frame #20: 0x00000001007b39f4 Python`pyrun_file + 148
frame #21: 0x00000001007b3328 Python`_PyRun_SimpleFileObject + 268
frame #22: 0x00000001007b29bc Python`_PyRun_AnyFileObject + 232
frame #23: 0x00000001007d677c Python`pymain_run_file_obj + 220
frame #24: 0x00000001007d5f04 Python`pymain_run_file + 72
frame #25: 0x00000001007d579c Python`Py_RunMain + 932
frame #26: 0x00000001007d6b64 Python`Py_BytesMain + 40
frame #27: 0x000000019713be50 dyld`start + 2544
thread #3
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #4
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #5
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #6
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #7
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #8
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #9
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #10
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #11
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
* thread #12, name = 'CPU 0', stop reason = hit program assert
frame #0: 0x000000019742e868 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x0000000197465cec libsystem_pthread.dylib`pthread_kill + 288
frame #2: 0x000000019739e2c8 libsystem_c.dylib`abort + 180
frame #3: 0x000000019739d620 libsystem_c.dylib`__assert_rtn + 272
* frame #4: 0x0000000102b79338 libstarpu-1.4.1.dylib`__starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC, push_requests=1) at datawizard.c:130:2
frame #5: 0x0000000102b794a8 libstarpu-1.4.1.dylib`_starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC) at datawizard.c:159:2
frame #6: 0x0000000102b90b70 libstarpu-1.4.1.dylib`_starpu_allocate_interface(handle=0x000000012d814200, replicate=0x000000012d814358, dst_node=0, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1604:3
frame #7: 0x0000000102b910d4 libstarpu-1.4.1.dylib`_starpu_allocate_memory_on_node(handle=0x000000012d814200, replicate=0x000000012d814358, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1665:21
frame #8: 0x0000000102b70044 libstarpu-1.4.1.dylib`_starpu_create_request_to_fetch_data(handle=0x000000012d814200, dst_replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:660:8
frame #9: 0x0000000102b70bc0 libstarpu-1.4.1.dylib`_starpu_fetch_data_on_node(handle=0x000000012d814200, node=0, dst_replicate=0x000000012d814358, mode=STARPU_W, detached=1, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:874:6
frame #10: 0x0000000102b70d7c libstarpu-1.4.1.dylib`task_prefetch_data_on_node(handle=0x000000012d814200, node=0, replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, prio=0) at coherency.c:897:9
frame #11: 0x0000000102b7173c libstarpu-1.4.1.dylib`_starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0, prefetch=STARPU_PREFETCH) at coherency.c:1020:4
frame #12: 0x0000000102b7181c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0) at coherency.c:1033:9
frame #13: 0x0000000102b71b08 libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for_prio(task=0x00000001687af9d0, worker=0, prio=0) at coherency.c:1070:9
frame #14: 0x0000000102b71b9c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for(task=0x00000001687af9d0, worker=0) at coherency.c:1078:9
frame #15: 0x0000000102b48548 libstarpu-1.4.1.dylib`push_task_on_best_worker(task=0x00000001687af9d0, best_workerid=0, predicted=2.0157669999999999, predicted_transfer=0, prio=0, sched_ctx_id=0) at deque_modeling_policy_data_aware.c:373:3
frame #16: 0x0000000102b49a0c libstarpu-1.4.1.dylib`_dmda_push_task(task=0x00000001687af9d0, prio=0, sched_ctx_id=0, da=1, simulate=0, sorted_decision=0) at deque_modeling_policy_data_aware.c:755:10
frame #17: 0x0000000102b49cc8 libstarpu-1.4.1.dylib`dmda_push_task(task=0x00000001687af9d0) at deque_modeling_policy_data_aware.c:789:9
frame #18: 0x0000000102b2d2b8 libstarpu-1.4.1.dylib`_starpu_push_task_to_workers(task=0x00000001687af9d0) at sched_policy.c:778:11
frame #19: 0x0000000102b2ca58 libstarpu-1.4.1.dylib`_starpu_repush_task(j=0x0000000169c63c00) at sched_policy.c:650:8
frame #20: 0x0000000102b2c010 libstarpu-1.4.1.dylib`_starpu_push_task(j=0x0000000169c63c00) at sched_policy.c:544:9
frame #21: 0x0000000102acbb10 libstarpu-1.4.1.dylib`_starpu_enforce_deps_starting_from_task(j=0x0000000169c63c00) at jobs.c:991:8
frame #22: 0x0000000102af7bd0 libstarpu-1.4.1.dylib`_starpu_notify_cg(pred=0x0000000169c61e00, cg=0x000060001739a800) at cg.c:277:6
frame #23: 0x0000000102af8084 libstarpu-1.4.1.dylib`_starpu_notify_cg_list(pred=0x0000000169c61e00, successors=0x0000000169c62020) at cg.c:377:3
frame #24: 0x0000000102b021d0 libstarpu-1.4.1.dylib`_starpu_notify_task_dependencies(j=0x0000000169c61e00) at task_deps.c:66:2
frame #25: 0x0000000102af8510 libstarpu-1.4.1.dylib`_starpu_notify_dependencies(j=0x0000000169c61e00) at dependencies.c:32:2
frame #26: 0x0000000102acaa58 libstarpu-1.4.1.dylib`_starpu_handle_job_termination(j=0x0000000169c61e00) at jobs.c:542:3
frame #27: 0x0000000102c1ff7c libstarpu-1.4.1.dylib`_starpu_cpu_driver_execute_task(cpu_worker=0x0000000102cdc748, task=0x00000001687aeee0, j=0x0000000169c61e00) at driver_cpu.c:576:3
frame #28: 0x0000000102c20138 libstarpu-1.4.1.dylib`_starpu_cpu_driver_run_once(cpu_worker=0x0000000102cdc748) at driver_cpu.c:614:9
frame #29: 0x0000000102c20890 libstarpu-1.4.1.dylib`_starpu_cpu_worker(arg=0x0000000102cdc748) at driver_cpu.c:732:3
frame #30: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #13, name = 'MPI'
frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8
frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
frame #2: 0x0000000100ed8c34 libstarpumpi-1.4.1.dylib`_starpu_mpi_progress_thread_func(arg=0x0000600000c00060) at starpu_mpi_mpi.c:1347:4
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #14
frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
thread #15
frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread
thread #16
frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
frame #2: 0x000000011a881824 libpmpi.12.dylib`sock_conn_listener_thread + 100
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
thread #17
frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8
frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252
frame #2: 0x000000011a875328 libpmpi.12.dylib`sock_ep_cm_thread + 484
frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
Update: I have experimented with the same code, but different schedulers. Normally, I rely on dmda
scheduler to take into account performance model. I must admit, that using STARPU_SCHED=eager
or STARPU_SCHED=lws
does not raise the issue. Whereas schedulers dmda
, dmdas
and dmdap
lead to the problem.
Another update: setting STARPU_SCHED=modular-heft
printed these debug info just before the assertion error:
[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?
[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?
Hope this will help you somehow
Sure, here it is:
(lldb) thread backtrace all thread #2, queue = 'com.apple.main-thread' frame #0: 0x0000000102ac6c7c libstarpu-1.4.1.dylib`_starpu_fxt_get_job_id at fxt.h:284:2 frame #1: 0x0000000102ac8b00 libstarpu-1.4.1.dylib`_starpu_job_create(task=0x000000010012d400) at jobs.c:113:17 frame #2: 0x0000000102ac8e34 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task_slow(task=0x000000010012d400, job=0x0000000000000000) at jobs.c:159:10 frame #3: 0x0000000102b931f4 libstarpu-1.4.1.dylib`_starpu_get_job_associated_to_task(task=0x000000010012d400) at task.h:67:9 frame #4: 0x0000000102b947dc libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback_acquired=0x0000000000000000, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1, pre_sync_jobid=0x0000000000000000, post_sync_jobid=0x0000000000000000, prio=0) at user_interactions.c:234:18 frame #5: 0x0000000102b94c4c libstarpu-1.4.1.dylib`starpu_data_acquire_on_node_cb_sequential_consistency_quick(handle=0x000000012d80d400, node=-2, mode=STARPU_R, callback=(libstarpu-1.4.1.dylib`_starpu_data_wont_use at user_interactions.c:655:1), arg=0x000000012d80d400, sequential_consistency=1, quick=1) at user_interactions.c:290:9 frame #6: 0x0000000102b96b30 libstarpu-1.4.1.dylib`starpu_data_wont_use(handle=0x000000012d80d400) at user_interactions.c:733:2 frame #7: 0x0000000100dca9d0 nntile_core.so`___lldb_unnamed_symbol2620 + 48 frame #8: 0x0000000100dd7010 nntile_core.so`___lldb_unnamed_symbol3161 + 120 frame #9: 0x0000000100dff4ac nntile_core.so`___lldb_unnamed_symbol4970 + 68 frame #10: 0x0000000100df4870 nntile_core.so`___lldb_unnamed_symbol4548 + 68 frame #11: 0x0000000100de8d24 nntile_core.so`___lldb_unnamed_symbol3965 + 160 frame #12: 0x0000000100de8d90 nntile_core.so`___lldb_unnamed_symbol3967 + 24 frame #13: 0x0000000100db0ef4 nntile_core.so`___lldb_unnamed_symbol1837 + 3580 frame #14: 0x00000001006b8abc Python`cfunction_call + 60 frame #15: 0x0000000100664f3c Python`_PyObject_MakeTpCall + 128 frame #16: 0x000000010075e1c8 Python`_PyEval_EvalFrameDefault + 48236 frame #17: 0x0000000100751868 Python`PyEval_EvalCode + 280 frame #18: 0x00000001007b3dec Python`run_eval_code_obj + 84 frame #19: 0x00000001007b3d50 Python`run_mod + 112 frame #20: 0x00000001007b39f4 Python`pyrun_file + 148 frame #21: 0x00000001007b3328 Python`_PyRun_SimpleFileObject + 268 frame #22: 0x00000001007b29bc Python`_PyRun_AnyFileObject + 232 frame #23: 0x00000001007d677c Python`pymain_run_file_obj + 220 frame #24: 0x00000001007d5f04 Python`pymain_run_file + 72 frame #25: 0x00000001007d579c Python`Py_RunMain + 932 frame #26: 0x00000001007d6b64 Python`Py_BytesMain + 40 frame #27: 0x000000019713be50 dyld`start + 2544 thread #3 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #4 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #5 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #6 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #7 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #8 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #9 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #10 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #11 frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000105e0c394 libopenblas64_.0.dylib`blas_thread_server + 360 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 * thread #12, name = 'CPU 0', stop reason = hit program assert frame #0: 0x000000019742e868 libsystem_kernel.dylib`__pthread_kill + 8 frame #1: 0x0000000197465cec libsystem_pthread.dylib`pthread_kill + 288 frame #2: 0x000000019739e2c8 libsystem_c.dylib`abort + 180 frame #3: 0x000000019739d620 libsystem_c.dylib`__assert_rtn + 272 * frame #4: 0x0000000102b79338 libstarpu-1.4.1.dylib`__starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC, push_requests=1) at datawizard.c:130:2 frame #5: 0x0000000102b794a8 libstarpu-1.4.1.dylib`_starpu_datawizard_progress(may_alloc=_STARPU_DATAWIZARD_DO_NOT_ALLOC) at datawizard.c:159:2 frame #6: 0x0000000102b90b70 libstarpu-1.4.1.dylib`_starpu_allocate_interface(handle=0x000000012d814200, replicate=0x000000012d814358, dst_node=0, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1604:3 frame #7: 0x0000000102b910d4 libstarpu-1.4.1.dylib`_starpu_allocate_memory_on_node(handle=0x000000012d814200, replicate=0x000000012d814358, is_prefetch=STARPU_TASK_PREFETCH, only_fast_alloc=0) at memalloc.c:1665:21 frame #8: 0x0000000102b70044 libstarpu-1.4.1.dylib`_starpu_create_request_to_fetch_data(handle=0x000000012d814200, dst_replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:660:8 frame #9: 0x0000000102b70bc0 libstarpu-1.4.1.dylib`_starpu_fetch_data_on_node(handle=0x000000012d814200, node=0, dst_replicate=0x000000012d814358, mode=STARPU_W, detached=1, task=0x00000001687af9d0, is_prefetch=STARPU_TASK_PREFETCH, async=1, callback_func=0x0000000000000000, callback_arg=0x0000000000000000, prio=0, origin="task_prefetch_data_on_node") at coherency.c:874:6 frame #10: 0x0000000102b70d7c libstarpu-1.4.1.dylib`task_prefetch_data_on_node(handle=0x000000012d814200, node=0, replicate=0x000000012d814358, mode=STARPU_W, task=0x00000001687af9d0, prio=0) at coherency.c:897:9 frame #11: 0x0000000102b7173c libstarpu-1.4.1.dylib`_starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0, prefetch=STARPU_PREFETCH) at coherency.c:1020:4 frame #12: 0x0000000102b7181c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_prio(task=0x00000001687af9d0, target_node=-1, worker=0, prio=0) at coherency.c:1033:9 frame #13: 0x0000000102b71b08 libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for_prio(task=0x00000001687af9d0, worker=0, prio=0) at coherency.c:1070:9 frame #14: 0x0000000102b71b9c libstarpu-1.4.1.dylib`starpu_prefetch_task_input_for(task=0x00000001687af9d0, worker=0) at coherency.c:1078:9 frame #15: 0x0000000102b48548 libstarpu-1.4.1.dylib`push_task_on_best_worker(task=0x00000001687af9d0, best_workerid=0, predicted=2.0157669999999999, predicted_transfer=0, prio=0, sched_ctx_id=0) at deque_modeling_policy_data_aware.c:373:3 frame #16: 0x0000000102b49a0c libstarpu-1.4.1.dylib`_dmda_push_task(task=0x00000001687af9d0, prio=0, sched_ctx_id=0, da=1, simulate=0, sorted_decision=0) at deque_modeling_policy_data_aware.c:755:10 frame #17: 0x0000000102b49cc8 libstarpu-1.4.1.dylib`dmda_push_task(task=0x00000001687af9d0) at deque_modeling_policy_data_aware.c:789:9 frame #18: 0x0000000102b2d2b8 libstarpu-1.4.1.dylib`_starpu_push_task_to_workers(task=0x00000001687af9d0) at sched_policy.c:778:11 frame #19: 0x0000000102b2ca58 libstarpu-1.4.1.dylib`_starpu_repush_task(j=0x0000000169c63c00) at sched_policy.c:650:8 frame #20: 0x0000000102b2c010 libstarpu-1.4.1.dylib`_starpu_push_task(j=0x0000000169c63c00) at sched_policy.c:544:9 frame #21: 0x0000000102acbb10 libstarpu-1.4.1.dylib`_starpu_enforce_deps_starting_from_task(j=0x0000000169c63c00) at jobs.c:991:8 frame #22: 0x0000000102af7bd0 libstarpu-1.4.1.dylib`_starpu_notify_cg(pred=0x0000000169c61e00, cg=0x000060001739a800) at cg.c:277:6 frame #23: 0x0000000102af8084 libstarpu-1.4.1.dylib`_starpu_notify_cg_list(pred=0x0000000169c61e00, successors=0x0000000169c62020) at cg.c:377:3 frame #24: 0x0000000102b021d0 libstarpu-1.4.1.dylib`_starpu_notify_task_dependencies(j=0x0000000169c61e00) at task_deps.c:66:2 frame #25: 0x0000000102af8510 libstarpu-1.4.1.dylib`_starpu_notify_dependencies(j=0x0000000169c61e00) at dependencies.c:32:2 frame #26: 0x0000000102acaa58 libstarpu-1.4.1.dylib`_starpu_handle_job_termination(j=0x0000000169c61e00) at jobs.c:542:3 frame #27: 0x0000000102c1ff7c libstarpu-1.4.1.dylib`_starpu_cpu_driver_execute_task(cpu_worker=0x0000000102cdc748, task=0x00000001687aeee0, j=0x0000000169c61e00) at driver_cpu.c:576:3 frame #28: 0x0000000102c20138 libstarpu-1.4.1.dylib`_starpu_cpu_driver_run_once(cpu_worker=0x0000000102cdc748) at driver_cpu.c:614:9 frame #29: 0x0000000102c20890 libstarpu-1.4.1.dylib`_starpu_cpu_worker(arg=0x0000000102cdc748) at driver_cpu.c:732:3 frame #30: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #13, name = 'MPI' frame #0: 0x000000019742984c libsystem_kernel.dylib`__psynch_cvwait + 8 frame #1: 0x0000000197466638 libsystem_pthread.dylib`_pthread_cond_wait + 1232 frame #2: 0x0000000100ed8c34 libstarpumpi-1.4.1.dylib`_starpu_mpi_progress_thread_func(arg=0x0000600000c00060) at starpu_mpi_mpi.c:1347:4 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #14 frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread thread #15 frame #0: 0x0000000197460e18 libsystem_pthread.dylib`start_wqthread thread #16 frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8 frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252 frame #2: 0x000000011a881824 libpmpi.12.dylib`sock_conn_listener_thread + 100 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148 thread #17 frame #0: 0x000000019742e978 libsystem_kernel.dylib`poll + 8 frame #1: 0x000000011a854af4 libpmpi.12.dylib`ofi_pollfds_wait + 252 frame #2: 0x000000011a875328 libpmpi.12.dylib`sock_ep_cm_thread + 484 frame #3: 0x000000019746606c libsystem_pthread.dylib`_pthread_start + 148
Hi,
Thanks for the backtrace. I have pushed a fix through commit 11cec178a3e48fc821fa6935f2538c387e45774f to the master branch. It turns out the assert was excessive in this situation.
Best regards,
Olivier Aumage
I have tested commit 11cec178a3e48fc821fa6935f2538c387e45774f and can confirm that it now does not stop program execution and results are stable. I am closing the issue, as it is resolved. Thank you!
Another update: setting
STARPU_SCHED=modular-heft
printed these debug info just before the assertion error:[starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks? [starpu][...][_starpu_worker_task_list_finished] warning: bogus computation of pipeline_ntasks?
Hope this will help you somehow
It was unrelated, but I have fixed it now, thanks!
Dear StarPU team,
I experience some internal StarPU assertion failure. The long error message is the following:
Assertion failed: (0 && "!worker->state_sched_op_pending"), function __starpu_datawizard_progress, file datawizard.c, line 130.
Unfortunately, I cannot put here any minimal working example to repeat the problem easily. However, I will describe as much as possible.STARPU_NCPU=2
but due to using MPI init routinestarpu_mpi_init_conf
I get only 1 CPU worker, which is confirmed bystarpu_worker_get_count_by_type(STARPU_CPU_WORKER)
function call.starpu_variable_data_register
is used) are generated and filled with data before adding lots of tasks into the pool of tasks.STARPU_DISABLE_KERNELS=1
.From the last point of the above mentioned list, it is naive to suspect that adding tasks to the DAG in the main thread is not safe for worker threads. Further I show some timings to understand how large my DAG is:
The first block shows normal task execution (kernels are enabled). The main thread submits all tasks in between pausing and resuming worker threads:
The second block shows fake task execution (kernels are disabled). The main thread submits all tasks in between pausing and resuming worker threads:
As can be seen, real computations within kernels are very fast, while all the work is with push and pop operations of the DAG. Adding all the task to the DAG requires around 14 seconds of the main thread! I hope this info helps you find the root of the problem. But if you need more info, I would be happy to share with you.