Muxas opened 8 months ago
I have broken this problem down to a simple example:
#include <starpu.h>
#include <stdio.h>
#include <cuda.h>

void copy_func(void *buffers[], void *cl_args)
{
    float *x = (float *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    float *y = (float *)STARPU_VARIABLE_GET_PTR(buffers[1]);
    cudaStream_t stream = starpu_cuda_get_local_stream();
    /* Asynchronous device-to-device copy on StarPU's local CUDA stream */
    cudaMemcpyAsync(y, x, sizeof(float), cudaMemcpyDeviceToDevice, stream);
    printf("copy_func\n");
}

struct starpu_codelet copy_codelet =
{
    .cuda_funcs = {copy_func},
    .cuda_flags = {STARPU_CUDA_ASYNC},
    .modes = {STARPU_R, STARPU_W},
    .nbuffers = 2
};

int main(int argc, char **argv)
{
    float *ptr;
    starpu_data_handle_t x_handle, y_handle;
    starpu_init(NULL);
    /* Let StarPU allocate the variables itself (home node -1) */
    starpu_variable_data_register(&x_handle, -1, 0, sizeof(float));
    starpu_variable_data_register(&y_handle, -1, 0, sizeof(float));
    /* Initialize X on the CPU */
    starpu_data_acquire(x_handle, STARPU_W);
    ptr = (float *)starpu_data_get_local_ptr(x_handle);
    *ptr = 1.0;
    starpu_data_release(x_handle);
    starpu_task_insert(&copy_codelet, STARPU_R, x_handle, STARPU_W, y_handle, 0);
    starpu_task_wait_for_all();
    starpu_data_unregister(x_handle);
    starpu_data_unregister(y_handle);
    starpu_shutdown();
    return 0;
}
Creating X on the CPU and using it on the GPU to copy from X into Y leads to assert errors in the DARTS scheduler. Running this example sometimes completes without errors and sometimes triggers the problem described below:
STARPU_NCPU=1 STARPU_NCUDA=1 STARPU_SCHED=darts ./a.out
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x11c46a)[0x15555501446a]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_push_task+0x2a)[0x15555501166a]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pump_to+0x79)[0x155555011a29]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x13b1d9)[0x1555550331d9]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_send_can_push_to_parents+0x4b)[0x155555012cbb]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x11b99b)[0x15555501399b]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x1555550117c6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x129607)[0x155555021607]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x1555550117c6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x116838)[0x15555500e838]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x1555550117c6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x86e95)[0x155554f7ee95]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0xb76ee)[0x155554faf6ee]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(_starpu_cpu_driver_run_once+0x8c)[0x15555503a4ec]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x142c35)[0x15555503ac35]
/lib64/libpthread.so.0(+0x7ea5)[0x155554ce3ea5]
/lib64/libc.so.6(clone+0x6d)[0x1555544ecb0d]
a.out: ../../src/sched_policies/component_fifo.c:89: fifo_push_local_task: Assertion `0 && "starpu_sched_component_can_execute_task(component,task)"' failed.
Aborted
An even simpler example where DARTS fails with asserts:
#include <starpu.h>
#include <stdio.h>
#include <cuda.h>

void clear_func(void *buffers[], void *cl_args)
{
    float *x = (float *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    cudaStream_t stream = starpu_cuda_get_local_stream();
    /* Asynchronously zero the variable on StarPU's local CUDA stream */
    cudaMemsetAsync(x, 0, sizeof(float), stream);
    printf("clear_func\n");
}

struct starpu_codelet clear_codelet =
{
    .cuda_funcs = {clear_func},
    .cuda_flags = {STARPU_CUDA_ASYNC},
    .modes = {STARPU_W},
    .nbuffers = 1
};

int main(int argc, char **argv)
{
    starpu_data_handle_t x_handle[10];
    starpu_init(NULL);
    for (int i = 0; i < 10; ++i)
        starpu_variable_data_register(&x_handle[i], -1, 0, sizeof(float));
    for (int i = 0; i < 10; ++i)
        starpu_task_insert(&clear_codelet, STARPU_W, x_handle[i], 0);
    starpu_task_wait_for_all();
    for (int i = 0; i < 10; ++i)
        starpu_data_unregister(x_handle[i]);
    starpu_shutdown();
    return 0;
}
@MaximeGonthier: could you have a look? The problem is not calling starpu_worker_can_execute_task_first_impl to check that the task can be run by the PU you are aiming at. That will have to be done when picking up a random task at the beginning, when selecting a "good data", and when pushing all the "free tasks": only those that can actually be executed there should be pushed.
(@Muxas' testcases can be really interesting for DARTS, and most StarPU testsuite failures are currently due to this issue.)
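A minimal sketch of the suggested guard (my own illustration, not the actual DARTS code), using the existing StarPU helper:

#include <starpu.h>

/* Before queueing a task on a worker, check that at least one
 * implementation of its codelet can actually run there. */
static int can_push_to_worker(unsigned workerid, struct starpu_task *task)
{
    unsigned nimpl;
    /* Nonzero iff some implementation of the task's codelet can run on
     * this worker; nimpl receives the first valid implementation index. */
    return starpu_worker_can_execute_task_first_impl(workerid, task, &nimpl);
}

Every place DARTS picks a task (the initial random pick, the "good data" selection, and the free-task push) would gate on such a check.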
Thank you for the examples. I'll work on it as soon as possible. I'll keep you updated.
@Muxas The Assertion `0 && "starpu_sched_component_can_execute_task(component,task)"` error has been resolved for the DARTS scheduler on the master branch.
Please let me know if this issue arises again or if any other issue with DARTS is found.
Trying the latest commit 882aba682cec925bf6bd226c210641bd80b0795d shows there is still a problem. I tried this example:
#include <starpu.h>
#include <stdio.h>
#include <cuda.h>

void clear_func(void *buffers[], void *cl_args)
{
    float *x = (float *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    cudaStream_t stream = starpu_cuda_get_local_stream();
    /* Asynchronously zero the variable on StarPU's local CUDA stream */
    cudaMemsetAsync(x, 0, sizeof(float), stream);
    printf("clear_func\n");
}

struct starpu_codelet clear_codelet =
{
    .cuda_funcs = {clear_func},
    .cuda_flags = {STARPU_CUDA_ASYNC},
    .modes = {STARPU_W},
    .nbuffers = 1
};

int main(int argc, char **argv)
{
    starpu_data_handle_t x_handle[10];
    starpu_init(NULL);
    for (int i = 0; i < 10; ++i)
        starpu_variable_data_register(&x_handle[i], -1, 0, sizeof(float));
    for (int i = 0; i < 10; ++i)
        starpu_task_insert(&clear_codelet, STARPU_W, x_handle[i], 0);
    starpu_task_wait_for_all();
    for (int i = 0; i < 10; ++i)
        starpu_data_unregister(x_handle[i]);
    starpu_shutdown();
    return 0;
}
Here is a config.log
The error itself:
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0xbbbd1)[0x155554fb3bd1]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_data_expected_transfer_time+0x83)[0x155554f774e3]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x137453)[0x15555502f453]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x13d323)[0x155555035323]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pump_to+0x63)[0x155555012023]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x13c3c5)[0x1555550343c5]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_send_can_push_to_parents+0x4b)[0x1555550132cb]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x11bfab)[0x155555013fab]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x155555011dd6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x129c17)[0x155555021c17]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x155555011dd6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x116e48)[0x15555500ee48]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(starpu_sched_component_pull_task+0x16)[0x155555011dd6]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x87125)[0x155554f7f125]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0xb797e)[0x155554faf97e]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(_starpu_cpu_driver_run_once+0x8c)[0x15555503a89c]
/trinity/home/al.mikhalev/Install/starpu-1.5-a100/lib/libstarpu-1.4.so.1(+0x142fe5)[0x15555503afe5]
/lib64/libpthread.so.0(+0x7ea5)[0x155554ce3ea5]
/lib64/libc.so.6(clone+0x6d)[0x1555544ecb0d]
[starpu][_starpu_select_src_node][assert failure] The data for the handle 0xaa9d50 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?
a2.out: ../../src/datawizard/coherency.c:69: _starpu_select_src_node: Assertion `0 && "src_node_mask != 0"' failed.
Full backtrace:
(gdb) bt full
#0 0x0000155554424387 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x0000155554425a78 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x000015555441d1a6 in __assert_fail_base () from /lib64/libc.so.6
No symbol table info available.
#3 0x000015555441d252 in __assert_fail () from /lib64/libc.so.6
No symbol table info available.
#4 0x0000155554fb3c21 in _starpu_select_src_node (handle=handle@entry=0xaa9d50, destination=destination@entry=0)
at ../../src/datawizard/coherency.c:69
src_node = -1
i = <optimized out>
nnodes = 2
node = <optimized out>
size = 4
cost = inf
src_node_mask = <optimized out>
__func__ = "_starpu_select_src_node"
__PRETTY_FUNCTION__ = "_starpu_select_src_node"
i_ram = <optimized out>
i_gpu = <optimized out>
i_disk = <optimized out>
#5 0x0000155554f774e3 in starpu_data_expected_transfer_time (mode=STARPU_R, memory_node=0, handle=0xaa9d50)
at ../../src/core/perfmodel/perfmodel.c:458
size = 4
duration = 0
src_node = <optimized out>
size = <optimized out>
duration = <optimized out>
src_node = <optimized out>
#6 starpu_data_expected_transfer_time (handle=0xaa9d50, memory_node=memory_node@entry=0, mode=mode@entry=STARPU_R)
at ../../src/core/perfmodel/perfmodel.c:433
size = <optimized out>
duration = <optimized out>
src_node = <optimized out>
#7 0x000015555502f453 in _starpu_darts_scheduling_3D_matrix (main_task_list=main_task_list@entry=0xa91ba0, current_gpu=current_gpu@entry=0, g=0xa91bf0, current_worker_id=<optimized out>) at ../../src/sched_policies/darts.c:1889
i = <optimized out>
e = 0xaa2570
remaining_expected_length_max = 0
best_1_from_free_task = 0x0
temp_best_1_from_free_task = <optimized out>
number_free_task_max = 0
temp_number_free_task_max = <optimized out>
number_1_from_free_task_max = 0
temp_number_1_from_free_task_max = <optimized out>
priority_max = -2147483648
temp_priority_max = <optimized out>
transfer_time_min = 1.7976931348623157e+308
temp_transfer_time_min = <optimized out>
ratio_transfertime_freetask_min = 1.7976931348623157e+308
temp_length_free_tasks_max = <optimized out>
handle_popped = 0x0
hud = <optimized out>
data_chosen_index = 0
__PRETTY_FUNCTION__ = "_starpu_darts_scheduling_3D_matrix"
__func__ = "_starpu_darts_scheduling_3D_matrix"
data_not_available = <optimized out>
data_available = true
choose_best_data_threshold = 2147483647
i = <optimized out>
#8 0x0000155555035323 in get_task_to_return_pull_task_darts (current_worker_id=<optimized out>, l=0xa91ba0, current_gpu=0)
at ../../src/sched_policies/darts.c:2531
task = <optimized out>
__func__ = "get_task_to_return_pull_task_darts"
__PRETTY_FUNCTION__ = "get_task_to_return_pull_task_darts"
task = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
task = <optimized out>
i = <optimized out>
hud = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
task = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
i = <optimized out>
hud = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
#9 darts_pull_task (component=<optimized out>, to=<optimized out>) at ../../src/sched_policies/darts.c:2685
data = <optimized out>
__func__ = "darts_pull_task"
current_gpu = 0
task = <optimized out>
#10 0x0000155555012023 in starpu_sched_component_pump_to (component=component@entry=0xa91a90, child=child@entry=0xa91d80, success=success@entry=0x15550c87b40c) at ../../src/sched_policies/component_sched.c:411
ret = 0
task = <optimized out>
#11 0x00001555550343c5 in darts_can_push (component=0xa91a90, to=0xa91d80) at ../../src/sched_policies/darts.c:3113
didwork = 0
task = <optimized out>
__func__ = "darts_can_push"
#12 0x00001555550132cb in starpu_sched_component_send_can_push_to_parents (component=component@entry=0xa91d80) at ../../src/sched_policies/component_sched.c:685
__PRETTY_FUNCTION__ = "starpu_sched_component_send_can_push_to_parents"
i = 0
ret = 0
#13 0x0000155555013fab in fifo_pull_task (component=0xa91d80, to=<optimized out>) at ../../src/sched_policies/component_fifo.c:179
__PRETTY_FUNCTION__ = "fifo_pull_task"
data = 0xa91e90
queue = 0xa91e90
mutex = 0xa91ee0
now = 775762.92599999998
__func__ = "fifo_pull_task"
task = <optimized out>
#14 0x0000155555011dd6 in starpu_sched_component_pull_task (from=0xa91d80, to=to@entry=0xa95a90) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#15 0x0000155555021c17 in best_implementation_pull_task (component=0xa95a90, from=<optimized out>) at ../../src/sched_policies/component_best_implementation.c:108
task = 0x0
i = 0
#16 0x0000155555011dd6 in starpu_sched_component_pull_task (from=0xa95a90, to=to@entry=0xa957b0) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#17 0x000015555500ee48 in simple_worker_pull_task (component=0xa957b0, to=<optimized out>) at ../../src/sched_policies/component_worker.c:464
now = 775762.67500000005
workerid = 1
worker = <optimized out>
data = 0xa958c0
list = 0xa95a20
task = 0x0
i = 0
n_tries = 2
__func__ = "simple_worker_pull_task"
#18 0x0000155555011dd6 in starpu_sched_component_pull_task (from=0xa957b0, to=0x0) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#19 0x0000155554f7f125 in _starpu_pop_task (worker=worker@entry=0x1555552bd160 <_starpu_config+5952>) at ../../src/core/sched_policy.c:1066
sched_ctx = 0x1555552fde08 <_starpu_config+271336>
task = 0x0
worker_id = <optimized out>
node = <optimized out>
profiling = 0
pop_start_time = {tv_sec = 1, tv_nsec = 4294967295}
pick = <optimized out>
i = <optimized out>
nbuffers = <optimized out>
#20 0x0000155554faf97e in _starpu_get_worker_task (worker=worker@entry=0x1555552bd160 <_starpu_config+5952>, workerid=workerid@entry=1, memnode=memnode@entry=0) at ../../src/drivers/driver_common/driver_common.c:431
task = <optimized out>
keep_awake = <optimized out>
__func__ = "_starpu_get_worker_task"
#21 0x000015555503a89c in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x1555552bd160 <_starpu_config+5952>) at ../../src/drivers/cpu/driver_cpu.c:617
memnode = 0
workerid = 1
pi = {conf = 0x0, event_type = starpu_prof_tool_event_driver_init_end, starpu_version = {1, 4, 99},
thread_id = 210409216, worker_id = 1, task_name = 0x0, model_name = 0x0, device_number = 0,
driver_type = starpu_prof_tool_driver_cpu, memnode = 4294967295, bytes_to_transfer = 0, bytes_transfered = 0,
fun_ptr = 0x0}
res = <optimized out>
j = <optimized out>
task = 0x0
pending_task = 0x0
rank = 0
__func__ = "_starpu_cpu_driver_run_once"
continuation_wake_up = <optimized out>
__PRETTY_FUNCTION__ = "_starpu_cpu_driver_run_once"
#22 0x000015555503afe5 in _starpu_cpu_worker (arg=0x1555552bd160 <_starpu_config+5952>) at ../../src/drivers/cpu/driver_cpu.c:729
worker = 0x1555552bd160 <_starpu_config+5952>
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_transfer, starpu_version = {1, 4, 99},
thread_id = 210409216, worker_id = 1, task_name = 0x0, model_name = 0x0, device_number = 1,
driver_type = starpu_prof_tool_driver_cpu, memnode = 0, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
#23 0x0000155554ce3ea5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#24 0x00001555544ecb0d in clone () from /lib64/libc.so.6
No symbol table info available.
Ah, thanks! I'm working on it.
@MaximeGonthier does DARTS pay attention to STARPU_TASK_GET_MODE(task, i)? Tasks only need to load data when (STARPU_TASK_GET_MODE(task, i) & STARPU_R) != 0. We indeed probably never tried DARTS on a task with only STARPU_W.
@MaximeGonthier probably the case that DARTS was never tried against was task_insert(t1, A, STARPU_W); task_insert(t2, A, STARPU_R); task_insert(t3, A, STARPU_W), in which case the value computed by t1 is indeed dropped once t2 is over. In other words, when DARTS receives a task that only writes to a data, it should mostly forget about this data.
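For concreteness, here is that pattern as actual submissions (a hedged sketch; cl_write and cl_read are hypothetical codelets that respectively produce and consume their single buffer, and A is a registered starpu_data_handle_t):

starpu_task_insert(&cl_write, STARPU_W, A, 0); /* t1: produces A                */
starpu_task_insert(&cl_read,  STARPU_R, A, 0); /* t2: last reader of t1's value */
starpu_task_insert(&cl_write, STARPU_W, A, 0); /* t3: overwrites A entirely     */

Once t2 completes, no task will ever read the value t1 produced, so the scheduler gains nothing from tracking affinity between t3 and A.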
DARTS uses STARPU_TASK_GET_MODE only to ignore data of type STARPU_SCRATCH and STARPU_REDUX, nothing else. But this should not cause any issue, since DARTS does not directly load data; it just returns a task. Maybe it's when it reads the data that it causes a crash?
In other words, when DARTS receives a task that only writes to a data, it should mostly forget about this data.
Yes, OK, I see. I'll work on that, thanks.
But this should not cause any issue since DARTS does not directly load data it just returns a task. Maybe it's when it reads the data that it causes a crash?
See the backtrace: DARTS is calling starpu_data_expected_transfer_time, but when the data is getting discarded by a STARPU_W access, there is no data to transfer, so calling that function does not make sense. Put another way, the presence of the hardcoded STARPU_R in DARTS is bogus; it has to use the task's access mode.
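A minimal sketch of the guard being suggested, assuming task, handle, memory_node and the buffer index i are in scope (my illustration, not the DARTS source):

enum starpu_data_access_mode mode = STARPU_TASK_GET_MODE(task, i);
double transfer_time = 0.0;
/* Only a read incurs a transfer; a pure STARPU_W access discards
 * the previous value, so there is nothing to fetch. */
if (mode & STARPU_R)
    transfer_time = starpu_data_expected_transfer_time(handle, memory_node, mode);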
Yes, I just figured out that starpu_data_expected_transfer_time is the culprit in that case :)
the presence of the hardcoded STARPU_R in DARTS is bogus
Yes I agree. Do you think I should check STARPU_TASK_GET_MODE(task, i) before each time a data is read, or only for functions like starpu_data_expected_transfer_time that are supposed to give me an output that cannot exist if it's only a W?
Yes I agree. Do you think I should check STARPU_TASK_GET_MODE(task, i) before each time a data is read
It depends on what you are doing with the data. If e.g. you record where a piece of data will be (to tend to put tasks that will read it there), you'll still want to see that even if the mode is STARPU_W.
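A hedged sketch of that distinction (record_placement and add_transfer_estimate are hypothetical helpers, not StarPU or DARTS API): placement tracking sees every mode, while transfer-time estimation only applies to reads.

enum starpu_data_access_mode mode = STARPU_TASK_GET_MODE(task, i);
/* Always record where the data will end up: a writer still creates
 * affinity for future readers, even with a pure STARPU_W mode. */
record_placement(handle, memory_node);
/* But only estimate a transfer when the task actually reads. */
if (mode & STARPU_R)
    add_transfer_estimate(starpu_data_expected_transfer_time(handle, memory_node, mode));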
@Muxas the issue is fixed in the last commit from master (eedd54ac). The simple example presented above is now working. Thanks!
bb2b4a18 is a better version of the fix
I checked it on my side. The provided examples (presented above in this issue) work, but my software still fails with assert errors.
I tried the master branch of the GitLab remote, commit bb2b4a186b9c612aac499f6e9fcdf93f8c906d76. Here is a config.log
The error:
Thread 11 "CUDA 6" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x154b00cf8700 (LWP 2476737)]
_starpu_darts_scheduling_3D_matrix (main_task_list=main_task_list@entry=0x55555cecc0f0, current_gpu=current_gpu@entry=7, g=0x55555cecc338, current_worker_id=<optimized out>) at ../../src/sched_policies/darts.c:1961
1961 for (t = _starpu_darts_task_using_data_list_begin(e->D->sched_data); t != _starpu_darts_task_using_data_list_end(e->D->sched_data); t = _starpu_darts_task_using_data_list_next(t))
Backtrace:
(gdb) bt full
#0 _starpu_darts_scheduling_3D_matrix (main_task_list=main_task_list@entry=0x55555cecc0f0, current_gpu=current_gpu@entry=7,
g=0x55555cecc338, current_worker_id=<optimized out>) at ../../src/sched_policies/darts.c:1961
t = <optimized out>
i = <optimized out>
e = 0x55555748fa00
remaining_expected_length_max = 0
best_1_from_free_task = 0x0
temp_best_1_from_free_task = 0x0
number_free_task_max = 0
temp_number_free_task_max = 0
number_1_from_free_task_max = 0
temp_number_1_from_free_task_max = 0
priority_max = -2147483648
temp_priority_max = -2147483648
transfer_time_min = 1.7976931348623157e+308
temp_transfer_time_min = 0
ratio_transfertime_freetask_min = 1.7976931348623157e+308
temp_length_free_tasks_max = 0
handle_popped = 0x0
hud = <optimized out>
data_chosen_index = 0
__PRETTY_FUNCTION__ = "_starpu_darts_scheduling_3D_matrix"
__func__ = "_starpu_darts_scheduling_3D_matrix"
data_not_available = <optimized out>
data_available = true
choose_best_data_threshold = 2147483647
i = <optimized out>
#1 0x000015548e027fc3 in get_task_to_return_pull_task_darts (current_worker_id=<optimized out>, l=0x55555cecc0f0, current_gpu=7)
at ../../src/sched_policies/darts.c:2587
task = <optimized out>
__func__ = "get_task_to_return_pull_task_darts"
__PRETTY_FUNCTION__ = "get_task_to_return_pull_task_darts"
task = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
task = <optimized out>
i = <optimized out>
hud = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
task = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
i = <optimized out>
hud = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
p_ret = <optimized out>
__ptrs = {<optimized out> <repeats 32 times>}
__n = <optimized out>
#2 darts_pull_task (component=<optimized out>, to=<optimized out>) at ../../src/sched_policies/darts.c:2741
data = <optimized out>
__func__ = "darts_pull_task"
current_gpu = 7
task = <optimized out>
#3 0x000015548e006973 in starpu_sched_component_pump_to (component=component@entry=0x55555cecbfe0, child=child@entry=0x55555cee92e0, success=success@entry=0x154b00cf734c) at ../../src/sched_policies/component_sched.c:411
ret = 0
task = <optimized out>
#4 0x000015548e027065 in darts_can_push (component=0x55555cecbfe0, to=0x55555cee92e0) at ../../src/sched_policies/darts.c:3165
didwork = 0
task = <optimized out>
__func__ = "darts_can_push"
#5 0x000015548e007c1b in starpu_sched_component_send_can_push_to_parents (component=component@entry=0x55555cee92e0) at ../../src/sched_policies/component_sched.c:685
__PRETTY_FUNCTION__ = "starpu_sched_component_send_can_push_to_parents"
i = 0
ret = 0
#6 0x000015548e0088fb in fifo_pull_task (component=0x55555cee92e0, to=<optimized out>) at ../../src/sched_policies/component_fifo.c:179
__PRETTY_FUNCTION__ = "fifo_pull_task"
data = 0x55555cee93f0
queue = 0x55555cee93f0
mutex = 0x55555cee9440
now = 3916214.7369999997
__func__ = "fifo_pull_task"
task = <optimized out>
#7 0x000015548e006726 in starpu_sched_component_pull_task (from=0x55555cee92e0, to=to@entry=0x55555cf1bfd0) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#8 0x000015548e016567 in best_implementation_pull_task (component=0x55555cf1bfd0, from=<optimized out>) at ../../src/sched_policies/component_best_implementation.c:108
task = 0x0
i = 0
#9 0x000015548e006726 in starpu_sched_component_pull_task (from=0x55555cf1bfd0, to=to@entry=0x55555cf1bcf0) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#10 0x000015548e003798 in simple_worker_pull_task (component=0x55555cf1bcf0, to=<optimized out>) at ../../src/sched_policies/component_worker.c:464
now = 3916214.5819999999
workerid = 6
worker = <optimized out>
data = 0x55555cf1be00
list = 0x55555cf1bf60
task = 0x0
i = 0
n_tries = 1
__func__ = "simple_worker_pull_task"
#11 0x000015548e006726 in starpu_sched_component_pull_task (from=0x55555cf1bcf0, to=0x0) at ../../src/sched_policies/component_sched.c:391
task = <optimized out>
#12 0x000015548df73cc5 in _starpu_pop_task (worker=worker@entry=0x15548e2b4560 <_starpu_config+15232>) at ../../src/core/sched_policy.c:1066
sched_ctx = 0x15548e2f2dc8 <_starpu_config+271336>
task = 0x0
worker_id = <optimized out>
node = <optimized out>
profiling = 0
pop_start_time = {tv_sec = 23452907528816, tv_nsec = 23411880327648}
pick = <optimized out>
i = <optimized out>
nbuffers = <optimized out>
#13 0x000015548dfa4bd4 in _starpu_get_multi_worker_task (workers=<optimized out>, tasks=tasks@entry=0x154b00cf7bd0, nworkers=<optimized out>, memnode=<optimized out>) at ../../src/drivers/driver_common/driver_common.c:583
keep_awake = 0
i = <optimized out>
count = 0
j = <optimized out>
is_parallel_task = <optimized out>
combined_worker = <optimized out>
__func__ = "_starpu_get_multi_worker_task"
#14 0x000015548e035988 in _starpu_cuda_driver_run_once (worker=<optimized out>, worker@entry=0x15548e2b4560 <_starpu_config+15232>) at ../../src/drivers/cuda/driver_cuda.c:2404
worker_set = 0x15548e321ec0 <cuda_worker_set+768>
worker0 = 0x15548e2b4560 <_starpu_config+15232>
tasks = 0x154b00cf7bd0
task = <optimized out>
j = <optimized out>
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_transfer, starpu_version = {1, 4, 99}, thread_id = 13600512, worker_id = 6, task_name = 0x0, model_name = 0x0, device_number = 6, driver_type = starpu_prof_tool_driver_gpu, memnode = 7, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
i = <optimized out>
res = <optimized out>
idle_tasks = 1
idle_transfers = 1
__func__ = "_starpu_cuda_driver_run_once"
__PRETTY_FUNCTION__ = "_starpu_cuda_driver_run_once"
#15 0x000015548e03622d in _starpu_cuda_worker (_arg=0x15548e2b4560 <_starpu_config+15232>) at ../../src/drivers/cuda/driver_cuda.c:2476
worker = 0x15548e2b4560 <_starpu_config+15232>
worker_set = 0x15548e321ec0 <cuda_worker_set+768>
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_transfer, starpu_version = {1, 4, 99}, thread_id = 13600512, worker_id = 6, task_name = 0x0, model_name = 0x0, device_number = 6, driver_type = starpu_prof_tool_driver_gpu, memnode = 7, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
i = <optimized out>
#16 0x000015555511dea5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#17 0x0000155554535b0d in clone () from /lib64/libc.so.6
No symbol table info available.
Is there a larger working example of your software, or can I easily try your software directly?
Well, I was going to create a Docker image for you, but with the latest change (unlink libnvidia-ml) a Docker image with the latest master branch of StarPU fails to build. Even export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 does not help. I do not know how to solve the issue.
The software is NNTile. You can try to follow the instructions from a Dockerfile to build it. After NNTile is built, I will provide an example script to run it.
Alternatively, you can follow the README of NNTile:
- pull the docker image with StarPU-1.3.11 (docker pull ghcr.io/skolai/nntile:1.0.0-starpu1.3.11-cuda12.2.0-ubuntu22.04)
- start a docker container based on this image
- run:
CUDA_VISIBLE_DEVICES=0 STARPU_NCPU=2 python /workspace/nntile/wrappers/python/examples/gpt2_custom_training.py --config-path=/workspace/nntile/wrappers/python/examples/gpt2_default_config.json --tokenizer=gpt2 --tokenizer-path=data --batch=1024 --minibatch=4 --minibatch-tile=4 --seq-tile=1024 --embd-tile=768 --inner-tile=3072 --head-tile=12 --restrict=cuda --flashattention --nforward=10 --nforward-warmup=10 --nbackward=10 --nbackward-warmup=10 --dataset=WikiText-103 --dataset-path=data --dataset-select=40000 --optimizer=fusedadamw --optimizer-eps=1e-8 --weight-decay=0.1 --loss-reduction=mean --lr=3e-4 --start-lr=0 --full-lr-iter=10 --nepochs=1 --nepochs-warmup=1
@MaximeGonthier for a start you can run STARPU_SCHED=darts make check; it should uncover quite a few issues already.
Hi,
Sorry for leaving your issue pending @Muxas, I was pretty busy the last few months but have more time now :) I've pushed some new fixes for the darts scheduler, both in early May and today.
Would you be able to try your code with the latest commit from StarPU?
In the meantime I'll also try to set up your software on our cluster to test it with darts.
Thank you
Hi!
I tried the latest commit 23fd79193ceccd1ad5223e510110d3e320ebf160 of https://gitlab.inria.fr/starpu/starpu and get the same error:
[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x55555d087330 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?
python3: ../../src/datawizard/coherency.c:69: _starpu_select_src_node: Assertion `0 && "src_node_mask != 0"' failed.
The fact that your fixes did not solve the problem may sound depressing, but the first stage (forward pass) of training NNs now works! During the first stage the neural network generates lots of temporary data that is used only in the second stage (backward pass).
That is good news indeed :)
Then I'm guessing that handle 0x55555d087330 is one of the temporary data used only in the second phase.
I'm wondering if this might be because I am ignoring some temporary data with:
#define STARPU_IGNORE_UTILITIES_HANDLES(task, index) if ((STARPU_TASK_GET_MODE(task, index) & STARPU_SCRATCH) || (STARPU_TASK_GET_MODE(task, index) & STARPU_REDUX)) { continue; }
I'll work on it :)
Do you have the trace/full error message? I would like to check if this is also caused by starpu_data_expected_transfer_time.
@MaximeGonthier the same kind of issue is visible with the examples/filters/fread, fmultiple_manual, fmultiple_submit_readonly, fmultiple_submit_readonly_downgrade, fmultiple_submit_implicit, frecursive tests; the backtrace looks like this:
#5 0x00007fcd6a8c1e32 in __GI___assert_fail (assertion=0x7fcd6acede01 "0 && \"src_node_mask != 0\"", file=0x7fcd6aceddea "datawizard/coherency.c", line=69, function=0x7fcd6acee5c0 <__PRETTY_FUNCTION__.29> "_starpu_select_src_node") at ./assert/assert.c:101
#6 0x00007fcd6ab9dbc5 in _starpu_select_src_node (handle=0x55a1641aaa70, destination=1) at datawizard/coherency.c:69
#7 0x00007fcd6ab4147a in starpu_data_expected_transfer_time (handle=0x55a1641aaa70, memory_node=1, mode=STARPU_R) at core/perfmodel/perfmodel.c:458
#8 0x00007fcd6ac8cfed in _starpu_darts_scheduling_3D_matrix (main_task_list=0x55a1640ff510, current_gpu=1, g=0x55a1640ff5a8, current_worker_id=0) at sched_policies/darts.c:1971
#9 0x00007fcd6ac91281 in get_task_to_return_pull_task_darts (current_gpu=1, l=0x55a1640ff510, current_worker_id=0) at sched_policies/darts.c:2602
It indeed seems doubtful to call starpu_data_expected_transfer_time on the data. Possibly its value got dropped or such, and we don't need the expected transfer time unless there is a task that actually really needs it. Actually, the code doesn't look like it checks for the task actually needing to read the data? Or is that implicit because it's in the _starpu_darts_task_using_data_list? But I don't see a test for that in e.g. initialize_task_data_gpu_single_task_no_dependencies. We don't want to associate tasks with data when the tasks are actually only writing to it, as there is no affinity at all in that case.
I agree with not associating the task with these data; however, it did not solve the issue.
The list of data we are looping over is _starpu_darts_gpu_data_not_used_list, and we do not add to this list the data that are accessed only with a write (in initialize_task_data_gpu_single_task_no_dependencies).
So maybe what's happening is that the data is used for a RW, the tasks using it in read mode have completed, and we are just left with tasks using it as a W; thus we don't need to check this data at all, and handling that would probably fix the issue.
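A hedged sketch of that suspected scenario (cl_rw and cl_w are hypothetical codelets, A a registered handle):

starpu_task_insert(&cl_rw, STARPU_RW, A, 0); /* reads and writes A; the read side completes */
starpu_task_insert(&cl_w,  STARPU_W,  A, 0); /* pure writer: A's current value has no remaining reader */

After the first task finishes, the only remaining consumer overwrites A, so estimating a transfer time for A no longer makes sense.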
Hi @Muxas @sthibaul, I've pushed in the latest commit (530b50da) a fix to allow darts to pass the tests from examples/filters/fread, fmultiple_manual, fmultiple_submit_readonly, fmultiple_submit_readonly_downgrade, fmultiple_submit_implicit, frecursive.
Hopefully it should also fix the "[starpu][_starpu_select_src_node][assert failure] The data for the handle 0x55555d087330 is requested, but the handle does not have a valid value. Perhaps some initialization task is missing?" error you had with your code.
@MaximeGonthier Hi! The first impression is that the error is gone. Now the second phase (the so-called backward pass) of training a neural network also works! But the last phase (updating the weights of the neural network) returns an error like:
double free or corruption (out)
[1] 52986 IOT instruction (core dumped)
It happens at the end of execution, during cleanup. Here is a backtrace:
#0 0x00007ffff7d2f743 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007ffff7d2f969 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2 0x00007ffff7d30ea0 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3 0x00007ffff7d33453 in free () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#4 0x00007fffd1a22b80 in cublasDestroy_v2 () from /home/jovyan/.mlspace/envs/nntile/bin/../lib/libcublas.so.12
No symbol table info available.
#5 0x00007fffd1a1a451 in ?? () from /home/jovyan/.mlspace/envs/nntile/bin/../lib/libcublas.so.12
No symbol table info available.
#6 0x00007fffd1a2bffc in cublasShutdown () from /home/jovyan/.mlspace/envs/nntile/bin/../lib/libcublas.so.12
No symbol table info available.
#7 0x00007fff31ffbbf6 in shutdown_cublas_func (args=<optimized out>) at ../../src/drivers/cuda/starpu_cublas.c:75
idx = 0
devid = <optimized out>
__func__ = "shutdown_cublas_func"
#8 0x00007fff31f9eae0 in wrapper_func (buffers=<optimized out>, _args=0x7fffffffa4a0) at ../../src/util/execute_on_all.c:43
args = 0x7fffffffa4a0
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_gpu_exec, starpu_version = {1, 4, 99}, thread_id = -478153152, worker_id = 0, task_name = 0x0, model_name = 0x0,
device_number = 0, driver_type = starpu_prof_tool_driver_gpu, memnode = 4294967295, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x7fff31ffbad0 <shutdown_cublas_func>}
worker = 0
#9 0x00007fff31ff9eb9 in start_job_on_cuda (pipeline_idx=1 '\001', worker=0x7fff3205a980 <_starpu_config+6528>, j=0x55555d5c90a0) at ../../src/drivers/cuda/driver_cuda.c:2139
func = 0x7fff31f9ea10 <wrapper_func>
task = 0x55555cafbd70
profiling = <optimized out>
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_gpu_exec, starpu_version = {1, 4, 99}, thread_id = -478153152, worker_id = 0,
task_name = 0x7fff32017dd4 "execute_on_all_wrapper", model_name = 0x7fff32017dd4 "execute_on_all_wrapper", device_number = 0, driver_type = starpu_prof_tool_driver_gpu,
memnode = 4294967295, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x7fff31f9ea10 <wrapper_func>}
cl = <optimized out>
task = <optimized out>
profiling = <optimized out>
pi = <optimized out>
cl = <optimized out>
func = <optimized out>
__func__ = "start_job_on_cuda"
__ptrs = <optimized out>
__n = <optimized out>
__ret = <optimized out>
__ptrs = <optimized out>
__n = <optimized out>
__ptrs = <optimized out>
__n = <optimized out>
__ptrs = <optimized out>
__n = <optimized out>
__args = <optimized out>
__args = <optimized out>
#10 execute_job_on_cuda (task=0x55555cafbd70, worker=0x7fff3205a980 <_starpu_config+6528>) at ../../src/drivers/cuda/driver_cuda.c:2164
workerid = 0
j = 0x55555d5c90a0
pipeline_idx = <optimized out>
__func__ = "execute_job_on_cuda"
#11 0x00007fff31ffa8f3 in _starpu_cuda_driver_run_once (worker=worker@entry=0x7fff3205a980 <_starpu_config+6528>) at ../../src/drivers/cuda/driver_cuda.c:2317
workerid = 0
memnode = 1
cures = <optimized out>
worker_set = 0x7fff320e0540 <cuda_worker_set>
worker0 = 0x7fff3205a980 <_starpu_config+6528>
tasks = 0x7ffee37feaf0
task = <optimized out>
j = 0x55555d5c90a0
pi = {conf = 0x0, event_type = starpu_prof_tool_event_end_transfer, starpu_version = {1, 4, 99}, thread_id = -478153152, worker_id = 0, task_name = 0x0, model_name = 0x0, device_number = 0, driver_type = starpu_prof_tool_driver_gpu, memnode = 1, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
i = 0
res = <optimized out>
idle_tasks = <optimized out>
idle_transfers = <optimized out>
__func__ = "_starpu_cuda_driver_run_once"
#12 0x00007fff31ffb2fe in _starpu_cuda_worker (_arg=0x7fff3205a980 <_starpu_config+6528>) at ../../src/drivers/cuda/driver_cuda.c:2492
worker = 0x7fff3205a980 <_starpu_config+6528>
worker_set = 0x7fff320e0540 <cuda_worker_set>
pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_transfer, starpu_version = {1, 4, 99}, thread_id = -478153152, worker_id = 0, task_name = 0x0, model_name = 0x0, device_number = 0, driver_type = starpu_prof_tool_driver_gpu, memnode = 1, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
i = <optimized out>
#13 0x00007ffff7d22ac3 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#14 0x00007ffff7db4850 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
Running it another time, I get:
free(): invalid pointer
It seems that DARTS set something to NULL and the NULL value is then freed. Here is a backtrace:
free(): invalid pointer
Thread 1 "python" received signal SIGABRT, Aborted.
0x00007ffff7d249fc in pthread_kill () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) bt full
#0 0x00007ffff7d249fc in pthread_kill () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007ffff7cd0476 in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2 0x00007ffff7cb67f3 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3 0x00007ffff7d17676 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#4 0x00007ffff7d2ecfc in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#5 0x00007ffff7d30a44 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#6 0x00007ffff7d33453 in free () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#7 0x00007fff31fe7117 in _starpu_darts_gpu_data_not_used_delete (e=<optimized out>) at ../../src/sched_policies/darts.c:92
No locals.
#8 _if_found_erase_data_from_data_not_used_yet_of_all_pu (data_to_remove=0x55555db365a0) at ../../src/sched_policies/darts.c:522
i = 0
hud = <optimized out>
#9 0x00007fff31fe71d5 in unregister_data_all_pu (data_to_remove=0x55555db365a0) at ../../src/sched_policies/darts.c:534
__func__ = "unregister_data_all_pu"
hud = <optimized out>
#10 0x00007fff31f90a7f in _starpu_data_unregister (handle=0x55555db365a0, coherent=<optimized out>, nowait=<optimized out>) at ../../src/datawizard/interfaces/data_interface.c:820
__func__ = "_starpu_data_unregister"
sequential_consistency = <optimized out>
node = <optimized out>
a = 0x55555d8c6b50
size = <optimized out>
__ptrs = <optimized out>
__n = <optimized out>
#11 0x00007fff3239cc4e in std::_Sp_counted_deleter<_starpu_data_state*, void (*)(_starpu_data_state*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>)
at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:527
No locals.
#12 0x00007fff323a108a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555db36f70)
at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:346
__wordbits = <optimized out>
__shiftbits = <optimized out>
__unique_ref = <optimized out>
__both_counts = <optimized out>
__lock_free = <optimized out>
__double_word = <optimized out>
__aligned = <optimized out>
#13 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x55555db36f70)
at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:317
__lock_free = true
__double_word = true
__aligned = true
__lock_free = <optimized out>
__double_word = <optimized out>
__aligned = <optimized out>
__wordbits = <optimized out>
__shiftbits = <optimized out>
__unique_ref = <optimized out>
__both_counts = <optimized out>
#14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:1071
No locals.
#15 std::__shared_ptr<_starpu_data_state, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:1524
No locals.
#16 std::__shared_ptr<_starpu_data_state, (__gnu_cxx::_Lock_policy)2>::reset (this=<optimized out>) at /home/jovyan/.mlspace/envs/nntile/x86_64-conda-linux-gnu/include/c++/12.3.0/bits/shared_ptr_base.h:1642
No locals.
#17 nntile::starpu::Handle::unregister (this=<optimized out>) at /home/jovyan/nntile/nntile/include/nntile/starpu/config.hh:254
No locals.
#18 nntile::tensor::Tensor<float>::unregister (this=0x55555db36230) at /home/jovyan/nntile/nntile/include/nntile/tensor/tensor.hh:132
i = 0
#19 0x00007fff323da05c in pybind11::cpp_function::cpp_function<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}::operator()(nntile::tensor::Tensor<float>*) const (__closure=<optimized out>, c=<optimized out>) at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/pybind11.h:111
f = <optimized out>
f = <optimized out>
#20 pybind11::detail::argument_loader<nntile::tensor::Tensor<float>*>::call_impl<void, pybind11::cpp_function::cpp_function<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&, 0ul, pybind11::detail::void_type>(pybind11::cpp_function::cpp_function<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&, std::integer_sequence<unsigned long, 0ul>, pybind11::detail::void_type&&) && (f=..., this=0x7fffffffacc0) at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/cast.h:1480
No locals.
#21 pybind11::detail::argument_loader<nntile::tensor::Tensor<float>*>::call<void, pybind11::detail::void_type, pybind11::cpp_function::cpp_function<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&>(pybind11::cpp_function::cpp_function<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&) && (f=..., this=0x7fffffffacc0) at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/cast.h:1454
No locals.
#22 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}, void, nntile::tensor::Tensor<float>*, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&&, void (*)(nntile::tensor::Tensor<float>*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::operator()(pybind11::detail::function_call&) const (__closure=0x0, call=...) at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/pybind11.h:254
args_converter = {static kwargs_pos = -1, static args_pos = -1, static arg_names = {text = "{%}"}, argcasters = {<std::_Tuple_impl<0, pybind11::detail::type_caster<nntile::tensor::Tensor<float>, void> >> = {<std::_Head_base<0, pybind11::detail::type_caster<nntile::tensor::Tensor<float>, void>, false>> = {_M_head_impl = {<pybind11::detail::type_caster_base<nntile::tensor::Tensor<float> >> = {<pybind11::detail::type_caster_generic> = {typeinfo = 0x55555b5fd690, cpptype = 0x7fff3240c888 <typeinfo for nntile::tensor::Tensor<float>>, value = 0x55555db36230}, static name = <same as static member of an already seen type>}, <No data fields>}}, <No data fields>}, <No data fields>}}
data = <optimized out>
policy = <optimized out>
cap = <optimized out>
result = <optimized out>
#23 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}, void, nntile::tensor::Tensor<float>*, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&&, void (*)(nntile::tensor::Tensor<float>*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/pybind11.h:224
No locals.
#24 0x00007fff323c2555 in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff19bc26e0, kwargs_in=0x0) at /home/jovyan/nntile/nntile/build-master/_deps/pybind11-src/include/pybind11/pybind11.h:946
guard = {parent = 0x0, keep_alive = {_M_h = {<std::__detail::_Hashtable_base<_object*, _object*, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Hashtable_traits<false, true, true> >> = {<std::__detail::_Hash_code_base<_object*, _object*, std::__detail::_Identity, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, false>> = {<std::__detail::_Hashtable_ebo_helper<1, std::hash<_object*>, true>> = {<std::hash<_object*>> = {<std::__hash_base<unsigned long, _object*>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <std::__detail::_Hashtable_ebo_helper<0, std::equal_to<_object*>, true>> = {<std::equal_to<_object*>> = {<std::binary_function<_object*, _object*, bool>> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <std::__detail::_Map_base<_object*, _object*, std::allocator<_object*>, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true>, true>> = {<No data fields>}, <std::__detail::_Insert<_object*, _object*, std::allocator<_object*>, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true>, true>> = {<std::__detail::_Insert_base<_object*, _object*, std::allocator<_object*>, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true> >> = {<No data fields>}, <No data fields>}, <std::__detail::_Rehash_base<_object*, _object*, std::allocator<_object*>, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true>, std::integral_constant<bool, true> >> = {<No data fields>}, <std::__detail::_Equality<_object*, _object*, std::allocator<_object*>, std::__detail::_Identity, std::equal_to<_object*>, std::hash<_object*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, true, true>, true>> = {<No data fields>}, <std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<_object*, false> > >> = {<std::__detail::_Hashtable_ebo_helper<0, std::allocator<std::__detail::_Hash_node<_object*, false> >, true>> = {<std::allocator<std::__detail::_Hash_node<_object*, false> >> = {<std::__new_allocator<std::__detail::_Hash_node<_object*, false> >> = {<No data fields>}, <No data fields>}, <No data fields>}, <No data fields>}, <std::_Enable_default_constructor<true, std::__detail::_Hash_node_base>> = {<No data fields>}, _M_buckets = 0x7fffffffae68, _M_bucket_count = 1, _M_before_begin = {_M_nxt = 0x0}, _M_element_count = 0, _M_rehash_policy = {static _S_growth_factor = 2, _M_max_load_factor = 1, _M_next_resize = 0}, _M_single_bucket = 0x0}}}
num_args = <optimized out>
call = {func = @0x55555b5ff0e0, args = {<std::_Vector_base<pybind11::handle, std::allocator<pybind11::handle> >> = {_M_impl = {<std::allocator<pybind11::handle>> = {<std::__new_allocator<pybind11::handle>> = {<No data fields>}, <No data fields>}, <std::_Vector_base<pybind11::handle, std::allocator<pybind11::handle> >::_Vector_impl_data> = {_M_start = 0x7ffecee5ed00, _M_finish = 0x7ffecee5ed08, _M_end_of_storage = 0x7ffecee5ed08}, <No data fields>}}, <No data fields>}, args_convert = {<std::_Bvector_base<std::allocator<bool> >> = {_M_impl = {<std::allocator<unsigned long>> = {<std::__new_allocator<unsigned long>> = {<No data fields>}, <No data fields>}, <std::_Bvector_base<std::allocator<bool> >::_Bvector_impl_data> = {_M_start = {<std::_Bit_iterator_base> = {<std::iterator<std::random_access_iterator_tag, bool, long, bool*, bool&>> = {<No data fields>}, _M_p = 0x7ffaac0a5240, _M_offset = 0}, <No data fields>}, _M_finish = {<std::_Bit_iterator_base> = {<std::iterator<std::random_access_iterator_tag, bool, long, bool*, bool&>> = {<No data fields>}, _M_p = 0x7ffaac0a5240, _M_offset = 1}, <No data fields>}, _M_end_of_storage = 0x7ffaac0a5248}, <No data fields>}}, <No data fields>}, args_ref = {<pybind11::handle> = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x0}, <No data fields>}, kwargs_ref = {<pybind11::handle> = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x0}, <No data fields>}, parent = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x7fff0b2d3cf0}, init_self = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x0}}
positional_args_copied = <optimized out>
kwargs = {<pybind11::object> = {<pybind11::handle> = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x0}, <No data fields>}, <No data fields>}
func = @0x55555b5ff0e0: {name = 0x55555b5ff040 "unregister", doc = 0x0, signature = 0x55555b5ff1e0 "(self: nntile.nntile_core.tensor.Tensor_fp32) -> None", args = {<std::_Vector_base<pybind11::detail::argument_record, std::allocator<pybind11::detail::argument_record> >> = {_M_impl = {<std::allocator<pybind11::detail::argument_record>> = {<std::__new_allocator<pybind11::detail::argument_record>> = {<No data fields>}, <No data fields>}, <std::_Vector_base<pybind11::detail::argument_record, std::allocator<pybind11::detail::argument_record> >::_Vector_impl_data> = {_M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}, <No data fields>}}, <No data fields>}, impl = 0x7fff323d9ff0 <pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}, void, nntile::tensor::Tensor<float>*, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, nntile::tensor::Tensor<float>, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (nntile::tensor::Tensor<float>::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(nntile::tensor::Tensor<float>*)#1}&&, void (*)(nntile::tensor::Tensor<float>*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&)>, data = {0x7fff323a0fc0 <nntile::tensor::Tensor<float>::unregister()>, 0x0, 0x0}, free_data = 0x0, policy = pybind11::return_value_policy::automatic, is_constructor = false, is_new_style_constructor = false, is_stateless = false, is_operator = false, is_method = true, is_setter = false, has_args = false, has_kwargs = false, prepend = false, nargs = 1, nargs_pos = 1, nargs_pos_only = 0, def = 0x55555b5ff170, scope = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x55555b5fe880}, sibling = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x555555a92c80 <_Py_NoneStruct>}, next = 0x0}
pos_args = <optimized out>
args_to_copy = <optimized out>
bad_arg = false
second_pass_convert = {<std::_Bvector_base<std::allocator<bool> >> = {_M_impl = {<std::allocator<unsigned long>> = {<std::__new_allocator<unsigned long>> = {<No data fields>}, <No data fields>}, <std::_Bvector_base<std::allocator<bool> >::_Bvector_impl_data> = {_M_start = {<std::_Bit_iterator_base> = {<std::iterator<std::random_access_iterator_tag, bool, long, bool*, bool&>> = {<No data fields>}, _M_p = 0x0, _M_offset = 0}, <No data fields>}, _M_finish = {<std::_Bit_iterator_base> = {<std::iterator<std::random_access_iterator_tag, bool, long, bool*, bool&>> = {<No data fields>}, _M_p = 0x0, _M_offset = 0}, <No data fields>}, _M_end_of_storage = 0x0}, <No data fields>}}, <No data fields>}
args_copied = <optimized out>
second_pass = {<std::_Vector_base<pybind11::detail::function_call, std::allocator<pybind11::detail::function_call> >> = {_M_impl = {<std::allocator<pybind11::detail::function_call>> = {<std::__new_allocator<pybind11::detail::function_call>> = {<No data fields>}, <No data fields>}, <std::_Vector_base<pybind11::detail::function_call, std::allocator<pybind11::detail::function_call> >::_Vector_impl_data> = {_M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}, <No data fields>}}, <No data fields>}
overloaded = <optimized out>
overloads = 0x55555b5ff0e0
it = 0x55555b5ff0e0
n_args_in = 1
parent = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x7fff0b2d3cf0}
result = {<pybind11::detail::object_api<pybind11::handle>> = {<pybind11::detail::pyobject_tag> = {<No data fields>}, <No data fields>}, m_ptr = 0x1}
self_value_and_holder = {inst = 0x0, index = 0, type = 0x0, vh = 0x0}
append_note_if_missing_header_is_suspected = <optimized out>
#25 0x0000555555755626 in cfunction_call (func=0x7fff32444860, args=<optimized out>, kwargs=<optimized out>) at /usr/local/src/conda/python-3.11.8/Objects/methodobject.c:542
tstate = 0x555555ad0558 <_PyRuntime+166328>
flags = <optimized out>
meth = <optimized out>
self = <optimized out>
result = <optimized out>
#26 0x0000555555734323 in _PyObject_MakeTpCall (tstate=0x555555ad0558 <_PyRuntime+166328>, callable=0x7fff32444860, args=<optimized out>, nargs=1, keywords=0x0) at /usr/local/src/conda/python-3.11.8/Objects/call.c:214
call = <optimized out>
argstuple = 0x7fff19bc26e0
kwdict = 0x0
result = 0x0
#27 0x0000555555741e36 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.8/Python/ceval.c:4769
is_meth = 1
total_args = 1
function = 0x7fff32444860
positional_args = <optimized out>
res = <optimized out>
opcode = <optimized out>
oparg = <optimized out>
eval_breaker = <optimized out>
cframe = {use_tracing = 0 '\000', current_frame = 0x7ffff7fb2240, previous = 0x555555ad06a8 <_PyRuntime+166664>}
call_shape = {kwnames = 0x0}
prev_cframe = <optimized out>
names = <optimized out>
consts = <optimized out>
first_instr = <optimized out>
next_instr = 0x7fff32aa54d0
stack_pointer = <optimized out>
exception_unwind = <optimized out>
dying = <optimized out>
__func__ = "_PyEval_EvalFrameDefault"
opcode_targets = {<_PyEval_EvalFrameDefault bytecode dispatch table, ~256 entries elided>}
#28 0x00005555557f842d in _PyEval_EvalFrame (throwflag=0, frame=0x7ffff7fb2020, tstate=0x555555ad0558 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.8/Include/internal/pycore_ceval.h:73
No locals.
#29 _PyEval_Vector (tstate=0x555555ad0558 <_PyRuntime+166328>, func=0x7ffff7c02020, locals=<optimized out>, args=0x0, argcount=0, kwnames=0x0) at /usr/local/src/conda/python-3.11.8/Python/ceval.c:6434
frame = 0x7ffff7fb2020
retval = <optimized out>
#30 0x00005555557f7abf in PyEval_EvalCode (co=0x555555c848a0, globals=<optimized out>, locals=0x7ffff79d0180) at /usr/local/src/conda/python-3.11.8/Python/ceval.c:1148
tstate = 0x555555ad0558 <_PyRuntime+166328>
builtins = <optimized out>
desc = {fc_globals = 0x7ffff79d0180, fc_builtins = 0x7ffff7bc0e40, fc_name = 0x555555aad030 <_PyRuntime+21648>, fc_qualname = 0x555555aad030 <_PyRuntime+21648>, fc_code = 0x555555c848a0, fc_defaults = 0x0, fc_kwdefaults = 0x0, fc_closure = 0x0}
func = 0x7ffff7c02020
res = <optimized out>
#31 0x0000555555816a1a in run_eval_code_obj (tstate=0x555555ad0558 <_PyRuntime+166328>, co=0x555555c848a0, globals=0x7ffff79d0180, locals=0x7ffff79d0180) at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1741
v = <optimized out>
#32 0x0000555555812593 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7ffff79d0180, locals=0x7ffff79d0180, flags=<optimized out>, arena=<optimized out>) at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1762
tstate = 0x555555ad0558 <_PyRuntime+166328>
co = 0x555555c848a0
v = <optimized out>
#33 0x0000555555827930 in pyrun_file (fp=fp@entry=0x555555b13370, filename=filename@entry=0x7ffff7bc64b0, start=start@entry=257, globals=globals@entry=0x7ffff79d0180, locals=locals@entry=0x7ffff79d0180, closeit=closeit@entry=1, flags=0x7fffffffb418) at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:1657
arena = 0x7ffff7b4b610
mod = 0x555555c5b4b0
ret = <optimized out>
#34 0x00005555558272ce in _PyRun_SimpleFileObject (fp=0x555555b13370, filename=0x7ffff7bc64b0, closeit=1, flags=0x7fffffffb418) at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:440
m = 0x7ffff7c24360
d = 0x7ffff79d0180
v = <optimized out>
set_file_name = <optimized out>
ret = -1
done = <optimized out>
pyc = <optimized out>
#35 0x0000555555826ff4 in _PyRun_AnyFileObject (fp=0x555555b13370, filename=0x7ffff7bc64b0, closeit=1, flags=0x7fffffffb418) at /usr/local/src/conda/python-3.11.8/Python/pythonrun.c:79
decref_filename = 0
res = <optimized out>
#36 0x00005555558216f4 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7ffff7bc64b0, program_name=0x7ffff79ecc30) at /usr/local/src/conda/python-3.11.8/Modules/main.c:360
fp = <optimized out>
sb = {st_dev = 41, st_ino = 9280118274347409797, st_nlink = 1, st_mode = 33188, st_uid = 1000, st_gid = 1000, __pad0 = 0, st_rdev = 0, st_size = 16956, st_blksize = 65536, st_blocks = 40, st_atim = {tv_sec = 1713174193, tv_nsec = 722310000}, st_mtim = {tv_sec = 1713173906, tv_nsec = 510452000}, st_ctim = {tv_sec = 1713173906, tv_nsec = 510452000}, __unused = {0, 0, 0}}
cf = {cf_flags = 0, cf_feature_version = 11}
run = <optimized out>
fp = <optimized out>
sb = <optimized out>
cf = <optimized out>
run = <optimized out>
ch = <optimized out>
#37 pymain_run_file (config=0x555555ab65a0 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.8/Modules/main.c:379
filename = 0x7ffff7bc64b0
program_name = 0x7ffff79ecc30
res = <optimized out>
filename = <optimized out>
program_name = <optimized out>
res = <optimized out>
#38 pymain_run_python (exitcode=0x7fffffffb414) at /usr/local/src/conda/python-3.11.8/Modules/main.c:601
main_importer_path = <optimized out>
interp = 0x555555ab61d8 <_PyRuntime+58936>
config = 0x555555ab65a0 <_PyRuntime+59904>
error = <optimized out>
main_importer_path = <optimized out>
interp = <optimized out>
config = <optimized out>
error = <optimized out>
done = <optimized out>
path0 = <optimized out>
res = <optimized out>
#39 Py_RunMain () at /usr/local/src/conda/python-3.11.8/Modules/main.c:680
exitcode = 0
#40 0x00005555557e7a77 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.8/Modules/main.c:734
args = {argc = 8, use_bytes_argv = 1, bytes_argv = 0x7fffffffb668, wchar_argv = 0x0}
#41 0x00007ffff7cb7d90 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#42 0x00007ffff7cb7e40 in __libc_start_main () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#43 0x00005555557e791d in _start ()
No symbol table info available.
Another problem is that DARTS gets stuck with no GPU utilization. Running DMDASD on the same task does not stall. And the performance of DARTS, when it does not stall, is 5 times lower than DMDASD on a fast-to-reproduce experiment: 5 Tflops/s against 25 Tflops/s.
Hi, thanks for the traces. I'll investigate the double-free errors. Regarding performance, I would recommend the following parameters to extract the best performance out of DARTS:
STARPU_SCHED_READY=1 STARPU_SCHED=darts STARPU_NTASKS_THRESHOLD=10 STARPU_CUDA_PIPELINE=4 STARPU_MINIMUM_CLEAN_BUFFERS=0 STARPU_TARGET_CLEAN_BUFFERS=0 STARPU_NCPU=0 STARPU_NCUDA=$((NGPU)) STARPU_NOPENCL=0 ./your_application
DARTS is not yet very good at balancing work between GPUs and CPUs and is thus often used in homogeneous settings; disabling the CPU and OpenCL workers achieves that. You can use STARPU_NCUDA to adjust how many GPUs will be used.
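For reference, these variables are read when starpu_init() is called, so they can also be set programmatically before initialization. A minimal sketch, mirroring the values from the command line above:
#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* StarPU reads these environment variables during starpu_init(),
     * so they must be set before it is called. */
    setenv("STARPU_SCHED", "darts", 1);
    setenv("STARPU_SCHED_READY", "1", 1);
    setenv("STARPU_NTASKS_THRESHOLD", "10", 1);
    setenv("STARPU_CUDA_PIPELINE", "4", 1);
    setenv("STARPU_MINIMUM_CLEAN_BUFFERS", "0", 1);
    setenv("STARPU_TARGET_CLEAN_BUFFERS", "0", 1);
    setenv("STARPU_NCPU", "0", 1);
    setenv("STARPU_NOPENCL", "0", 1);

    if (starpu_init(NULL) != 0)
        return 1;
    /* ... submit tasks here ... */
    starpu_task_wait_for_all();
    starpu_shutdown();
    return 0;
}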
Fully disabling CPU workers is not possible yet, as the data-preparation code has not been ported to GPU. However, once the data is in place, the training loop only uses the GPUs while the CPU workers do nothing.
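A minimal sketch of the constraint (the kernel and codelet names are hypothetical): the preparation step only has a CPU implementation, so at least one CPU worker has to stay enabled:
#include <starpu.h>

/* Hypothetical data-preparation kernel: it only exists as a CPU
 * implementation, which is why STARPU_NCPU=0 cannot be used here. */
void prepare_cpu(void *buffers[], void *cl_args)
{
    float *x = (float *)STARPU_VARIABLE_GET_PTR(buffers[0]);
    *x = 0.0f; /* placeholder initialization */
}

struct starpu_codelet prepare_codelet =
{
    .cpu_funcs = {prepare_cpu}, /* no .cuda_funcs: CPU-only for now */
    .modes = {STARPU_W},
    .nbuffers = 1
};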
STARPU_SCHED_READY=1 STARPU_SCHED=darts STARPU_NTASKS_THRESHOLD=10 STARPU_CUDA_PIPELINE=4 STARPU_MINIMUM_CLEAN_BUFFERS=0 STARPU_TARGET_CLEAN_BUFFERS=0 STARPU_NCPU=0 STARPU_NCUDA=$((NGPU)) STARPU_NOPENCL=0 ./your_application
This setup did not solve the problem for 4 GPUs: forward passes reach 6 Tflops/s (up from 5 Tflops/s) and backward passes stall.
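For reference, the throughput figures above come from timing each pass end to end; a measurement along these lines, where FLOPS_PER_PASS is an illustrative placeholder rather than the real flop count:
#include <stdio.h>
#include <starpu.h>

/* Illustrative placeholder, not the flop count of the actual experiment */
#define FLOPS_PER_PASS 1e12

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;
    double start = starpu_timing_now(); /* wall clock, in microseconds */
    /* ... starpu_task_insert() calls for one forward pass ... */
    starpu_task_wait_for_all();
    double elapsed = (starpu_timing_now() - start) * 1e-6; /* seconds */
    printf("throughput: %g Tflops/s\n", FLOPS_PER_PASS / elapsed / 1e12);
    starpu_shutdown();
    return 0;
}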
Steps to reproduce
I am trying to use the DARTS scheduler from the latest master branch (commit 4131e05d441f6aa3004632c61e982c63f2496cb9 on GitLab) and get the following error:
Full backtrace and config.log are here
At the same time, other schedulers, e.g., DMDASD, work without a problem.