starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1
61 stars 12 forks source link

StarPU-1.4.7 with `STARPU_PROFILING=0` asserts on `measured<0` #55

Closed Muxas closed 1 month ago

Muxas commented 1 month ago

Steps to reproduce

It happens rarely, on occasion. I have no reliable way to repeat it. However, if some test executable is run many times with STARPU_SCHED=dmda (any history-based scheduler fits), the issue appears. It happens with StarPU-1.4 branch (version 1.4.7) but not with StarPU-1.3 branch (version 1.3.11).

Preliminary checking with gdb showed worker->cl_start is suspicious. Comparing 1.3.11 and 1.4.7 versions I found strange guards on setting worker->cl_start value. Simply speaking, all the guards are about if profiling info is required. Moreover, setting STARPU_PROFILING=1 completely eliminates the issue. It might be some trouble with mutexes, but I did not dig further into the issue.

Obtained behavior

After running test executable many times the following output might appear:

/usr/local/lib/libstarpu-1.4.so.6(+0x1346c5)[0x7f99196206c5]
/usr/local/lib/libstarpu-1.4.so.6(_starpu_cpu_driver_run_once+0x36a)[0x7f991962138a]
/usr/local/lib/libstarpu-1.4.so.6(+0x13577d)[0x7f991962177d]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f991f66dac3]
/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f991f6fea04]

[starpu][_starpu_driver_update_job_feedback][assert failure] measured=-2261732866398.091797

python: drivers/driver_common/driver_common.c:247: _starpu_driver_update_job_feedback: Assertion `0 && "measured >= 0"' failed.

Here is the bt full option of gdb:

(gdb) bt full
#0  0x00007f42610e1720 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007f42610e171b in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2  0x00007f42610f2e96 in __assert_fail () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3  0x00007f425b0729ca in _starpu_driver_update_job_feedback (j=j@entry=0x55c64ccf5800, 
    worker=worker@entry=0x7f425b179620 <_starpu_config+8384>, perf_arch=perf_arch@entry=0x7f425b179668 <_starpu_config+8456>, 
    profiling=profiling@entry=0) at drivers/driver_common/driver_common.c:247
        measured = -2261866932365.2588
        profiling_info = 0x0
        measured_ts = {tv_sec = -2261867, tv_nsec = 67634741}
        workerid = 1
        cl = 0x7f425c366fa0 <nntile::starpu::conv2d_inplace::codelet_fp32>
        calibrate_model = <optimized out>
        updated = 0
        __func__ = "_starpu_driver_update_job_feedback"
        __PRETTY_FUNCTION__ = "_starpu_driver_update_job_feedback"
#4  0x00007f425b0f46c5 in execute_job_on_cpu (perf_arch=0x7f425b179668 <_starpu_config+8456>, rank=0, 
    cpu_args=0x7f425b179620 <_starpu_config+8384>, worker_task=0x55c64cdadfd0, j=0x55c64ccf5800) at drivers/cpu/driver_cpu.c:466
        is_parallel_task = 0
        func = 0x7f425b980128 <nntile::starpu::conv2d_inplace::cpu<nntile::fp32_t>(void**, void*)>
        profiling = 0
        task = <optimized out>
        cl = <optimized out>
        pi = {conf = 0x0, event_type = starpu_prof_tool_event_end_cpu_exec, starpu_version = {1, 4, 7}, thread_id = -1805654464, 
          worker_id = 0, device_number = 0, driver_type = starpu_prof_tool_driver_cpu, memnode = 4294967295, 
          bytes_to_transfer = 0, bytes_transfered = 0, 
          fun_ptr = 0x7f425b980128 <nntile::starpu::conv2d_inplace::cpu<nntile::fp32_t>(void**, void*)>}
        devid = 0
        is_parallel_task = <optimized out>
        profiling = <optimized out>
        task = <optimized out>
        cl = <optimized out>
        pi = <optimized out>
        devid = <optimized out>
--Type <RET> for more, q to quit, c to continue without paging--c
        __PRETTY_FUNCTION__ = "execute_job_on_cpu"
        __func__ = "execute_job_on_cpu"
        func = <optimized out>
        __ret = <optimized out>
        __ptrs = <optimized out>
        __n = <optimized out>
        p_ret = <optimized out>
        __ptrs = <optimized out>
        __n = <optimized out>
        __ptrs = <optimized out>
        __n = <optimized out>
        __args = <optimized out>
        __args = <optimized out>
        __args = <optimized out>
        __args = <optimized out>
        p_ret = <optimized out>
        __ptrs = <optimized out>
        __n = <optimized out>
        __args = <optimized out>
        __args = <optimized out>
#5  _starpu_cpu_driver_execute_task (cpu_worker=cpu_worker@entry=0x7f425b179620 <_starpu_config+8384>, task=task@entry=0x55c64cdadfd0, j=j@entry=0x55c64ccf5800) at drivers/cpu/driver_cpu.c:531
        res = <optimized out>
        rank = 0
        is_parallel_task = 0
        perf_arch = 0x7f425b179668 <_starpu_config+8456>
        __func__ = "_starpu_cpu_driver_execute_task"
        __PRETTY_FUNCTION__ = "_starpu_cpu_driver_execute_task"
        __ptrs = <optimized out>
        __n = <optimized out>
#6  0x00007f425b0f538a in _starpu_cpu_driver_run_once (cpu_worker=cpu_worker@entry=0x7f425b179620 <_starpu_config+8384>) at drivers/cpu/driver_cpu.c:596
        ret = <optimized out>
        memnode = 0
        workerid = 1
        pi = {conf = 0x0, event_type = starpu_prof_tool_event_end_transfer, starpu_version = {1, 4, 7}, thread_id = -1805654464, worker_id = 1, device_number = 1, driver_type = starpu_prof_tool_driver_cpu, memnode = 0, bytes_to_transfer = 3, bytes_transfered = 3, fun_ptr = 0x0}
        res = <optimized out>
        j = 0x55c64ccf5800
        task = 0x0
        pending_task = 0x55c64cdadfd0
        rank = 0
        __func__ = "_starpu_cpu_driver_run_once"
        continuation_wake_up = <optimized out>
        __PRETTY_FUNCTION__ = "_starpu_cpu_driver_run_once"
#7  0x00007f425b0f577d in _starpu_cpu_worker (arg=0x7f425b179620 <_starpu_config+8384>) at drivers/cpu/driver_cpu.c:714
        worker = 0x7f425b179620 <_starpu_config+8384>
        pi = {conf = 0x0, event_type = starpu_prof_tool_event_start_transfer, starpu_version = {1, 4, 7}, thread_id = -1805654464, worker_id = 1, device_number = 1, driver_type = starpu_prof_tool_driver_cpu, memnode = 0, bytes_to_transfer = 0, bytes_transfered = 0, fun_ptr = 0x0}
#8  0x00007f426114dac3 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#9  0x00007f42611dea04 in clone () from /usr/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Version of StarPU

The problem appears with StarPU-1.4.7 only with STARPU_PROFILING=0.

Muxas commented 1 month ago

Just a quick addition: it happens only during lots of very small tests (e.g., copy just several bytes).

sthibaul commented 1 month ago
        measured = -2261866932365.2588
        measured_ts = {tv_sec = -2261867, tv_nsec = 67634741}

This is very bogus. Could you print worker->cl_end and worker->cl_start too?

Comparing 1.3.11 and 1.4.7 versions I found strange guards on setting worker->cl_start value

I don't really see much difference between the two, except the addition of !_starpu_perf_counter_paused(), but it's not supposed to be 0 by default. Could you print _starpu_config.perf_counter_pause_depth to make sure?

Muxas commented 1 month ago

Indeed, perf counter was paused:

Thread 11 "CPU 0" hit Breakpoint 1, 0x00007faee8934720 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#1  0x00007faee893471b in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#2  0x00007faee8945e96 in __assert_fail () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#3  0x00007faee34d79ca in _starpu_driver_update_job_feedback (j=j@entry=0x55d481a0b4b0, 
    worker=worker@entry=0x7faee35ddee0 <_starpu_config+6528>, perf_arch=perf_arch@entry=0x7faee35ddf28 <_starpu_config+6600>, 
    profiling=profiling@entry=0) at drivers/driver_common/driver_common.c:247
247 drivers/driver_common/driver_common.c: No such file or directory.
(gdb) l
242 in drivers/driver_common/driver_common.c
(gdb) p worker->cl_start
$1 = {tv_sec = 2420647, tv_nsec = 337484598}
(gdb) p worker->cl_end  
$2 = {tv_sec = 1, tv_nsec = 270766077}
(gdb) p _starpu_config.perf_counter_pause_depth
$3 = 1
(gdb)
sthibaul commented 1 month ago

I guess it's calibrate_model which by bad luck switches to 1 between taking cl_start and cl_end, I have pushed a fix https://gitlab.inria.fr/starpu/starpu/-/commit/9508a660dbbf833dbe0cc49e13ba25f66f4f6261 it will be available in the 1.4 branch by tomorrow