Muxas closed 3 months ago
Just a quick addition: it happens only when running lots of very small tests (e.g., copying just a few bytes).
measured = -2261866932365.2588
measured_ts = {tv_sec = -2261867, tv_nsec = 67634741}
This is very bogus. Could you print worker->cl_end and worker->cl_start too?
Comparing 1.3.11 and 1.4.7 versions I found strange guards on setting worker->cl_start value
I don't really see much difference between the two, except the addition of !_starpu_perf_counter_paused(), but it's not supposed to be 0 by default. Could you print _starpu_config.perf_counter_pause_depth to make sure?
Indeed, perf counter was paused:
Thread 11 "CPU 0" hit Breakpoint 1, 0x00007faee8934720 in abort () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#1 0x00007faee893471b in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#2 0x00007faee8945e96 in __assert_fail () from /usr/lib/x86_64-linux-gnu/libc.so.6
(gdb) up
#3 0x00007faee34d79ca in _starpu_driver_update_job_feedback (j=j@entry=0x55d481a0b4b0,
worker=worker@entry=0x7faee35ddee0 <_starpu_config+6528>, perf_arch=perf_arch@entry=0x7faee35ddf28 <_starpu_config+6600>,
profiling=profiling@entry=0) at drivers/driver_common/driver_common.c:247
247 drivers/driver_common/driver_common.c: No such file or directory.
(gdb) l
242 in drivers/driver_common/driver_common.c
(gdb) p worker->cl_start
$1 = {tv_sec = 2420647, tv_nsec = 337484598}
(gdb) p worker->cl_end
$2 = {tv_sec = 1, tv_nsec = 270766077}
(gdb) p _starpu_config.perf_counter_pause_depth
$3 = 1
(gdb)
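To make the numbers above easier to check, here is a minimal sketch of the arithmetic, not StarPU code. The helpers ts_sub, ts_to_us and show_issue_values are hypothetical names introduced for illustration. It assumes the measurement is cl_end minus cl_start, normalised so that tv_nsec stays non-negative, and then converted to microseconds; that unit is inferred from the measured / measured_ts pair quoted earlier, which agree under this conversion.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not StarPU code): end - start, normalised so that
 * 0 <= tv_nsec < 1e9, which is the form shown for measured_ts. */
struct timespec ts_sub(struct timespec end, struct timespec start)
{
    struct timespec d;

    /* Both inputs are expected to be normalised already. */
    assert(end.tv_nsec >= 0 && end.tv_nsec < 1000000000L);
    assert(start.tv_nsec >= 0 && start.tv_nsec < 1000000000L);

    d.tv_sec = end.tv_sec - start.tv_sec;
    d.tv_nsec = end.tv_nsec - start.tv_nsec;
    if (d.tv_nsec < 0) {
        /* Borrow one second so the nanosecond part stays non-negative. */
        d.tv_sec -= 1;
        d.tv_nsec += 1000000000L;
    }
    return d;
}

/* Convert to microseconds (assumed unit of the "measured" value). */
double ts_to_us(struct timespec ts)
{
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Print the difference for the cl_start/cl_end pair from the gdb session. */
void show_issue_values(void)
{
    struct timespec cl_start = { .tv_sec = 2420647, .tv_nsec = 337484598 };
    struct timespec cl_end = { .tv_sec = 1, .tv_nsec = 270766077 };
    struct timespec d = ts_sub(cl_end, cl_start);

    printf("diff = {tv_sec = %lld, tv_nsec = %ld}, %.4f us\n",
           (long long)d.tv_sec, d.tv_nsec, ts_to_us(d));
}
```

With the gdb values, cl_end is a tiny, stale-looking timestamp while cl_start is a real one, so the difference comes out at roughly minus 2.4 million seconds, the same order of magnitude as the bogus measured value reported at the top of the thread.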
I guess it's calibrate_model, which by bad luck switches to 1 between taking cl_start and cl_end. I have pushed a fix: https://gitlab.inria.fr/starpu/starpu/-/commit/9508a660dbbf833dbe0cc49e13ba25f66f4f6261. It will be available in the 1.4 branch by tomorrow.
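The race described here can be modelled in a few lines. This is a toy sketch, not the StarPU source and not a description of what the linked commit does: toy_config, toy_worker, the racy_* and latched_* functions and the timed field are all hypothetical, and it assumes that "paused" means the pause depth is greater than zero. It shows why two independently guarded samples can disagree, and one generic way to avoid that by deciding once at the start.

```c
#include <stdbool.h>
#include <time.h>

/* Toy stand-ins, not the real StarPU structures. */
struct toy_config { int perf_counter_pause_depth; };
struct toy_worker {
    struct timespec cl_start;
    struct timespec cl_end;
    bool timed; /* hypothetical field, used only by the latched variant */
};

/* Assumption: "paused" means the pause depth is greater than zero. */
bool toy_paused(const struct toy_config *c)
{
    return c->perf_counter_pause_depth > 0;
}

/* Racy pattern: each sample consults the pause state on its own, so a
 * change between the two calls leaves one timestamp fresh and one stale. */
void racy_start(struct toy_worker *w, const struct toy_config *c,
                struct timespec now)
{
    if (!toy_paused(c))
        w->cl_start = now;
}

void racy_end(struct toy_worker *w, const struct toy_config *c,
              struct timespec now)
{
    if (!toy_paused(c))
        w->cl_end = now;
}

/* Latched variant: decide once at start and reuse that decision at end,
 * so either both timestamps are taken or neither is. */
void latched_start(struct toy_worker *w, const struct toy_config *c,
                   struct timespec now)
{
    w->timed = !toy_paused(c);
    if (w->timed)
        w->cl_start = now;
}

void latched_end(struct toy_worker *w, struct timespec now)
{
    if (w->timed)
        w->cl_end = now;
}

/* True when cl_end is not earlier than cl_start. */
bool interval_is_sane(const struct toy_worker *w)
{
    if (w->cl_end.tv_sec != w->cl_start.tv_sec)
        return w->cl_end.tv_sec > w->cl_start.tv_sec;
    return w->cl_end.tv_nsec >= w->cl_start.tv_nsec;
}
```

To replay the scenario, call racy_start with the depth at 0, set the depth to 1, then call racy_end: cl_end keeps its old value and interval_is_sane returns false. Running the same sequence through latched_start and latched_end gives a consistent pair. In the latched variant a caller would also need to skip the measurement entirely when timed is false.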
Steps to reproduce
It happens rarely, on occasion. I have no reliable way to reproduce it. However, if some test executable is run many times with STARPU_SCHED=dmda (any history-based scheduler fits), the issue appears. It happens with the StarPU-1.4 branch (version 1.4.7) but not with the StarPU-1.3 branch (version 1.3.11).
Preliminary checking with gdb showed that worker->cl_start is suspicious. Comparing versions 1.3.11 and 1.4.7, I found strange guards on setting the worker->cl_start value. Simply put, all the guards check whether profiling info is required. Moreover, setting STARPU_PROFILING=1 completely eliminates the issue. It might be some trouble with mutexes, but I did not dig further into the issue.
Obtained behavior
After running the test executable many times, the following output might appear:
Here is the output of bt full in gdb:
Version of StarPU
The problem appears with StarPU-1.4.7 only with STARPU_PROFILING=0.