Open mcourteaux opened 1 year ago
IIRC, profiler_by_timer was intended for embedded platforms without threading support, which probably explains the thread-unsafety. We should probably throw a spinlock in there.
Ah, okay. I was initially incorrectly assuming that it was gonna use RDTSC to profile the pipeline. Is there a generator feature documentation page somewhere?
The description of each target flag is hiding in HalideRuntime.h. We should probably surface it better:
https://halide-lang.org/docs/_halide_runtime_8h.html#a2ccb96b3d427fff8f1d68cc5f1e92f3a
I have found several times that my application freezes when I use the builtin
profiler-by-timer
in Halide.My last instance was plugged into the debugger, and I analyzed it. It seems that upon the freeze, the loops to iterate over the linked lists do an infinite loop:
https://github.com/halide/Halide/blob/fca8d9606145d0c42caac1954c37d86fa291fea0/src/runtime/profiler_common.cpp#L50-L58
The linked list has a loop in it! For my case, it seems that the length of the loop was 4, yet I have many more pipelines in the application.
I am currently using "
profiler_by_timer
" as a Generator feature, so I'm not using sampling. I checked the active threads, and all are waiting, and no halide-sampling threads are running. Upon a second run, the loop is of length one (meaning:p->next == p
).Doing some extra debugging, I figured out that the "by-timer" is referring to a hardware timer, rather than what I thought it was: inline instrumentation of the pipeline using RDTSC-like clock measurements.
Doing some more analysis work, I have a hypothesis:
bill_func
performs some "bubbling up" (it reorders some elements in the linked list): https://github.com/halide/Halide/blob/fca8d9606145d0c42caac1954c37d86fa291fea0/src/runtime/profiler_common.cpp#L104-L109 From the sampling-thread, this is called while holding a lock, but from the hardware interrupt timer, the interrupt handlerprofiler_handler()
doesn't acquire the lock, but also eventually goes tobill_func
(via L34): https://github.com/halide/Halide/blob/fca8d9606145d0c42caac1954c37d86fa291fea0/src/runtime/posix_timer_profiler.cpp#L31-L44 I have no clue if this interrupt timer triggers all cores, or just one. If it does all, then for sure this seems like a bad idea.This hypothesis is somewhat-ish confirmed by the fact that changing to the
profiler
feature (as opposed toprofiler_by_timer
), doesn't trigger this issue.