halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Debug symbols #152

Closed jrk closed 11 years ago

jrk commented 11 years ago

Chasing down more bugs in autotuner-generated schedules, I wonder: is it high time we looked into adding debug symbols to the codegen?

I haven't thought seriously about this recently, but the last time I did, it seemed to require some design work to figure out how we want to map the scheduled code structure onto LLVM's debug representation.

Thoughts, @abadams @zvookin @steven-johnson?

abadams commented 11 years ago

I looked at it recently and gave up, because most of the debug metadata requires things we don't have access to, e.g. file and line numbers. It was a really awkward fit that didn't look like it would provide enough value for me to continue. The one thing I decided would be beneficial is tying LLVM variable names, which are already somewhat meaningful for us, to debug symbol names, so that e.g. in gdb I could ask for the value of f.x to see where I am in the loop.
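
For concreteness, here is a rough sketch of what wiring a Halide loop variable like f.x into DWARF via LLVM's DIBuilder might look like. This is not existing Halide code, the file/line entries are placeholders, and the exact DIBuilder signatures vary across LLVM versions, so treat it as an illustration of the idea rather than a drop-in patch.

```cpp
// Rough sketch (not current Halide code): attach a Halide loop variable such
// as "f.x" to DWARF debug info via LLVM's DIBuilder, so gdb can print it.
// DIBuilder signatures differ between LLVM versions.
#include "llvm/BinaryFormat/Dwarf.h"
#include "llvm/IR/DIBuilder.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

void attach_loop_var_debug_info(llvm::Module &module,
                                llvm::Function &fn,
                                llvm::IRBuilder<> &builder,
                                llvm::Value *loop_var /* the value holding f.x */) {
    llvm::DIBuilder dib(module);

    // We have no real source file, so point the debug info at a placeholder.
    llvm::DIFile *file = dib.createFile("halide_generated", ".");
    llvm::DICompileUnit *cu = dib.createCompileUnit(
        llvm::dwarf::DW_LANG_C_plus_plus, file, "Halide", /*isOptimized=*/true,
        /*Flags=*/"", /*RuntimeVersion=*/0);

    // A subprogram entry for the generated pipeline function.
    llvm::DISubroutineType *fn_ty =
        dib.createSubroutineType(dib.getOrCreateTypeArray({}));
    llvm::DISubprogram *sp = dib.createFunction(
        cu, fn.getName(), fn.getName(), file, /*LineNo=*/1, fn_ty,
        /*ScopeLine=*/1, llvm::DINode::FlagZero,
        llvm::DISubprogram::SPFlagDefinition);
    fn.setSubprogram(sp);

    // Describe "f.x" as a 32-bit signed variable and bind it to loop_var.
    llvm::DIType *i32 = dib.createBasicType("int32", 32, llvm::dwarf::DW_ATE_signed);
    llvm::DILocalVariable *var =
        dib.createAutoVariable(sp, "f.x", file, /*LineNo=*/1, i32);
    dib.insertDbgValueIntrinsic(
        loop_var, var, dib.createExpression(),
        llvm::DILocation::get(module.getContext(), 1, 0, sp),
        builder.GetInsertBlock());

    dib.finalize();
}
```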

abadams commented 11 years ago

Also valgrind + HL_TRACE=3 is a really great combo for diagnosing occasional segfaults in generated code. The last few segfault bugs didn't reproduce for me, but valgrind immediately pointed me to the problem. E.g. the reorder_storage call was a red herring. It just made the bad reads more likely to cross a page boundary.

abadams commented 11 years ago

One last point: Matt figured this out for ispc, so that can act as sample code for us.

jrk commented 11 years ago

Indeed, the awkward semantic match is what I remember from looking before.

The other thing I'm finding (admittedly only when chasing what are mostly actual compiler bugs) is that I'd like to know where the hell I am when I hit a segfault or bus error. Trace helps with this, but not in every situation. (Among other things, I currently have a job still running after an hour waiting for the segfault to trigger with trace set high enough and piped through tail.) Valgrind is a good suggestion.

The thing that always comes up first, though, is how useful it would be to have some sort of call stack. HL_TRACE=3 usually provides enough information to reconstruct one, but when the world is breaking it's hard to trust, and it's also quite slow and obtuse to a newcomer (which includes me every time I have to remember what the print format is). An actual call stack for the failing instruction would be more precise in the case of weird failures, easier to parse, and more approachable. This would be even more useful for people writing extern image functions.

All that said, is the only way to get call stack information out to emit actual function calls with real frames, etc.?

abadams commented 11 years ago

What's the equivalent of a call stack? Produce/consume nodes? We could augment codegen to set some kind of debugging context as we enter and exit them perhaps, but I'm not sure we can correctly replicate what gdb understands as the call stack. HL_TRACE=2 already emits info on every produce/consume barrier, so maybe tracing should just be better.
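
As a straw man for what "set some kind of debugging context" could mean, here is a hypothetical sketch (all names invented) of a per-thread produce/consume stack that codegen-emitted runtime calls would maintain, and that a debugger or crash handler could dump:

```cpp
// Hypothetical sketch: a per-thread "produce/consume stack" maintained by
// runtime calls that codegen inserts at the entry and exit of each
// produce/consume node. It gives a crash handler (or a user in gdb)
// something to read, even without real call frames.
#include <cstdio>

namespace {
constexpr int kMaxDepth = 64;
thread_local const char *halide_context_stack[kMaxDepth];
thread_local int halide_context_depth = 0;
}

extern "C" void halide_debug_push_context(const char *func_name) {
    if (halide_context_depth < kMaxDepth) {
        halide_context_stack[halide_context_depth] = func_name;
    }
    halide_context_depth++;
}

extern "C" void halide_debug_pop_context() {
    if (halide_context_depth > 0) halide_context_depth--;
}

extern "C" void halide_debug_dump_context() {
    int depth = halide_context_depth < kMaxDepth ? halide_context_depth : kMaxDepth;
    for (int i = depth - 1; i >= 0; i--) {
        fprintf(stderr, "  producing %s\n", halide_context_stack[i]);
    }
}
```

Codegen would emit a push with the Func name at the top of each produce node and the matching pop at its end, so the stack mirrors the nesting of what is currently being computed.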

jrk commented 11 years ago

Yes, though I realize it may be awkwardly related to the actual produce-consume structure. Logically, at any point, you're computing something, in the process of computing something else, and recursively so up to the currently enclosing root function. (Or, in the case of a reduction, the same is true, but there are two sequential functions: the initialization and the update.)

All that said, I'm ok if the pragmatic answer is simply "make tracing better (as feasible)."

zvookin commented 11 years ago

I expect attaching information to the variable names will help a lot in getting context and is easier to do. (In fact the info is sort of already there in the assembly, but I don't think it gets pushed through to DWARF or similar.)

If all or most of the conceptual variables have a representation in the debug info, the first step in answering "Where did this crash?" becomes "What are the variables that are in scope?" Since the variables contain the Func name and loop index, this can be used to figure out where the crash is relative to the Halide source code. If we can get some ordering so we can tell which loop induction variables are inner and which are outer, then it is likely possible to reconstruct a rough stack trace from that data, even though we don't really have a stack of program counter values and it isn't clear if program counters even correspond to Funcs particularly well.

Part of the above depends on the debug info being able to represent Halide variables that don't trivially map to a register or memory location. I believe modern debug file formats can do this to some extent. (E.g. express a logical variable as an expression of some set of registers and memory locations.) Obviously this does not help if the entire concept of the variable disappeared from the lowered code, but I'm having trouble thinking of a case where that really happens and is important.

As we deal with parallelism more, it would probably be good to set something like a thread name per thread or task as well.

Re: tracing, I will once again mention dtrace as my ideal model for this. Alas, I've given up on actually using dtrace itself, but the relevant characteristics are: 1) really fast, threadsafe logging to in-memory buffers; 2) some data reduction before persisting to more permanent storage; 3) a query language that drives both the reduction in step 2 and efficient analysis of the final data.

If using JIT, we can also take advantage of dtrace's "only pay the cost if actually tracing" property as well. Ideally that would involve a tracing language rather than just a tracing level. Making this actually productive certainly requires some thinking, but it also requires a bunch of simple tooling type stuff like dealing with getting traces off mobile devices, etc.

The data reduction in step 2 above may or may not be a good idea for cases like Jonathan is discussing where lots of generated tests are being run and a crash may not be reproducible. But one can still run with lots of tracing enabled and do the analysis only if there is a crash. The advantage would be that tracing is a heck of a lot faster than text and printf and the trace logs are smaller. (Anyone else think it is silly that emacs warns you when opening a 250MB file these days :-))

-Z-

jrk commented 11 years ago

I like this on all fronts.

As I've said before, I'm also a big fan of DTrace as a model.

It feels a bit like the trace filtering, etc. can quickly become a whole research project unto itself (along the lines of this: http://dl.acm.org/citation.cfm?doid=1073204.1073213), which is intimidating but also potentially interesting.

zvookin commented 11 years ago

Yes, DTrace is a lot more lovable when it already exists compared to when one has to write it from scratch :-)

I'll throw out a concrete proposal for others to poke holes in:

Trace messages have a fixed-size binary format:

  1. Type field (may not be necessary)
  2. Timestamp
  3. Thread or task ID
  4. A context ID which can be used to group multiple trace messages
  5. Fixed-size list of values along dimensions (e.g. x, y, z)
  7. A single pointer value, for e.g. logging host ptr, malloc/free, etc.
  8. Func or context name; must be a constant, zero-terminated string
  9. Pointer to a constant, zero-terminated string, which may be a format string using the above values

As an example of use, a loop start would write a message with the extents in field 5 and each trip could write a message with the current values in field 5. The type field would distinguish the two. (One of course wants some way to control the granularity of writing trip messages.)

This should be doable in 64 bytes. (Fields 1-5 are 32-bit values, with six dimension slots in field 5; fields 7-9 are 64-bit values. One can argue some of the 32-bit values should be 64-bit. Compression is also definitely possible and might be worth considering if it makes common cases guaranteed to fit in memory for entire runs.)
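
As a concrete illustration of the layout (not an implemented format), the record might look like the struct below. Field numbers follow the list above (field 6 is unused there), and the exact names and widths are up for debate:

```cpp
// Illustrative sketch of the proposed fixed-size trace record.
// On a typical 64-bit target this packs to exactly 64 bytes.
#include <cstdint>

struct TraceRecord {
    uint32_t type;        // 1) e.g. loop-start, loop-trip, load, store, malloc, free
    uint32_t timestamp;   // 2) see the timestamp discussion below; may need 64 bits
    uint32_t thread_id;   // 3) thread or task ID
    uint32_t context_id;  // 4) groups related trace records
    int32_t  coords[6];   // 5) values along up to six dimensions (x, y, z, ...)
    uint64_t pointer;     // 7) e.g. host ptr for malloc/free or load/store
    const char *name;     // 8) Func or context name (constant, zero-terminated)
    const char *format;   // 9) optional constant format string using the values above
};

static_assert(sizeof(TraceRecord) == 64, "fits the proposed 64-byte budget");
```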

Logging is to per-thread buffer blocks. Locking only happens when the current buffer is exhausted. The system maintains a list of fixed-size buffer blocks. At crash time, there must be cleanup code to find and flush all the pending buffer data, which is kind of ugly, system-dependent code, but it should be doable. (There is a single thread which aggregates trace buffers from execution threads and either writes them to disk or a socket.)

Making sure all traces get synced on a crash feels like the major weakness of the design, but I can't think of a way to get performance if we synchronously write trace messages. One possibility is to write the trace logs into a memory mapped area that is also held by another process which does the log aggregation. That is also pretty system dependent. The literature probably has lots of ideas here :-)
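
A minimal sketch of the per-thread buffering described above, assuming a single aggregator thread drains the full blocks (omitted here); the lock is only touched on the rare block-swap path:

```cpp
// Sketch of per-thread trace buffering (assumed design, not existing code).
// Uses the TraceRecord struct sketched above. Each thread appends to a
// private block and only takes the global lock when its block fills up and
// must be handed to the aggregator thread.
#include <cstddef>
#include <mutex>
#include <vector>

constexpr size_t kRecordsPerBlock = 4096;  // 256 KB per block at 64 bytes/record

struct TraceBlock {
    TraceRecord records[kRecordsPerBlock];
    size_t count = 0;
};

std::mutex block_list_mutex;
std::vector<TraceBlock *> full_blocks;   // drained by the single aggregator thread
std::vector<TraceBlock *> free_blocks;   // recycled blocks

thread_local TraceBlock *current_block = nullptr;

void trace_emit(const TraceRecord &r) {
    if (!current_block || current_block->count == kRecordsPerBlock) {
        std::lock_guard<std::mutex> lock(block_list_mutex);   // rare path
        if (current_block) full_blocks.push_back(current_block);
        if (!free_blocks.empty()) {
            current_block = free_blocks.back();
            free_blocks.pop_back();
            current_block->count = 0;
        } else {
            current_block = new TraceBlock;
        }
    }
    current_block->records[current_block->count++] = r;       // common path, no lock
}
```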

On the back side, there is a small app with a simple query language that can do basic analysis from the above and can either run on files or be a server accepting packets over a socket. (The file format and socket protocol have to be a bit more than just a sequence of 64-byte packets as they have to record the constant strings the first time they appear in the stream. One could insist that the strings are registered all at once at initialization time, but I'd rather have the recording thread maintain a map of them and write a special message when a new string is introduced.)

The first two clients here would probably be HalideProf and the visualizer project, though perhaps it is quick and easy enough to do to support Jonathan's current work. Analyzing the data well likely depends on the timestamps being pretty accurate across threads. RDTSC does this well on x86 (though it requires a 64-bit field, contrary to what I said above). This turns out to be tricky on ARM, as the cycle counter is often not readable from user space. Other solutions may be slow or otherwise problematic (e.g. gettimeofday, a single global counter that doesn't correspond to elapsed time but captures ordering, or per-thread ordering counters that don't maintain ordering across threads).
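
For illustration, one way the timestamp field could be sourced, assuming GCC/Clang on x86-64 and falling back to std::chrono elsewhere:

```cpp
// Illustrative timestamp source: RDTSC on x86-64 (GCC/Clang), std::chrono
// elsewhere. As noted above, RDTSC needs a 64-bit field, and on ARM the
// cycle counter is often not readable from user space, so the portable
// fallback may be all we get there.
#include <chrono>
#include <cstdint>
#if defined(__x86_64__)
#include <x86intrin.h>   // for __rdtsc(); MSVC would use <intrin.h> instead
#endif

inline uint64_t trace_timestamp() {
#if defined(__x86_64__)
    return __rdtsc();
#else
    return (uint64_t)std::chrono::steady_clock::now().time_since_epoch().count();
#endif
}
```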

I think we can even get useful tracing from code running on GPUs, but it requires more thinking. (PTX is likely powerful enough to do this and it can support performance analysis. Not sure how it will work for crash analysis, but I'm already looking at taking code from cuda-gdb to at least get a minimal indication of where we crashed and that API does allow getting events back from the GPU. Question is whether we can recover the contents of an allocated block from a context that has experienced a bad memory access in a kernel.)

One possible extension here would be a way to write a buffer_t and its contents to the trace log. This would replace debug_to_file. (Implementation would likely have a secondary place to write the image data and write a standard message that points to the data.)

-Z-

abadams commented 11 years ago

I'm currently adding some better runtime hooks for tracing. I'll make sure enough information gets passed to the functions to implement your proposal. It all looks doable, though some parts may be tricky (e.g. computing the correct pointer for loads and stores is annoying given how tracing currently integrates into lowering). I'll see how it shakes out.

zvookin commented 11 years ago

I was hoping to log pointers at very coarse granularity, e.g. when Allocation objects are constructed and destructed, so that when a crash occurs, previous trace data can be used to place the faulting address with respect to allocated memory. (E.g. one writes a "keep track of extant and historic memory regions" function in the query language.)

The above model would require reconstructing the length of the allocation from the extents in the trace message. We could also add another message type dependent 64-bit value I suppose.
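
A hypothetical sketch of the "keep track of extant memory regions" analysis, written as ordinary C++ over decoded malloc/free trace events rather than in any query language:

```cpp
// Hypothetical post-mortem analysis over the trace log (not part of Halide):
// track allocation/free events, then place a faulting address relative to the
// regions that were live at the time of the crash.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

struct Allocation { uint64_t base, size; std::string name; };

std::map<uint64_t, Allocation> live;  // keyed by base address

void on_malloc_event(uint64_t base, uint64_t size, const std::string &name) {
    live[base] = {base, size, name};
}

void on_free_event(uint64_t base) {
    live.erase(base);
}

void locate_fault(uint64_t addr) {
    for (const auto &kv : live) {
        const Allocation &a = kv.second;
        if (addr >= a.base && addr < a.base + a.size) {
            printf("fault at %llx is inside %s\n",
                   (unsigned long long)addr, a.name.c_str());
            return;
        }
        if (addr >= a.base + a.size && addr < a.base + a.size + 4096) {
            printf("fault at %llx is just past the end of %s\n",
                   (unsigned long long)addr, a.name.c_str());
            return;
        }
    }
    printf("fault at %llx is not near any live allocation\n",
           (unsigned long long)addr);
}
```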

-Z-

jrk commented 11 years ago

Tiny clarification:

8. Func or context name; must be a constant, zero-terminated string

This would presumably be a pointer to a global string, just like the format string in (9), as well, right?

jrk commented 11 years ago

Also conceivably relevant, from the creator of Protocol Buffers: http://kentonv.github.io/capnproto/. C++ is the first-class language target, but Python is also well-supported already. The only gotcha is that the runtime requires C++11, which, as ever, is a pain on Windows.

zvookin commented 11 years ago

Yes, the strings must be constant and have a lifetime that covers all calls to the logging package. We probably want to add a call to reset the map that tracks them, though, to deal with JIT compilation and dynamic loading.

The idea is the log writer can write the string to the persisted log the first time the pointer value appears and then write the pointer value itself from then on. Or the log format can expand to variable length records with the full string.

Note I'm assuming the pointer to string will only be checked and expanded in a single-threaded log writer. Otherwise we have to have a concurrent map, which is likely a bottleneck.
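
A sketch of that single-threaded writer-side interning, with a made-up "string definition" record layout:

```cpp
// Sketch of the single-threaded log writer's string handling (assumed design):
// the first time a constant string pointer appears, persist its contents in a
// special record; after that, the 8-byte pointer value alone is enough.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unordered_set>

std::unordered_set<const char *> seen_strings;  // touched only by the writer thread

void persist_string_if_new(FILE *log, const char *s) {
    if (s && seen_strings.insert(s).second) {
        // Hypothetical "string definition" record: marker, pointer, length, bytes.
        uint32_t marker = 0xFFFFFFFF;
        uint64_t ptr = (uint64_t)(uintptr_t)s;
        uint32_t len = (uint32_t)strlen(s);
        fwrite(&marker, sizeof(marker), 1, log);
        fwrite(&ptr, sizeof(ptr), 1, log);
        fwrite(&len, sizeof(len), 1, log);
        fwrite(s, 1, len, log);
    }
}
```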

-Z-

jrk commented 11 years ago

Crazy (and potentially impractical) idea: using this infrastructure, could we provide a debug mode (selected at compilation or load time, perhaps only available in the JIT) which:

  1. Registered signal handlers for memory errors and the like.
  2. Similarly hooked the assert path.
  3. Tracked just enough tracing information to print a user-friendly, stack-trace-like report showing the user where things blew up, if they did (and then still exited with an error code).

I'm wondering if we could provide something approaching the out-of-the-box approachability of Python/Java/whathaveyou when there are errors.
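
A rough sketch of what the crash-handler half of such a debug mode could look like (hypothetical; real code would need to worry about async-signal safety, and about Windows, where SIGBUS doesn't exist):

```cpp
// Rough sketch of the proposed debug mode's crash handler (hypothetical):
// install handlers for memory errors and dump whatever produce/consume
// context the runtime has been tracking. fprintf is not async-signal-safe;
// a real implementation would format into a preallocated buffer and write().
#include <csignal>
#include <cstdio>
#include <cstdlib>

extern "C" void halide_debug_dump_context();  // e.g. the context-stack sketch above

static void halide_crash_handler(int sig) {
    fprintf(stderr, "Halide pipeline crashed with signal %d while:\n", sig);
    halide_debug_dump_context();
    std::_Exit(1);  // still exit with an error code
}

void halide_install_debug_handlers() {
    std::signal(SIGSEGV, halide_crash_handler);
#ifdef SIGBUS
    std::signal(SIGBUS, halide_crash_handler);
#endif
    std::signal(SIGFPE, halide_crash_handler);
}
```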

Counter argument: these sorts of errors shouldn't be visible to end-users most of the time; they should be viewed as compiler bugs, which we should be debugging, and we don't need as much hand-holding.

Counter-counter argument, 1: some of these bugs are fundamentally user errors (exhausting memory or similar resources).

Counter-counter argument, 2: from a purely practical perspective, so long as there continue to be explosions, providing some more detailed feedback makes it seem less unfriendly to new users, and gives them something much more useful to report as a bug.

abadams commented 11 years ago

Funnily enough, I was also thinking about this (you know how much I like signal handlers). I was planning to write a test that uses tracing to do this as a proof-of-concept, once the new tracing stuff is done. I'm halfway through it, but have shelved it for now to work on the new performance regression.

abadams commented 11 years ago

The new tracing stuff is all in, and there's even a test that prints a produce/consume stack trace on a segfault. It's not debug symbols, but I think I should mark this as closed.