stmt and stmt_html output are too low level

abadams commented 1 year ago

For example, they show already-compiled PTX assembly for cuda kernels instead of stmt ir, because those have already been offloaded. As more and more of codegen creeps into lowering, this problem will get worse. We need to identify a point in lowering at which the stmt should be preserved for stmt and stmt_html output. I propose just after the custom passes and before hexagon kernel offload.

See #7507

antonysigma commented 1 year ago


Uploading screenshots to illustrate the stmt output, and why it has been too low level since Halide 14.0 (?).

antonysigma commented 1 year ago

For that to happen we'd have to emit the .stmt file before the offload gpu pass, e.g. in Lower.cpp around line 441. I guess we'd have to stash that earlier stmt in the Module somewhere for use in the .stmt outputs.

@maaz139 Other than line 441 of lower.cpp, is there anywhere I can contribute as a Halide beginner? I suppose I can add an optional argument intent=stmt | codegen for a start?

Module lower(const std::vector<Function> &output_funcs, ..., Halide::my_intent_t = Stmt_only);
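To make the "stash that earlier stmt in the Module" idea concrete, here is a minimal sketch using toy stand-ins. `ToyStmt`, `ToyModule`, `ToyPass`, and `toy_lower` are hypothetical names invented for illustration; they are not Halide's real `Stmt`, `Module`, or `lower()`, and the real lowering code is structured differently.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration only; NOT Halide's real classes.
struct ToyStmt {
    std::string text;
};

struct ToyModule {
    ToyStmt lowered;                    // fully lowered form, consumed by codegen
    std::optional<ToyStmt> conceptual;  // earlier snapshot, kept only for stmt output
};

struct ToyPass {
    std::string name;
    std::function<ToyStmt(const ToyStmt &)> run;
};

// Run the passes in order. Just before the pass named `snapshot_before`
// (a hypothetical parameter) runs, copy the current statement into the module.
ToyModule toy_lower(ToyStmt s, const std::vector<ToyPass> &passes,
                    const std::string &snapshot_before) {
    ToyModule m;
    for (const ToyPass &p : passes) {
        if (p.name == snapshot_before && !m.conceptual) {
            m.conceptual = s;  // snapshot taken before offloading rewrites the IR
        }
        s = p.run(s);
    }
    m.lowered = std::move(s);
    return m;
}
```

With a pass list such as `simplify` followed by `offload_gpu`, calling `toy_lower(stmt, passes, "offload_gpu")` keeps the simplified, pre-offload statement in `conceptual` while `lowered` holds the offloaded form.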

mcourteaux commented 1 year ago

I'd like to chime in, and state that I :heart: the PTX code in the Stmt HTML to get a good understanding of what's happening in a single CUDA thread. I examined these for months when working on something for work. Much like how one wants to check the generated assembly for CPU code, you'd want to validate what got generated for the GPU code.

In my opinion, the compilation pipeline should be respected for what it is: a pipeline. The code can be useful to inspect at different stages. Statement IR can be useful, PTX code can be useful. Just like we are able to get a stmt file showing IR and an s file showing the CPU assembly, I believe we should be able to get a PTX output file. I'd imagine .s and .ptx. Or perhaps .x86-64.s and .ptx.s? Maybe there are existing conventions here already that I'm unaware of.

Alternatively, we output the PTX code in the assembly file, but that would cause us to lose the colors, and in general is hard to navigate, as the s file is less verbose and intuitive.

Considering all the discussions regarding the VizIR HTML, the problems it has, and the fact that now PTX code is likely to be another variable, I believe a possible solution could be to have more generator emit options: now there is only stmt and stmt_html. I think the following would be useful:

stmt (plain text stmt)
stmt_html (base mode for html stmt)
stmt_html_vizir (option for stmt_html which will include the vizir)
stmt_html_asm (option for the stmt_html which will include the asm)
stmt_html_ptx (option for the stmt_html which will lower the ir to ptx and include that)

In the above, stmt_html_xxxx options are meaningless without the base mode.

Additionally, to generalize the assembly emit option:

assembly (which will be the CPU instruction set of the Halide target)
assembly_ptx

If this sounds reasonable, I could try implementing it. What I like about this is that it would be backward-compatible. Feedback on this idea?
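A small sketch of the "modifiers are meaningless without the base mode" rule, assuming emit options arrive as a set of strings. The function name `orphaned_html_modifiers` is hypothetical, and the option names are just the ones proposed in this comment, not existing Halide flags.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical validation helper for the proposed emit options.
// Returns every "stmt_html_*" modifier that was requested without
// the base "stmt_html" option; an empty result means the set is consistent.
std::vector<std::string> orphaned_html_modifiers(const std::set<std::string> &emit) {
    const std::string base = "stmt_html";
    const std::string prefix = base + "_";
    std::vector<std::string> orphans;
    if (emit.count(base) != 0) {
        return orphans;  // base mode present, all modifiers are meaningful
    }
    for (const std::string &opt : emit) {
        if (opt.compare(0, prefix.size(), prefix) == 0) {
            orphans.push_back(opt);
        }
    }
    return orphans;
}
```

A generator front end could warn (or fail) when this returns a non-empty list, while existing invocations that only pass stmt or stmt_html keep working unchanged.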

antonysigma commented 1 year ago

Attaching a side-by-side comparison. This is to illustrate the difference among: Halide IR, PTX IR, and assembly outputs.

mcourteaux commented 1 year ago

To make the discussion complete, I'll copy paste my thoughts I posted on the other PR here:

I have been thinking -- like most of you -- quite a bit about this. I believe it just makes sense to indeed pick a few points in the lowering process, stash the IR away into the Module for later, and emit multiple files, each file representing one point in the lowering process.

I'm absolutely not sure what to do with the PTX code and the assembly tab. It is assembly, and therefore could go there, but it's not the assembly of the pipeline. It really is a buffer containing assembly for the GPU. Maybe... Perhaps... It is more sensible to have a "Buffers" tab?

Overall, I believe the HTML-way of doing things is getting convoluted, and I'm thinking of an approach where we just dump a very information-rich IR tree to some file format and have a custom DearImGui-based tool be the visualizer? I don't know what the recently-merged "experimental serializer" is exactly capable of, but maybe piggybacking on that to dump flatbuffers of the IR at a few points of the lowering process can be a nice trick here to build such a tool. @TH3CHARLie Can the serializer you built be used for dumping an IR tree?

steven-johnson commented 1 year ago

Can the serializer you built be used for dumping an IR tree?

Yes, that's pretty much the only thing it can be used for :-)

antonysigma commented 1 year ago

Moving the discussions from the draft PR here. Feel free to clarify if I mis-quoted you.

maaz139: Given that we now have an assembly tab when using the html output, perhaps we can instead simply add a jump [from the gpu_block block in the IR tab] to the corresponding [decoded PTX IRs in the rodata (?) buffer] line in assembly.

@mcourteaux pointed out the PTX code dump in the cuda_gpu_source_kernels indeed is an extension of the LLVM IR. I took a minute to read the LLVM documentation, and yea, indeed it is called the intermediate representation (IR) of Nvidia's GPU code. Thanks @mcourteaux .

@maaz139 I am not sure if the "assembly tab" will visualize the PTX code, even if we add a "jump" button. If I understand the docs correctly, the PTX IR is further "lowered" by LLVM into Nvidia bytecode, then encoded/obfuscated into a buffer entry in the assembly. So, the jump button may be useless here.

maaz139: I think printing any IR other than what is present after all lowering steps may be confusing or misleading (if that is the only output). Perhaps the answer is allowing users to view the IR after each lowering stage or, more realistically, at a few key checkpoints during the lowering pipeline?

Understood. I retracted my draft PR. It does serve my most immediate needs. That is, to reason about the gpu_block and gpu_thread sizes, and to flush out redundant compute by inspecting the 3GL style Halide IR.

Agreed it can be misleading if (other) users' goal, unlike mine, is to cross-reference the lowered IR and the assembly output.

antonysigma commented 1 year ago

mcourteaux : [The presence of buffer cuda_gpu_source_kernels print out in the IR tab] boils down to the other comment of @maaz139. Because that [PTX code] really is the lowered IR.

Perhaps we use the term "assembly tab/code/dump" too broadly, creating confusion. There are more formal terms available: 2GL and 3GL. When I said that PTX, as an IR, looks too low level, I meant that the PTX dump looks too much like 2GL. I was expecting a 3GL-style textual representation of the Halide IR, as in Halide 10.0.

@maaz139 , could you please drive the discussion on the work scope? That is, how low, 2GL or 3GL, should Halide IR be printed as stmt?

In the meantime, I will put in the effort to learn the PTX IR syntax.

(An off-topic opinion: I personally call the Halide language a 4GL, and languages like CVXGEN a 5GL.)

mcourteaux commented 1 year ago

@mcourteaux pointed out the PTX code dump in the cuda_gpu_source_kernels indeed is an extension of the LLVM IR. I took a minute to read the LLVM documentation, and yea, indeed it is called the intermediate representation (IR) of Nvidia's GPU code. Thanks @mcourteaux .

@antonysigma To clear this up: what I meant there is that the IR printed to the HTML file is actually the IR that gets compiled as CPU code. The gpu_kernel_sources is actually already "fully compiled" in the sense that it just gets handed over to the CUDA driver at runtime. The IR surrounding this raw byte-buffer, is the code that actually will pass it to the CUDA driver and perform the kernel launch.

I'm not sure about the LLVM IR PTX you describe... As far as I understood, the buffer contains PTX code, which is NVIDIA specific, and not really anything related to LLVM.

Yes, that's pretty much the only thing it can be used for :-)

@steven-johnson Seems tho, that at this point, the interface only supports dumping a Pipeline. I guess adding an interface that will take a LoweredFunc or Stmt is an easy addition.

steven-johnson commented 1 year ago

Yeah -- it's still experimental, and thus the precise use cases that are sensible to allow (or disallow) are still a bit fluid (and likely will be for a version, at least). Feel free to offer a PR that allows de/serializing fragments.

TH3CHARLie commented 1 year ago

Stmt should be easy, not sure about LoweredFunc though, the serializer was created only to serialize front-end IRs.

maaz139 commented 1 year ago

@maaz139 , could you please drive the discussion on the work scope? That is, how low, 2GL or 3GL, should Halide IR be printed as stmt?

That's a great question. In my personal opinion, the statement files are meant to look like 3GL code. The whole point of printing the Stmt file is to look at code that is not 2GL -- at least in my use cases. I can't really comment on why folks decided to print low-level PTX code inside the stmt files; I was not involved with the code back then, but I can imagine there was a demand for it.

Generally speaking, I agree that different users typically seek out different details. Generating different variations of the stmt_html files is one way to disentangle different "views" into the Halide program and have consistent expectations with each view. I am not sure if there should be dedicated generator flags for each; perhaps we can generate all of the versions whenever a user selects stmt_html as a target. This approach is especially appealing if users can still jump across views without losing where they are in the program.

Overall, I believe the HTML-way of doing things is getting convoluted, and I'm thinking of an approach where we just dump a very information-rich IR tree to some file format and have a custom DearImGui-based tool be the visualizer? I don't know what the recently-merged "experimental serializer" is exactly capable of, but maybe piggybacking on that to dump flatbuffers of the IR at a few points of the lowering process can be a nice trick here to build such a tool. @TH3CHARLie Can the serializer you built be used for dumping an IR tree?

That seems like a fair bit of work but I concur that the HTML based method is not very scalable. I'm happy to contribute on the de-serialization of missing lower-level IR constructs.

mcourteaux commented 1 year ago

I can't really comment on why folks decided to print low-level PTX code inside the stmt files, I was not involved with the code back then but I can imagine there was a demand for it.

@maaz139 I requested this long ago, and eventually contributed this feature in #6444 (buffer in the stmt HTML) and #6447 (syntax highlighting of the buffer). I can maybe answer this by asking a question instead: how do you otherwise review what the generated code looks like? I originally asked about this exact thing in #6410, as I was clueless on how to check what code actually gets run as a CUDA kernel. @abadams helped me out and said that one can set the environment variable HL_DEBUG_CODEGEN=1. This will output the PTX code to stdout while the generator runs. Not at all pleasant to work with. Making the PTX available in the stmt file made sense to me, as that is really what gets compiled.

mcourteaux commented 1 year ago

But I agree that a patch like the one @antonysigma proposed in #7753 looks super useful to get something like in the screenshot above. I don't know how you ever got to that screenshot (supposedly in Halide 10?), because the GPU-specific Stmt IR got already offloaded in the Lowering passes before the Generator could even generate the HTML.

maaz139 commented 1 year ago

Gotcha! That makes sense. I agree that having specialized views into different stages of the pipeline would be a nice solution.

antonysigma commented 1 year ago

I don't know how you ever got to that screenshot (supposedly in Halide 10?), because the GPU-specific Stmt IR got already offloaded in the Lowering passes before the Generator could even generate the HTML.

Yes, the screenshot was created in Halide 10.0. At the time, Halide 14.0 and Halide 10.0 still retained some form of API/ABI compatibility. I exploited that to "switch" between the Halide IR view and the so-called "Halide IR with offloaded PTX" view. I can no longer do that beyond Halide 15.0.

I agree that having specialized views into different stages of the pipeline would be a nice solution.

Great! Are we getting a multi-panel HTML page, or multiple HTML pages featuring various stages? Both are fine with me. Again, my use case is to:

reason about the gpu_block and gpu_thread split sizes in a single GPU kernel; and to
explore compute-cache trade-off by fusing two GPU kernels into one with compute_at methods.

These, I think, will require Halide IR printouts to be in 3GL to be productive.

antonysigma commented 1 year ago

Making the PTX available in the stmt file made sense to me, as that is really what gets compiled.

I am with you on the PTX printout requirement. I know a few companies that demand 2GL program listings for security/high-availability auditing purposes. I simply don't know how popular such unorthodox usage of Halide is. I will defer to those who actually use the PTX printout features.

I can give my two cents on the purpose of the direct PTX/NEON/AVX printouts. I have encountered a few industries that treat the stmt + assembly as the only "acceptable" outputs from the AOT generator. These industries must comply with MISRA or ISO 13485, and thus need a physical person to endorse the code. It is obviously impractical and immoral to hold the AOT generator (a machine) or the Halide project owners (a tool maker) legally liable for commercial loss, or even loss of life, from running Halide code in the instrument.

So, these industries adopt a "cleanroom" design protocol: nothing (Halide generated) gets in, nothing (proprietary) gets out. In other words, employee A programs the AOT generator to produce the stmt_html output, and hands it over to employee B. Employee B, having no knowledge of Halide or AOT, reads the digital/printed program listings and manually types them into a separate computer. This is done so that a physical person (A or B) hired by the company is responsible for any commercial loss.

Yeah, I know such a cleanroom protocol effectively rejects Halide's core design philosophy. Again, I am simply not sure how popular this approach, and this unorthodox use of the stmt_html output, really is.

mcourteaux commented 1 year ago

:exploding_head: Waw... Crazy.

Great! Are we getting a multi-panel HTML page, or multiple HTML pages featuring various stages? Both are fine with me. Again, my use case is to:

reason about the gpu_block and gpu_thread split sizes in a single GPU kernel; and to

explore compute-cache trade-off by fusing two GPU kernels into one with compute_at methods.

These, I think, will require Halide IR printouts to be in 3GL to be productive.

Today I got back into making GPU schedules in Halide, and here I am: struggling with the PTX only code. I think I will one of these days revive the idea in #7753 and make it generate three separate HTML files:

Stmt before offloading + PTX pane + Assembly pane (where I kick the VizIR out the door). Here, the jump-to-assembly buttons should probably also work to jump to the corresponding PTX instead. It seems that LLVM also puts the basic block labels as a comment in the PTX.
Stmt after offloading + Assembly pane (this is what we currently have).
PTX code with the syntax highlighting alone. This one is probably optional, but might be nice to have.

Maybe I'm missing something, but I believe this could be a reasonable solution that offers both the 3GL and 2GL code views. If you have any tips or requests regarding this, please let me know such that I can consider those when working on it.
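For the jump-to-PTX buttons, the HTML generator would need some map from a label to its line in the device listing. Below is a rough sketch of such an index, assuming labels appear on their own line as `name:` optionally followed by a `//` comment. That format is an assumption about the listing, and `index_labels` is a hypothetical helper, not part of Halide.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: map each label-like line ("name:") in an assembly
// listing to its 1-based line number. Trailing "//" comments are ignored.
std::map<std::string, int> index_labels(const std::string &listing) {
    std::map<std::string, int> index;
    std::istringstream in(listing);
    std::string line;
    int line_no = 0;
    while (std::getline(in, line)) {
        ++line_no;
        const auto comment = line.find("//");
        if (comment != std::string::npos) {
            line.erase(comment);  // drop the trailing comment, if any
        }
        const auto first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) {
            continue;  // blank or comment-only line
        }
        const auto last = line.find_last_not_of(" \t\r");
        if (line[last] != ':') {
            continue;  // not a label line
        }
        const std::string name = line.substr(first, last - first);
        if (name.empty() || name.find_first_of(" \t") != std::string::npos) {
            continue;  // labels are assumed to be a single token
        }
        index.emplace(name, line_no);  // the first occurrence wins
    }
    return index;
}
```

The resulting map could then be used to turn a loop in the Stmt pane into a link to the matching line of the device-code pane, provided the label names can be matched up.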

mcourteaux commented 1 year ago

I'm working on this, and now have .conceptual.stmt.html as well as the normal .stmt.html.

Demo of .conceptual.stmt.html:

And .stmt.html:

As you can see, I'm attempting slight background shading to improve high level structure visibility of the program:

Note the dashed line on the left of the hovered collapsible block. Collapsible blocks were replaced with checkboxes and pure CSS.

I'm mostly done with the HTML part of the generation. Will now work on the split pane logic.

Additionally, I added generator emit option ptx_assembly, which generates the PTX file.

Of course, the original PTX buffer is still in the HTML, but by default collapsed:

mcourteaux commented 1 year ago

Still working on this! PR in a few days. :smile:

antonysigma commented 1 year ago

Thanks @mcourteaux . I am looking forward to the PR. My web development skill is 20 years out of date (XHTML 1.0, Backbone.js, MVC-based architecture). But I can help review the UX stuff, verify the Makefile rules, and check for eslint linter warnings.

mcourteaux commented 1 year ago

@abadams @steven-johnson I'm currently working on adding an assembly split pane for "devices" (as opposed to "host" code). I am only familiar thus far with CUDA PTX as a Halide "device". I wonder, for example, if the Hexagon DSP code is considered "host" code in the end, or if it's also treated as a device in Halide.

I see that the OffloadGPULoops.cpp is the main place that I'm familiar with to generate a buffer containing separate LLVM-generated code. So, basically, I'm wondering if there are others, and if my choice of calling the assembly panes "Host Assembly" and "Device Assembly" are reasonable choices of names. Maybe I'd need slightly more generic name such as "Device Code", as I see that Compute Shaders of different kinds are also options, which do not necessarily appear as assembly-like stuff.

Now that I'm thinking about it, not all of those will use LLVM as their backend probably?

abadams commented 1 year ago

I think we can compile hexagon in either mode (it's the host code, or it's device code embedded in a Buffer like PTX). While it would be very cool to be able to see the hexagon assembly, the challenge might be that hexagon is already compiled into binary when it's embedded. For opencl, metal, d3d12 etc, the embedded buffer is shader source code. It would be cool to have that in a pane. I think "device code" is a good name.
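As a rough summary of the cases described in this comment, here is a sketch of how a "device code" pane might decide whether an embedded buffer can be shown directly. The enum, the function names, and the plain-string API keys are all hypothetical, chosen only to mirror the comment; they are not Halide's actual DeviceAPI handling.

```cpp
#include <optional>
#include <string>

// Hypothetical classification of what the embedded device buffer contains.
enum class DeviceCodeForm {
    TextAssembly,  // e.g. PTX for CUDA
    ShaderSource,  // e.g. opencl, metal, d3d12
    Binary         // e.g. hexagon, already compiled when embedded
};

// Returns the form for the APIs named in the comment above,
// or nothing for an API this sketch does not know about.
std::optional<DeviceCodeForm> device_code_form(const std::string &api) {
    if (api == "cuda") return DeviceCodeForm::TextAssembly;
    if (api == "opencl" || api == "metal" || api == "d3d12") {
        return DeviceCodeForm::ShaderSource;
    }
    if (api == "hexagon") return DeviceCodeForm::Binary;
    return std::nullopt;
}

// Text forms can go straight into a pane; a binary would first need a
// textual listing, either disassembled or emitted alongside by the compiler.
bool displayable_as_is(DeviceCodeForm form) {
    return form != DeviceCodeForm::Binary;
}
```

This also suggests why "Device Code" is a safer pane title than "Device Assembly": two of the three forms are not assembly at all.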

mcourteaux commented 1 year ago

While it would be very cool to be able to see the hexagon assembly, the challenge might be that hexagon is already compiled into binary when it's embedded.

Meaning that we'd need to disassemble it first, or ask the compiler to additionally generate an assembly file next to the binary? The result is going to be more human readable if we get the compiler to generate a textual assembly file.

abadams commented 1 year ago

I was thinking we'd ask the compiler to additionally generate the assembly file if possible.

mcourteaux commented 1 year ago

Okay, I'm done with the Stmt HTML stuff. I'm wondering now... Do we want to keep the VizTree stuff? I haven't looked at that yet.

Currently looks like this:

I have added jump-to-device-code buttons as well. Panes are collapsible and resizable.

Overall performance is great, and there is no jQuery or bootstrap used. Only dependency right now is the syntax highlighter for the assembly.

mcourteaux commented 1 year ago

Mostly resolved by #7843 being merged.

halide / Halide

stmt and stmt_html output are too low level #7519