Open jeromeku opened 10 months ago
I would perhaps suggest this video giving an overview of TorchInductor: https://www.youtube.com/watch?v=p13HpZv2S3Q
Another thing you can check out is `TORCH_LOGS="output_code"`, which'll show you the actual Triton kernels that are generated.
Other than that, there is somewhat of a lack of publicly available educational resources on Inductor; hopefully we'll be able to release some at some point.
@Chillee
Thanks! I've been using `TORCH_LOGS=all` to dump the entire compilation process, though this is probably overkill.
It would be instructive to have a tutorial that steps through the compilation pipeline for a simple module, with a focus on the backend lowering / codegen. Lmk if something like this exists already or would be useful to the community.
Will do some more digging around the `inductor` tests to gather digestible bits.
Btw, enjoy your blogposts / tweets on gpu performance :) Hope to see more of these.
FYI: I'm developing a walk-through example of `torch.compile`, although the focus is more on the Dynamo and AOTAutograd side. The detailed working procedure of Inductor is harder to describe; hope I can figure it out someday.
Great blogpost!
Is there any documentation on how `inductor` lowers the `ops` in the `fx graph` to actual kernels -- specifically the optimization / tuning that determines the actual kernel implementations that are codegen'ed?

For example, in the blogpost, you mention that the GEMV kernels generated by `torch.compile` are faster than handwritten / proprietary kernels from cuBLAS and FlashAttention.

I'd like to better understand the lowering passes that enable this: `select_algorithm.py`, `triton_heuristics.py`, the `mm`-specific `kernels` directory within `inductor`, etc., but am having trouble putting it all together.

Any suggestions / resources to illuminate this process would be greatly appreciated.
Thanks!