Open jeromeku opened 10 months ago
I would perhaps suggest this video giving an overview of TorchInductor: https://www.youtube.com/watch?v=p13HpZv2S3Q
Another thing you can check out is `TORCH_LOGS="output_code"`, which'll show you the actual Triton kernels that are generated.
Other than that, there is somewhat of a lack of publicly available educational resources on Inductor; hopefully we'll be able to release some at some point.
@Chillee
Thanks! I've been using `TORCH_LOGS=all` to dump the entire compilation process, though this is probably overkill.
It would be instructive to have a tutorial that steps through the compilation pipeline for a simple module, with a focus on the backend lowering / codegen. Lmk if something like this exists already or would be useful to the community.
Will do some more digging around the `inductor` tests to gather digestible bits.
Btw, enjoy your blogposts / tweets on gpu performance :) Hope to see more of these.
FYI: I'm developing a walk-through example of `torch.compile`, although the focus is more on the Dynamo and AOTAutograd side. The detailed working procedure of Inductor is harder to describe; hope I can figure it out someday.
Great blogpost!
Is there any documentation on how `inductor` lowers the `ops` in the `fx graph` to actual kernels -- specifically the optimization / tuning that determines the actual kernel implementations that are codegen'ed?

For example, in the blogpost, you mention that the GEMV kernels generated by `torch.compile` are faster than handwritten / proprietary kernels from cuBLAS and FlashAttention.

I'd like to better understand the lowering passes that enable this: `select_algorithm.py`, `triton_heuristics.py`, the `mm`-specific `kernels` directory within `inductor`, etc., but am having trouble putting it all together.

Any suggestions / resources to illuminate this process would be greatly appreciated.
Thanks!