Following https://github.com/tlc-pack/relax/pull/442, this PR moves the conv2d host codegen to Python.
Also cleaned up the header files included in the generated code. Rather than blindly including every potentially relevant header, the Python code now returns the list of headers required by a particular kernel, in addition to the generated host code. This lets us add new headers without touching any C++ code, and it may also speed up compilation a bit.
Now the C++-side codegen is truly bare-minimum, and extending CUTLASS BYOC can be done entirely in Python (in most cases).
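To illustrate the idea, here is a rough sketch of what "return the headers together with the host code" could look like on the Python side. This is not the code in this PR: `CodegenResult`, `emit_conv2d_host_code`, `render_includes`, and `assemble_source` are hypothetical names, and the header paths are only illustrative choices.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodegenResult:
    """Hypothetical container: generated host code plus the headers it needs."""

    code: str
    headers: List[str] = field(default_factory=list)


def emit_conv2d_host_code(func_name: str, with_relu: bool = False) -> CodegenResult:
    """Illustrative stand-in for a per-kernel host code generator.

    Only the headers this particular kernel variant needs are reported.
    The header paths below are examples, not the list used by the PR.
    """
    headers = [
        "cutlass/cutlass.h",
        "cutlass/conv/device/implicit_gemm_convolution.h",
    ]
    if with_relu:
        # Extra epilogue header is requested only by the variant that uses it.
        headers.append("cutlass/epilogue/thread/linear_combination_relu.h")

    code = (
        f"void {func_name}() {{\n"
        "  // kernel instantiation and launch would go here\n"
        "}"
    )
    return CodegenResult(code=code, headers=headers)


def render_includes(headers: List[str]) -> str:
    """Turn a header list into #include lines, dropping duplicates in order."""
    seen: List[str] = []
    for header in headers:
        if header not in seen:
            seen.append(header)
    return "\n".join(f"#include <{header}>" for header in seen)


def assemble_source(result: CodegenResult) -> str:
    """What a thin consumer would do: prepend the includes to the host code."""
    return render_includes(result.headers) + "\n\n" + result.code
```

With a shape like this, the consumer only concatenates strings, so adding a new header means editing the Python emitter and nothing else.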
@vinx13 @yelite