[CUTLASS] Finish host codegen clean up

Following https://github.com/tlc-pack/relax/pull/442, this PR moves the conv2d host codegen to Python.

This PR also cleans up the header files included in the generated code. Rather than unconditionally including every potentially relevant header, the Python code now returns the list of header files required by a particular kernel, in addition to the generated host code. This lets us add new headers without touching any C++ code, and it may also speed up compilation a bit.
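
To illustrate the idea of returning headers alongside the host code, here is a minimal sketch. It is not the code in this PR: `CodegenResult`, `gen_conv2d_host_code`, `emit_source`, and the `has_residual` flag are hypothetical names, and the header paths are only illustrative of which CUTLASS headers a kernel might need.

```python
from typing import List, NamedTuple


class CodegenResult(NamedTuple):
    """Hypothetical return type: generated host code plus the headers it needs."""

    code: str
    headers: List[str]


def gen_conv2d_host_code(has_residual: bool) -> CodegenResult:
    """Hypothetical generator that picks headers per kernel variant."""
    # Headers assumed to be needed by every conv2d kernel in this sketch.
    headers = [
        "cutlass/cutlass.h",
        "cutlass/conv/device/implicit_gemm_convolution.h",
    ]
    # Only the variant that is actually instantiated pulls in its kernel header.
    if has_residual:
        headers.append("cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h")
    else:
        headers.append("cutlass/conv/kernel/default_conv2d_fprop.h")
    code = "// conv2d host code would be emitted here\n"
    return CodegenResult(code, headers)


def emit_source(result: CodegenResult) -> str:
    """Prepend one #include line per required header to the host code."""
    # dict.fromkeys drops duplicates while preserving insertion order.
    includes = "".join(f"#include <{h}>\n" for h in dict.fromkeys(result.headers))
    return includes + "\n" + result.code
```

In a design like this, the C++ side only has to write out whatever include lines it is handed, so supporting a kernel that needs a new header is a Python-only change.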

With this change, the C++-side codegen is reduced to the bare minimum, and in most cases CUTLASS BYOC can be extended entirely in Python.

@vinx13 @yelite

[CUTLASS] Finish host codegen clean up #448