mit-han-lab / torchsparse

[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
https://torchsparse.mit.edu
MIT License

About Sparse Kernel Generator in TorchSparse++ paper #274

Closed: 99DHL closed this issue 6 months ago

99DHL commented 7 months ago

Thank you for your great work! I came across TorchSparse++ (MICRO '23) and really enjoyed reading the paper. I have several questions about the sparse kernel generator introduced in it. According to the paper, the sparse kernel generator is a code generator that integrates on-chip MMA subroutines from TVM directly at the source-code level. I am curious how this is possible. Could you provide more details about this code generator? Which parts of the kernel are auto-generated and which parts are hand-written? Do I have to make changes to TVM to obtain on-chip MMA subroutines that can be used at the source-code (CUDA) level? If so, could you provide the implementation of your code generator?

zhijian-liu commented 6 months ago

@ys-2020, could you please take a look at this problem when you have time? Thanks!

ys-2020 commented 6 months ago

Hi @99DHL , thank you very much for your interest! We used the TVM GEMM template to obtain the on-chip MMA subroutines, which correspond to L159 - L245 in the kernel. Starting from those MMA subroutines, we rewrote the DRAM access pointers rather than the MMA instructions to support sparse convolution.
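
For readers trying to picture what "rewriting the DRAM access pointers" means, here is a reference-level sketch in NumPy (not the actual CUDA kernel). For each weight offset, the sparse convolution gathers input rows through a kernel map, runs a dense (MMA-friendly) GEMM, and scatter-adds the result into the output rows; in the fused implicit-GEMM kernel, the gather and scatter are folded into the tile loaders' address computation instead of being materialized as below. The names `in_map`, `out_map`, and `kmaps` are hypothetical.

```python
import numpy as np

def sparse_conv_reference(in_feats, weights, kmaps, n_out):
    """Reference gather-GEMM-scatter sparse convolution.

    in_feats: [N_in, C_in] input features
    weights:  [K_vol, C_in, C_out] one weight slice per kernel offset
    kmaps:    list of (in_map, out_map) index arrays, one pair per offset
    n_out:    number of output points
    """
    out = np.zeros((n_out, weights.shape[2]), dtype=in_feats.dtype)
    for k, (in_map, out_map) in enumerate(kmaps):
        if len(in_map) == 0:
            continue
        gathered = in_feats[in_map]        # gather: the indirected DRAM reads
        partial = gathered @ weights[k]    # dense GEMM, where MMA subroutines apply
        np.add.at(out, out_map, partial)   # scatter-accumulate into outputs
    return out
```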

99DHL commented 6 months ago

Thank you for your kind response! If possible, could you please point me to the TVM GEMM template you used, such as relevant links or documentation pages?

ys-2020 commented 6 months ago

Hi! I think we just followed the TVM documentation to write the GEMM template and generate the PTX for the MMA subroutines. The major logic of our conv kernel was redesigned.
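
As a minimal sketch of that flow (assuming a TVM version with the `te`/schedule API; shapes, tile sizes, and thread bindings here are illustrative, not the authors' template): define a dense fp16 GEMM, build it for CUDA, and dump the generated device code, which is the kind of source one could then edit at the source level. A real template would add shared-memory staging and tensor-core tensorization as in the TVM tensor-core tutorials.

```python
import tvm
from tvm import te

M, N, K = 128, 128, 128
A = te.placeholder((M, K), name="A", dtype="float16")
B = te.placeholder((K, N), name="B", dtype="float16")
k = te.reduce_axis((0, K), name="k")
C = te.compute(
    (M, N),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
i, j = s[C].op.axis
io, ii = s[C].split(i, factor=16)
jo, ji = s[C].split(j, factor=16)
s[C].bind(io, te.thread_axis("blockIdx.y"))
s[C].bind(jo, te.thread_axis("blockIdx.x"))
s[C].bind(ii, te.thread_axis("threadIdx.y"))
s[C].bind(ji, te.thread_axis("threadIdx.x"))

func = tvm.build(s, [A, B, C], target="cuda")
# Depending on how TVM compiled the module, this prints CUDA C or PTX text,
# which can serve as the starting point for source-level rewrites.
print(func.imported_modules[0].get_source())
```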

ys-2020 commented 6 months ago

Closing this issue as completed. Feel free to reopen it if you have any further questions.

getianao commented 4 months ago

@ys-2020 Sorry for jumping in on a closed issue, but I wanted to ask about something related. Does the fetch-on-demand dataflow also work with the on-chip MMA subroutines generated from TVM? I noticed the paper mentions that "Similar analysis and code transformation can also be applied to the fetch-on-demand dataflow," but it seems that only the implicit GEMM implementation uses the generated kernel.