pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

[fbgemm_gpu] HIP support for GenAI ops #2928

Open q10 opened 2 months ago

netlify[bot] commented 2 months ago

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit d0f3f8c96aaa6892f6b7b6d637fd9acbf0fd1921
Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66ac1c6aef7fa10008a5f5f6
Deploy Preview https://deploy-preview-2928--pytorch-fbgemm-docs.netlify.app

mht-sharma commented 2 months ago

@q10 , I believe we also need to integrate or install the Composable Kernel (CK) for the GenAI ops. Were you able to build with CK? If so, could you please share the steps you followed? I’m running into some issues and would greatly appreciate any guidance you can provide.

cc @jeffdaily, I’ve seen similar PR 2610 from you and thought you might have some insights as well.

q10 commented 2 months ago

> @q10 , I believe we also need to integrate or install the Composable Kernel (CK) for the GenAI ops. Were you able to build with CK? If so, could you please share the steps you followed? I’m running into some issues and would greatly appreciate any guidance you can provide.
>
> cc @jeffdaily, I’ve seen similar PR 2610 from you and thought you might have some insights as well.

Ah yes, thanks for the pointer on CK. This work has stalled a bit due to other priorities, but ROCm support for the GenAI ops is a work in progress.

jeffdaily commented 2 months ago

I had to install a sufficiently recent CK to get your branch to build; unfortunately, I didn't note down the commit hash that introduced the CK header file you need. I also had to apply this patch:


diff --git a/fbgemm_gpu/experimental/example/src/nccl_example.cpp b/fbgemm_gpu/experimental/example/src/nccl_example.cpp
index 12bd7201..921f0590 100644
--- a/fbgemm_gpu/experimental/example/src/nccl_example.cpp
+++ b/fbgemm_gpu/experimental/example/src/nccl_example.cpp
@@ -6,7 +6,11 @@
  * LICENSE file in the root directory of this source tree.
  */

+#ifdef USE_ROCM
+#include <rccl/rccl.h>
+#else
 #include <nccl.h>
+#endif

 namespace fbgemm_gpu::experimental {

diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
index 17d46048..72445dda 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
@@ -12,7 +12,9 @@
 #include <numeric>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
index 3a117321..63072bcb 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
@@ -15,7 +15,9 @@
 #include <unordered_map>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
index 6170675a..09a7947b 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
@@ -12,7 +12,9 @@
 #include <numeric>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
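The `CUDAStream.h` swap in the patch above is the substitution that PyTorch's hipify pass would normally make automatically; since these `.hip` files are not run through hipify, the patch applies it by hand. As a rough, illustrative sketch of that kind of source rewrite (the mapping table below is hand-picked to match this patch, not the real hipify dictionary, and the function name is made up for illustration):

```python
# Hand-picked subset of CUDA -> HIP include substitutions, mirroring the
# manual edits in the patch above. The real hipify tool in PyTorch uses a
# much larger mapping table and also rewrites API calls, not just includes.
CUDA_TO_HIP_INCLUDES = {
    "#include <c10/cuda/CUDAStream.h>":
        "#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>",
    "#include <nccl.h>":
        "#include <rccl/rccl.h>",
}

def hipify_includes(source: str) -> str:
    """Rewrite known CUDA include directives to their HIP equivalents,
    leaving all other lines untouched."""
    for cuda_inc, hip_inc in CUDA_TO_HIP_INCLUDES.items():
        source = source.replace(cuda_inc, hip_inc)
    return source

if __name__ == "__main__":
    cpp = "#include <ATen/ATen.h>\n#include <c10/cuda/CUDAStream.h>\n"
    print(hipify_includes(cpp))
```

In the actual patch, the NCCL/RCCL case is handled with an `#ifdef USE_ROCM` guard instead, which keeps the file compilable for both CUDA and ROCm builds without any preprocessing step.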