pytorch / FBGEMM

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

[fbgemm_gpu] HIP support for GenAI ops #2928

Open q10 opened 2 months ago

netlify[bot] commented 2 months ago

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit d0f3f8c96aaa6892f6b7b6d637fd9acbf0fd1921
Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66ac1c6aef7fa10008a5f5f6
Deploy Preview https://deploy-preview-2928--pytorch-fbgemm-docs.netlify.app

mht-sharma commented 2 months ago

@q10 , I believe we also need to integrate or install the Composable Kernel (CK) for the GenAI ops. Were you able to build with CK? If so, could you please share the steps you followed? I’m running into some issues and would greatly appreciate any guidance you can provide.

cc @jeffdaily, I’ve seen similar PR 2610 from you and thought you might have some insights as well.

q10 commented 2 months ago

> @q10 , I believe we also need to integrate or install the Composable Kernel (CK) for the GenAI ops. Were you able to build with CK? If so, could you please share the steps you followed? I’m running into some issues and would greatly appreciate any guidance you can provide.
>
> cc @jeffdaily, I’ve seen similar PR 2610 from you and thought you might have some insights as well.

Ah yes, thanks for the pointer on CK. This work has stalled a bit due to other priorities, but ROCm support for the GenAI ops is a work in progress.

jeffdaily commented 2 months ago

I had to install a sufficiently recent CK to get your branch to build; unfortunately, I didn't note down the commit hash that introduced the CK header file you need. I also had to apply this patch:


diff --git a/fbgemm_gpu/experimental/example/src/nccl_example.cpp b/fbgemm_gpu/experimental/example/src/nccl_example.cpp
index 12bd7201..921f0590 100644
--- a/fbgemm_gpu/experimental/example/src/nccl_example.cpp
+++ b/fbgemm_gpu/experimental/example/src/nccl_example.cpp
@@ -6,7 +6,11 @@
  * LICENSE file in the root directory of this source tree.
  */

+#ifdef USE_ROCM
+#include <rccl/rccl.h>
+#else
 #include <nccl.h>
+#endif

 namespace fbgemm_gpu::experimental {

diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
index 17d46048..72445dda 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip
@@ -12,7 +12,9 @@
 #include <numeric>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
index 3a117321..63072bcb 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_gemm.hip
@@ -15,7 +15,9 @@
 #include <unordered_map>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
diff --git a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
index 6170675a..09a7947b 100644
--- a/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
+++ b/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip
@@ -12,7 +12,9 @@
 #include <numeric>

 #include <ATen/ATen.h>
-#include <c10/cuda/CUDAStream.h>
+// normally hipify does this substitution for us, but this file isn't hipified
+//#include <c10/cuda/CUDAStream.h>
+#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>
 #include <torch/torch.h>

 #if defined(USE_ROCM)
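The `CUDAStream.h` swap in the patch above is the substitution that PyTorch's hipify pass would normally make automatically; since these `.hip` files are not run through hipify, the patch applies it by hand. As a rough, illustrative sketch of that kind of source rewrite (the mapping table below is hand-picked to match this patch, not the real hipify dictionary, and the function name is made up for illustration):

```python
# Hand-picked subset of CUDA -> HIP include substitutions, mirroring the
# manual edits in the patch above. The real hipify tool in PyTorch uses a
# much larger mapping table and also rewrites API calls, not just includes.
CUDA_TO_HIP_INCLUDES = {
    "#include <c10/cuda/CUDAStream.h>":
        "#include <ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h>",
    "#include <nccl.h>":
        "#include <rccl/rccl.h>",
}

def hipify_includes(source: str) -> str:
    """Rewrite known CUDA include directives to their HIP equivalents,
    leaving all other lines untouched."""
    for cuda_inc, hip_inc in CUDA_TO_HIP_INCLUDES.items():
        source = source.replace(cuda_inc, hip_inc)
    return source

if __name__ == "__main__":
    cpp = "#include <ATen/ATen.h>\n#include <c10/cuda/CUDAStream.h>\n"
    print(hipify_includes(cpp))
```

In the actual patch, the NCCL/RCCL case is handled with an `#ifdef USE_ROCM` guard instead, which keeps the file compilable for both CUDA and ROCm builds without any preprocessing step.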