microsoft / antares

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.

Is ROCm no longer supported by 0.9.x? #374

Open Lookforworld opened 11 months ago

Lookforworld commented 11 months ago

@ghostplant I've been trying to run ROCm on WSL and couldn't find a good way, but I finally found this project and saw a silver lining. I want to try version 0.9.x but can't find a whl that supports ROCm. So I installed version 0.3.x instead: with BACKEND=c-rocm it reports that the GPU cannot be found, and with BACKEND=c-rocm_win64 it fails with the error "/home/root001/miniconda3/lib/python3.11/site-packages/antares_core/backends/c-rocm_win64/../../graph_evaluator/run_graph.cpp:14:29: error: ‘memalign’ was not declared in this scope 14 | void *data_ptr = (void*)memalign(256, length);". I don't know where to start fixing this. Is there an official guide that describes the correct steps? 😒

ghostplant commented 11 months ago

0.9.x is a new implementation that works on Windows without even needing WSL, while the old features are all kept. So you can still follow "Path 1" to run ROCm custom kernels. The problem you hit looks like a C++ compatibility issue; let me fix it.
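For readers who hit the same `memalign` error: `memalign` is a non-standard function, so whether it is declared depends on the platform and on which headers are included. The following is a minimal sketch of a portable aligned allocation, not the fix that was actually committed to Antares. The helper names `aligned_alloc_portable` and `aligned_free_portable` are hypothetical, and the sketch assumes either a POSIX system (for `posix_memalign`) or the Microsoft C runtime (for `_aligned_malloc`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)
#if defined(_WIN32)
#include <malloc.h>  // _aligned_malloc, _aligned_free (Microsoft CRT)
#endif

// Hypothetical helper, not part of Antares: returns `length` bytes whose
// address is a multiple of `alignment`, or nullptr on failure.
inline void* aligned_alloc_portable(std::size_t alignment, std::size_t length) {
  // posix_memalign requires a power of two that is a multiple of sizeof(void*).
  assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
  return _aligned_malloc(length, alignment);
#else
  void* ptr = nullptr;
  if (posix_memalign(&ptr, alignment, length) != 0) return nullptr;
  return ptr;
#endif
}

// Releases memory obtained from aligned_alloc_portable.
inline void aligned_free_portable(void* ptr) {
#if defined(_WIN32)
  _aligned_free(ptr);
#else
  free(ptr);
#endif
}
```

With these helpers, the failing line from the error message could be written as `void *data_ptr = aligned_alloc_portable(256, length);`, paired with `aligned_free_portable(data_ptr)` when the buffer is no longer needed.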

ghostplant commented 11 months ago

BTW, would you consider switching to DirectX on Windows instead of ROCm on Windows?

Lookforworld commented 11 months ago

@ghostplant Thanks for your reply. I have successfully installed ROCm and the AMD SDK on Windows, and I have successfully compiled llama.cpp there, but I have not been able to use ROCm in WSL. I need it because many of the libraries I want to use do not support Windows or DirectX. The Antares I installed with "Path 1" has always been 0.3.x instead of 0.9.x. I want to see whether 0.9.x can work; what should I do? My device is gfx1100 (7900 XTX).

ghostplant commented 11 months ago

I just pushed a commit to fix the error you reported. Did you install the AMD SDK released for Win32, or the HIP release for Linux inside WSL?

Lookforworld commented 11 months ago

@ghostplant Thanks! My HIP in WSL:

HIP version: 5.7.31921-d1770ee1b
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.0 23352 d1e13c532a947d0cbfc94759c00dcf152294aa13)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.7.0/llvm/bin

The Antares I installed with "Path 1" has always been 0.3.x instead of 0.9.x. How do I install it if I want to use 0.9.x with ROCm?

ghostplant commented 11 months ago

v0.9.x is a re-implementation with no WSL dependency. It may take time to merge a lot of features from v0.3.x. I suggest using DirectX with v0.9.x for now, since the two are similar in performance.

Lookforworld commented 11 months ago

@ghostplant Thanks very much! How can I install the fixed version now?

ghostplant commented 11 months ago

The fix has just been published to PyPI. You can install antares >= 0.3.24.0 with pip to get it: https://pypi.org/project/antares/#files

Lookforworld commented 11 months ago

Thanks for your kind reply! I'll try it!

Lookforworld commented 11 months ago

@ghostplant There's a new error 😢:

 >> Backend = c-rocm_win64, Python PID = 1251, Task = lang.generic;

// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})

// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[524288], input1:float32[524288] -> output0:float32[524288]

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__

#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w

#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)

#define __AMDGFX__ gfx1100

__forceinline__ __device__ __half hmax(const __half &a, const __half &b) { return a > b ? a : b; }
__forceinline__ __device__ __half hmin(const __half &a, const __half &b) { return a < b ? a : b; }

#endif

extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
  // [thread_extent] blockIdx.x = 524288
  // [thread_extent] threadIdx.x = 1
  output0[(((int)blockIdx.x))] = (input0[(((int)blockIdx.x))] + input1[(((int)blockIdx.x))]);
}

// ---------------------------------------------------------------------------

[EvalAgent] Evaluating Modules .. (for backend = c-rocm_win64)
+ /opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
clang++: error: no such file or directory: '/tmp/.antares-module-tempfile.cu'
clang++: error: no input files
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to execute command: sh -c 'wsl.exe sh -cx "/opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out 1>&2"'

[EvalAgent] Results = {}

[Antares] Incorrect compute kernel from evaluator.

But the file is in the right place. Is that because of permissions? Or is there something else wrong?
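Separately from the missing-file error, the log above warns that `--amdgpu-target` is deprecated in favor of `--offload-arch`. As a rough sketch of what the equivalent command line would look like, the hypothetical helper below (`build_hipcc_command` is not an Antares function) assembles the same hipcc invocation with the newer flag. It assumes hipcc lives at `/opt/rocm/bin/hipcc`, as in the log, and that the paths contain no spaces, since nothing is quoted.

```cpp
#include <string>

// Hypothetical helper, not part of Antares: builds a hipcc command line like
// the one in the log, using --offload-arch instead of the deprecated
// --amdgpu-target. Paths are assumed to contain no spaces (no quoting is done).
inline std::string build_hipcc_command(const std::string& source,
                                       const std::string& arch,
                                       const std::string& output) {
  return "/opt/rocm/bin/hipcc " + source +
         " --offload-arch=" + arch +
         " --genco -Wno-ignored-attributes -O2 -o " + output;
}
```

For example, `build_hipcc_command("/tmp/.antares-module-tempfile.cu", "gfx1100", "/tmp/.antares-module-tempfile.cu.out")` yields the command from the log with only the architecture flag changed. This silences the warning only; it does not address the "no such file or directory" failure.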

ghostplant commented 11 months ago

Can you try this version: https://files.pythonhosted.org/packages/cb/fe/5fef007100d8beaa64113d1da466a057db656ef5e0731140883bfc0ca05e/antares-0.3.24.1-py3-none-manylinux1_x86_64.whl

Lookforworld commented 11 months ago

@ghostplant The error isn't fixed. This time there is no .cu file at that path. 😢

ghostplant commented 11 months ago

Can you attach the new error logs?

Lookforworld commented 11 months ago

@ghostplant Okay, here are the logs:

 >> Backend = c-rocm_win64, Python PID = 450, Task = lang.generic;

// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})

// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[524288], input1:float32[524288] -> output0:float32[524288]

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__

#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w

#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)

#define __AMDGFX__ gfx1100

__forceinline__ __device__ __half hmax(const __half &a, const __half &b) { return a > b ? a : b; }
__forceinline__ __device__ __half hmin(const __half &a, const __half &b) { return a < b ? a : b; }

#endif

extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
  // [thread_extent] blockIdx.x = 524288
  // [thread_extent] threadIdx.x = 1
  output0[(((int)blockIdx.x))] = (input0[(((int)blockIdx.x))] + input1[(((int)blockIdx.x))]);
}

// ---------------------------------------------------------------------------

[EvalAgent] Evaluating Modules .. (for backend = c-rocm_win64)
+ /opt/rocm/bin/hipcc /mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out
Warning: The --amdgpu-target option has been deprecated and will be removed in the future.  Use --offload-arch instead.
clang++: error: no such file or directory: '/mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu'
clang++: error: no input files
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to execute command: sh -c 'wsl.exe sh -cx "/opt/rocm/bin/hipcc $TMP/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out 1>&2"'

[EvalAgent] Results = {}

[Antares] Incorrect compute kernel from evaluator.

ghostplant commented 11 months ago

Unfortunately, I cannot reproduce this. Can you help debug why these two lines failed to create the file at C:\Users\Modys\AppData\Local\Temp\.antares-module-tempfile.cu? If the file is created, then /mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu should be available inside WSL rather than reported as not found.
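The path correspondence described above can be sketched as a small function. This is an illustration only, not code from Antares: `windows_to_wsl_path` is a hypothetical name, and the sketch assumes WSL's default automount root `/mnt`, which can be changed in `wsl.conf`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Antares: maps "C:\dir\file" to
// "/mnt/c/dir/file", assuming the default WSL automount root "/mnt".
// Anything that is not a drive-letter absolute path is returned unchanged.
inline std::string windows_to_wsl_path(const std::string& win_path) {
  const bool has_drive =
      win_path.size() >= 3 &&
      std::isalpha(static_cast<unsigned char>(win_path[0])) &&
      win_path[1] == ':' &&
      (win_path[2] == '\\' || win_path[2] == '/');
  if (!has_drive) return win_path;

  std::string out = "/mnt/";
  out += static_cast<char>(std::tolower(static_cast<unsigned char>(win_path[0])));
  // Copy the rest, starting at the separator after "C:", flipping backslashes.
  for (std::size_t i = 2; i < win_path.size(); ++i)
    out += (win_path[i] == '\\') ? '/' : win_path[i];
  return out;
}
```

Applied to the Windows path above, this produces the `/mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu` path that hipcc reports as missing, which is why the first thing to check is whether the file exists on the Windows side.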

You can run vi $(antares pwd)/../backends/c-rocm_win64/include/backend.hpp to edit the file in place. After you save the changes, it will be recompiled automatically on the next Antares compilation request.
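One way to approach this debugging is to check each step of the file write explicitly. The sketch below is not taken from backend.hpp; `write_and_verify` is a hypothetical helper that assumes a C++17 compiler (for `std::filesystem`). It writes the content, then confirms the file exists with the expected size, so a failure can be narrowed down to opening, writing, or visibility of the file.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical debugging helper, not taken from backend.hpp.
// Writes `content` to `path` and reports whether the file really exists
// afterwards with the expected size.
inline bool write_and_verify(const std::filesystem::path& path,
                             const std::string& content) {
  {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;  // open failed: missing directory or no permission
    out << content;
    out.flush();
    if (!out) return false;  // write failed, e.g. disk full
  }  // the stream is closed here, before the existence check

  std::error_code ec;
  if (!std::filesystem::exists(path, ec) || ec) return false;
  return std::filesystem::file_size(path, ec) == content.size() && !ec;
}
```

Temporarily calling a check like this right where the backend writes `.antares-module-tempfile.cu`, and printing the result, would show whether the write itself fails or whether the file is created but not visible from the WSL side.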

Lookforworld commented 11 months ago

@ghostplant OK! Thank you very much! If I manage to fix it, I'll let you know.