Open Lookforworld opened 11 months ago
0.9.x is a new implementation that works on Windows without WSL, while all the old features are kept. So you can still follow "Path 1" to run ROCm custom kernels. The problem you ran into looks like a C++ compatibility issue; let me fix it.
BTW, have you considered switching to DirectX on Windows instead of ROCm on Windows?
@ghostplant Thanks for your reply. I have successfully installed ROCm and the AMD SDK on Windows and compiled llama.cpp, but I have not been able to use ROCm in WSL. I want ROCm in WSL because many of the libraries I use do not support Windows or DirectX. The Antares I installed via "Path 1" is always 0.3.x rather than 0.9.x. I would like to see whether 0.9.x works; what should I do? My device is gfx1100 (7900 XTX).
I just pushed a commit to fix the error you reported. Did you install the AMD SDK released for Win32, or HIP released for Linux inside WSL?
@ghostplant Thanks! My HIP in WSL:
HIP version: 5.7.31921-d1770ee1b
AMD clang version 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.7.0 23352 d1e13c532a947d0cbfc94759c00dcf152294aa13)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.7.0/llvm/bin
And the Antares I installed via "Path 1" is always 0.3.x rather than 0.9.x. How do I install 0.9.x for use with ROCm?
v0.9.x is a re-implementation without the WSL dependency. It may take time to merge all the features from v0.3.x. I suggest using v0.9.x with DirectX for now, since the two are similar in performance.
@ghostplant Thanks very much! How can I install the fixed version now?
The fix has just been published to PyPI. You can install antares >= 0.3.24.0 from pip to get it: https://pypi.org/project/antares/#files
Thanks for your kind reply! I'll try it!
@ghostplant There's a new error 😢:
>> Backend = c-rocm_win64, Python PID = 1251, Task = lang.generic;
// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})
// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__
#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w
#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)
#define __AMDGFX__ gfx1100
__forceinline__ __device__ __half hmax(const __half &a, const __half &b) { return a > b ? a : b; }
__forceinline__ __device__ __half hmin(const __half &a, const __half &b) { return a < b ? a : b; }
#endif
extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
// [thread_extent] blockIdx.x = 524288
// [thread_extent] threadIdx.x = 1
output0[(((int)blockIdx.x))] = (input0[(((int)blockIdx.x))] + input1[(((int)blockIdx.x))]);
}
// ---------------------------------------------------------------------------
[EvalAgent] Evaluating Modules .. (for backend = c-rocm_win64)
+ /opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
clang++: error: no such file or directory: '/tmp/.antares-module-tempfile.cu'
clang++: error: no input files
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to execute command: sh -c 'wsl.exe sh -cx "/opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out 1>&2"'
[EvalAgent] Results = {}
[Antares] Incorrect compute kernel from evaluator.
But the file is in the right place. Is this a permissions issue, or is something else wrong?
Can you try this version: https://files.pythonhosted.org/packages/cb/fe/5fef007100d8beaa64113d1da466a057db656ef5e0731140883bfc0ca05e/antares-0.3.24.1-py3-none-manylinux1_x86_64.whl
@ghostplant The error is not fixed. This time there is no .cu file at that path. 😢
Can you attach the new error logs?
@ghostplant Okay, here are the logs:
>> Backend = c-rocm_win64, Python PID = 450, Task = lang.generic;
// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})
// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[524288], input1:float32[524288] -> output0:float32[524288]
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__
#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w
#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)
#define __AMDGFX__ gfx1100
__forceinline__ __device__ __half hmax(const __half &a, const __half &b) { return a > b ? a : b; }
__forceinline__ __device__ __half hmin(const __half &a, const __half &b) { return a < b ? a : b; }
#endif
extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
// [thread_extent] blockIdx.x = 524288
// [thread_extent] threadIdx.x = 1
output0[(((int)blockIdx.x))] = (input0[(((int)blockIdx.x))] + input1[(((int)blockIdx.x))]);
}
// ---------------------------------------------------------------------------
[EvalAgent] Evaluating Modules .. (for backend = c-rocm_win64)
+ /opt/rocm/bin/hipcc /mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out
Warning: The --amdgpu-target option has been deprecated and will be removed in the future. Use --offload-arch instead.
clang++: error: no such file or directory: '/mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu'
clang++: error: no input files
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to execute command: sh -c 'wsl.exe sh -cx "/opt/rocm/bin/hipcc $TMP/.antares-module-tempfile.cu --amdgpu-target=gfx1100 --genco -Wno-ignored-attributes -O2 -o .antares-module-tempfile.cu.out 1>&2"'
[EvalAgent] Results = {}
[Antares] Incorrect compute kernel from evaluator.
Unfortunately I cannot reproduce this. Could you help debug why these two lines fail to create the file at C:\Users\Modys\AppData\Local\Temp\.antares-module-tempfile.cu?
If the file is created, then /mnt/c/Users/Modys/AppData/Local/Temp/.antares-module-tempfile.cu
should be available inside WSL rather than reported as not found.
You can run vi $(antares pwd)/../backends/c-rocm_win64/include/backend.hpp
to edit the file in place. After you save the changes, it will be recompiled automatically the next time any Antares compilation request runs.
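For anyone debugging the same mismatch, the mapping between the two paths above can be sketched in a few lines of C++. This is only an illustration that assumes the default WSL automount layout (drive `C:` exposed as `/mnt/c`); `to_wsl_path` is a hypothetical helper name, not a function from Antares or from backend.hpp.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Antares): map a Windows path such as
// C:\Users\me\file.cu to /mnt/c/Users/me/file.cu, assuming the default WSL
// automount root /mnt. A path without a drive-letter prefix only has its
// backslashes converted to forward slashes.
std::string to_wsl_path(const std::string& win_path) {
    std::string out;
    std::size_t start = 0;
    if (win_path.size() >= 2 &&
        std::isalpha(static_cast<unsigned char>(win_path[0])) &&
        win_path[1] == ':') {
        out = "/mnt/";
        out += static_cast<char>(
            std::tolower(static_cast<unsigned char>(win_path[0])));
        start = 2;  // skip the "C:" prefix
    }
    for (std::size_t i = start; i < win_path.size(); ++i) {
        out += (win_path[i] == '\\') ? '/' : win_path[i];
    }
    return out;
}
```

Printing the translated path next to the one hipcc receives in the log makes it easy to see whether the two sides agree on the file location. If they do and the file is still missing, the write on the Windows side is the more likely suspect.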
@ghostplant OK, thank you very much! If I manage to fix it, I will let you know.
@ghostplant I have been trying to run ROCm on WSL without finding a good way, and this project finally looks like a silver lining. I want to try version 0.9.x but cannot find a whl that supports ROCm. With version 0.3.x installed, BACKEND=c-rocm reports that the GPU cannot be found, and BACKEND=c-rocm_win64 fails with "/home/root001/miniconda3/lib/python3.11/site-packages/antares_core/backends/c-rocm_win64/../../graph_evaluator/run_graph.cpp:14:29: error: ‘memalign’ was not declared in this scope 14 | void *data_ptr = (void*)memalign(256, length);". I don't know where to start fixing this error. Is there an official guide that describes the correct steps? 😒
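On the `memalign` error: `memalign` is a non-standard glibc function declared in `<malloc.h>`, so one possible cause (not confirmed in this thread) is that the header is not visible when run_graph.cpp is compiled. A minimal sketch of a portable alternative on POSIX systems is to request the same 256-byte alignment through `posix_memalign`. The `alloc_aligned` name below is hypothetical and is not taken from the Antares sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical stand-in for the memalign(256, length) call in the error:
// returns a block aligned to `alignment` bytes, or nullptr on failure.
// `alignment` must be a power of two and a multiple of sizeof(void*).
void* alloc_aligned(std::size_t alignment, std::size_t length) {
    void* ptr = nullptr;
    if (posix_memalign(&ptr, alignment, length) != 0) {
        return nullptr;
    }
    return ptr;  // release with free() when done
}
```

A call such as `alloc_aligned(256, length)` would give the same alignment as the failing line, and the result is released with `free`. Whether patching the installed file this way is appropriate depends on how the maintainer intends that code to be built.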