microsoft / antares

Antares: an automatic engine for multi-platform kernel generation and optimization. Supporting CPU, CUDA, ROCm, DirectX12, GraphCore, SYCL for CPU/GPU, OpenCL for AMD/NVIDIA, Android CPU/GPU backends.
Other
435 stars 45 forks source link

Running ROCm computations on Windows over AMD GPU #345

Open LuisB79 opened 2 years ago

LuisB79 commented 2 years ago

Add proper documentation for antares usages, if possible some examples would be a great help for new people that want to test Antares.

ghostplant commented 2 years ago

Thanks for you request. Recently we are busy in updating some other functionalities. Later we'll add corresponding doc that allows you to use Rocm program for Windows more easily.

ffleader1 commented 2 years ago

Second this. I think OP has a 6800M and can run Rocm on Windows. That is awesome.

LuisB79 commented 2 years ago

Thanks for you request. Recently we are busy in updating some other functionalities. Later we'll add corresponding doc that allows you to use Rocm program for Windows more easily.

I understand, thank you for your progress.

ghostplant commented 2 years ago

Thanks for you request. Recently we are busy in updating some other functionalities. Later we'll add corresponding doc that allows you to use Rocm program for Windows more easily.

I understand, thank you for your progress.

You can try follow 2 steps in advance: Step-1: Generate Kernel Source

BACKEND=c-rocm_win64 COMPUTE_V1='- einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})' antares save kernel0.hip.cc

Step-2: Build Source into HSACO

hipcc kernel0.hip.cc --amdgpu-target=gfx1031 --genco -O2 -o kernel0.hip.hsaco

Step-3: Write win64 programs without WSL to utilize kernel0.hip.hsaco (a little complex, will be updated later)

LuisB79 commented 2 years ago

Thanks for you request. Recently we are busy in updating some other functionalities. Later we'll add corresponding doc that allows you to use Rocm program for Windows more easily.

I understand, thank you for your progress.

You can try follow 2 steps in advance: Step-1: Generate Kernel Source

BACKEND=c-rocm_win64 COMPUTE_V1='- einstein_v2("output0[N] = input0[N] + input1[N]", input_dict={"input0": {"dtype": "float32", "shape": [1024 * 512]}, "input1": {"dtype": "float32", "shape": [1024 * 512]}})' antares save kernel0.hip.cc

Step-2: Build Source into HSACO

hipcc kernel0.hip.cc --amdgpu-target=gfx1031 --genco -O2 -o kernel0.hip.hsaco

Step-3: Write win64 programs without WSL to utilize kernel0.hip.hsaco (a little complex, will be updated later)

thank you

ghostplant commented 2 years ago

Thanks for your waiting. The feature is now supported. Please try commands below:

# Upgrade Antares version
$ pip3 install antares==0.3.15.1 --upgrade

# Your GFX1031 video card is not as the first rank of AMDGPUs, so you need this to select the correct GPU index.
$ export DEVICE_ID=1

# Save the code that you want to compile for Windows ROCm
$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares save ./amd_example.cpp

# Compile it into a clean folder: dest-outputs
$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares compile ./amd_example.cpp dest-outputs/

# You'll get a simple project that can be built by MINGW64 or VC++, which is no longer related to WSL.
$ cd dest-outputs/ && make

After that, two files `kernels.bin` and `main.exe` (put together) can be executed on Windows without GCC and WSL dependencies.
LuisB79 commented 2 years ago
# Upgrade Antares version
$ pip3 install antares==0.3.15.1 --upgrade

# Your GFX1031 video card is not as the first rank of AMDGPUs, so you need this to select the correct GPU index.
$ export DEVICE_ID=1

# Save the code that you want to compile for Windows ROCm
$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares save ./amd_example.cpp

# Compile it into a clean folder: dest-outputs
$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares compile ./amd_example.cpp dest-outputs/

# You'll get a simple project that can be built by MINGW64 or VC++, which is no longer related to WSL.
$ cd dest-outputs/ && make

After that, two files `kernels.bin` and `main.exe` (put together) can be executed on Windows without GCC and WSL dependencies.

Awesome!, i will try it out!

LuisB79 commented 2 years ago

i'm getting this error

comp@U_03:~/bm3dhip$ dir
dest-outputs  kernel.hip.cpp  source.cpp
comp@U_03:~/bm3dhip$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares save ./kernel.hip.cpp
  >> Backend = c-rocm_win64, Python PID = 16261, Task = lang.generic;

// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[1024, 512], input1:float32[1024, 512] -> output0:float32[1024, 512]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[1024, 512], input1:float32[1024, 512] -> output0:float32[1024, 512]

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__

#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w

#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)

#define __AMDGFX__ gfx1031

#endif

extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
  // [thread_extent] blockIdx.x = 1024
  // [thread_extent] threadIdx.x = 1
  // [thread_extent] blockIdx.y = 512
  // [thread_extent] threadIdx.y = 1
  output0[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))] = (input0[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))] + input1[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))]);
}

// ---------------------------------------------------------------------------

[EvalAgent] Compiling Evaluator: x86_64-w64-mingw32-g++ /home/comp/.local/lib/python3.8/site-packages/antares_core/backends/c-rocm_win64/../../graph_evaluator/run_graph.cpp -o /home/comp/.cache/antares/evaluator.c-rocm_win64.tmp -D__BACKEND__=\"c-rocm_win64\" -D__BACKEND_rocm_win64__ -I/home/comp/.local/lib/python3.8/site-packages/antares_core/backends/c-rocm_win64/include -std=c++17 -Wno-string-compare -Wno-unused-result -Wno-unused-value -O2 -static -lpthread

[EvalAgent] Evaluating Modules .. (with backend = c-rocm_win64)
'\\wsl.localhost\Ubuntu-20.04\home\comp\.cache\antares\cache\_'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
+ /opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1031 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out

[EvalAgent] Results = {"K/0": 1504583185.0, "TPR": 0.00486145}

[Antares] Average time cost / run = 0.00486145 sec, 0.107846 gflops. (Checked: None)

comp@U_03:~/bm3dhip$ AMDGFX=gfx1031 BACKEND=c-rocm_win64 antares compile ./kernel.hip.cpp deest-output
s/
  >> Backend = c-rocm_win64, Python PID = 19458, Task = lang.generic;

// ---------------------------------------------------------------------------
// GLOBALS: input0:float32[1024, 512], input1:float32[1024, 512] -> output0:float32[1024, 512]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

// GLOBALS: input0:float32[1024, 512], input1:float32[1024, 512] -> output0:float32[1024, 512]
// BACKEND: c-rocm_win64 (default)
// CONFIG: null
// COMPUTE_V1: - einstein_v2("output0[N, M] = input0[N, M] + input1[N, M]", input_dict={"input0": {"dtype": "float32", "shape": [1024, 512]}, "input1": {"dtype": "float32", "shape": [1024, 512]}})

// ---------------------------------------------------------------------------
// LOCAL: template_op_kernel0 -- input0:float32[1024, 512], input1:float32[1024, 512] -> output0:float32[1024, 512]

#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>

#ifndef __ROCM_COMMON_MACRO__
#define __ROCM_COMMON_MACRO__

#define __ITEM_0_OF__(v) (v).x
#define __ITEM_1_OF__(v) (v).y
#define __ITEM_2_OF__(v) (v).z
#define __ITEM_3_OF__(v) (v).w

#define __STORE_ITEM_0__(t, out, ido, in, idi) *(t*)(out + ido) = *(t*)(in + idi)
#define __STORE_ITEM_1__(t, out, ido, in, idi)
#define __STORE_ITEM_2__(t, out, ido, in, idi)
#define __STORE_ITEM_3__(t, out, ido, in, idi)

#define __AMDGFX__ gfx1031

#endif

extern "C" __global__ __launch_bounds__(1) void template_op_kernel0(float* __restrict__ input0, float* __restrict__ input1, float* __restrict__ output0) {
  // [thread_extent] blockIdx.x = 1024
  // [thread_extent] threadIdx.x = 1
  // [thread_extent] blockIdx.y = 512
  // [thread_extent] threadIdx.y = 1
  output0[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))] = (input0[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))] + input1[(((((int)blockIdx.x) * 512) + ((int)blockIdx.y)))]);
}

// ---------------------------------------------------------------------------

[EvalAgent] Evaluating Modules .. (with backend = c-rocm_win64)
'\\wsl.localhost\Ubuntu-20.04\home\comp\.cache\antares\cache\_'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
+ /opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1031 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out

[EvalAgent] Results = {"K/0": 1504583185.0, "TPR": 0.00469551}

[Antares] Average time cost / run = 0.00469551 sec, 0.111657 gflops. (Checked: None)

'\\wsl.localhost\Ubuntu-20.04\home\comp\.cache\antares\cache\_'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
+ /opt/rocm/bin/hipcc /tmp/.antares-module-tempfile.cu --amdgpu-target=gfx1031 --genco -Wno-ignored-attributes -O2 -o /tmp/.antares-module-tempfile.cu.out
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Traceback (most recent call last):
  File "./antares/antares_compiler.py", line 712, in <module>
    main_compute()
  File "./antares/antares_compiler.py", line 618, in main_compute
    hex_code = eval_client.eval(kernel_path=kernel_path, dev_id=0, backend_root=backend_root, compile=1)['HEX']
KeyError: 'HEX'
LuisB79 commented 2 years ago

i'm trying to compile a vapoursynth filter that was made in cuda, it has the cuda or hip kernel kernel.hip.cpp and a vapoursynth wrapper source.cpp which needs some libraries that are in a "vapoursynth" folder, the end result will be a dll. If you wish i could upload the source code.

ghostplant commented 2 years ago

i'm trying to compile a vapoursynth filter that was made in cuda, it has the cuda or hip kernel kernel.hip.cpp and a vapoursynth wrapper source.cpp which needs some libraries that are in a "vapoursynth" folder, the end result will be a dll. If you wish i could upload the source code.

Please try this version: pip3 install antares==0.3.15.2. I think the problem is solved.

LuisB79 commented 2 years ago

i'm trying to compile a vapoursynth filter that was made in cuda, it has the cuda or hip kernel kernel.hip.cpp and a vapoursynth wrapper source.cpp which needs some libraries that are in a "vapoursynth" folder, the end result will be a dll. If you wish i could upload the source code.

Please try this version: pip3 install antares==0.3.15.2. I think the problem is solved.

it did worked as intended, i got the main.exe and the kernel.bin, i haven't runned it yet, i'm searching to get a dll out instead of an exe, how could i do that?, do i save both kernel.hip.cpp and source.cpp, then compile both into the clean folder? to later be compiled in mingw?, this is the source i'm trying to work with. hip_source.zip

LuisB79 commented 2 years ago

in a normal rocm linux environment i would need to use this to compile it/opt/rocm/hip/bin/hipcc source.cpp kernel.hip.cpp -o libbm3dhip.so -shared -fPIC -std=c++17 -O3 -I/home/comp/vapoursynth -Wno-unused-result --offload-arch=gfx1031 $(/opt/rocm/hip/bin/hipconfig --cxx_config)

ghostplant commented 2 years ago

You have to use ROCm driver-level API to dispatch kernel workloads for Windows ROCm. Definitely, your source code is not based on that style. antares compile <kernel.cpp> <outdir> generates a minimum driver-level example that shows you on how to do this correctly. You have to follow that.

kernel.bin is the object that you should precompile bm3d function in wsl via hipcc --genco .., and for hip APIs like hipSetDevice/hipMalloc/.. in source.cpp, you should change them according to https://github.com/microsoft/antares/blob/v0.3.x/backends/c-rocm_win64/include/backend.hpp#L24-L25.

LuisB79 commented 2 years ago

I see, i will see if the friend who is developing this can help me when he haves time.