intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.26k stars 744 forks source link

Is there any documentation on address mapping to help developers use it? #13159

Open wangzy0327 opened 8 months ago

wangzy0327 commented 8 months ago

Is your feature request related to a problem? Please describe

It is planned to expand new hardware based on SYCL. No relevant guidance has been found regarding the development of the address mapping part.Can you provide instructions or documents on address mapping for developers to refer to? This is the code part for the relevant address mapping based on 2022-06 version.What is the meaning of the contents of the NVPTXAddrSpaceMap variable? Which source files are involved in the relevant address space and the APIs called? @AlexeySachkov @elizabethandrews

Can you give me some help?

Describe the solution you would like

It is planned to expand new hardware based on SYCL about device memory access development.

Describe alternatives you have considered

No response

Additional context

No response

asudarsa commented 8 months ago

Such info can be found in specification documents. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_memory_model https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:memory.model https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-model

Hope that gets you started.

Thanks

KornevNikita commented 8 months ago

@wangzy0327 hi, did Arvind's answer help you?

wangzy0327 commented 8 months ago

@wangzy0327 hi, did Arvind's answer help you? When I was extending the SYCL code, I encountered the following error. It looks like an address space mapping problem. Can you give me some suggestions? How to analyze or debug?

@KornevNikita

The error line is PI_CHECK_ERROR(cnQueueSync(s)); pi_cnrt.cpp

pi_result cnrt_piQueueRelease(pi_queue command_queue) {
  assert(command_queue != nullptr);

  if (command_queue->decrement_reference_count() > 0) {
    return PI_SUCCESS;
  }

  try {
    std::unique_ptr<_pi_queue> queueImpl(command_queue);

    ScopedContext active(command_queue->get_context());

    command_queue->for_each_queue([](CNqueue s) {
      PI_CHECK_ERROR(cnQueueSync(s));
      PI_CHECK_ERROR(cnDestroyQueue(s));
    });

    return PI_SUCCESS;
  } catch (pi_result err) {
    return err;
  } catch (...) {
    return PI_ERROR_OUT_OF_RESOURCES;
  }
}

This is test program about the device extend. test_demo.cpp

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
#include <sys/time.h>
using namespace sycl;

constexpr int N = 256;

long long getTime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (tv.tv_sec*1000000 + tv.tv_usec);
}

int main(){
    sycl::queue q;
    auto dev = q.get_device();
    float *a = (float *)malloc(sizeof(float) * N);
    float *b = (float *)malloc(sizeof(float) * N);
    float *c = (float *)malloc(sizeof(float) * N);
    float *c_host = (float *)malloc(sizeof(float) * N);

    for(int i = 0;i < N;i++){
        a[i] = 0.5f;b[i] = 0.5f;c[i] = 0.0f;c_host[i] = 1.0f;
    }

    range<1> arr_range(N);

    sycl::buffer<float,1> bufferA((float*)a,arr_range);
    sycl::buffer<float,1> bufferB((float*)b,arr_range);
    sycl::buffer<float,1> bufferC((float*)c,arr_range);

    auto startTime = getTime();
    q.submit([&](handler &h){
        sycl::accessor aA{bufferA,h,read_only};
        sycl::accessor aB{bufferB,h,read_only};
        sycl::accessor aC{bufferC,h,write_only};
        sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local> localAccA(N,h);
        sycl::accessor<float, 1, sycl::access::mode::read_write, sycl::access::target::local> localAccB(N,h);

        h.parallel_for<>(1,[=](sycl::id<1> i){
            for(int j = 0;j < N;j++){
                localAccA[j] = aA[j];
                localAccB[j] = aB[j];
                aC[j] = localAccA[j] + localAccB[j];
            }
        });

    });
    sycl::host_accessor host_accC(bufferC,read_only);
    std::cout << "Result: " << host_accC[0] << " .. " << host_accC[N - 1] << std::endl;    
    auto endTime = getTime();
    std::cout << "Time : " << endTime - startTime <<" us "<< std::endl;
    free(a);
    free(b);
    free(c);
    free(c_host);
    return 0;
asudarsa commented 8 months ago

Hi @wangzy0327

I tried to compile your code using 'clang++ -fsycl test.cpp'. Hope that is the right way. I ran into a few issues. When I looked closer at your code, I saw a few issues:

  1. h.parallel_for<>(1,[=](sycl::id<1> i){ ...} --> 'i' is not used inside the kernel
  2. According to SYCL 2020 doc, local accessors cannot be used in the parallel_for overloaded version you are using. Please look at 4.9.4.2.2. parallel_for invoke for details.

Thanks

wangzy0327 commented 7 months ago

I tried to compile the above sample code using the cuda version and extended hardware version of sycl released in 2022-06. The device-side llvm ir code compiled by sycl-cuda is as follows.

simple-add-sm_70.ll ``` ; Function Attrs: noinline norecurse define weak_odr dso_local void @_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_EUlNS0_2idILi1EEEE_(float addrspace(3)* noundef align 4 %_arg_localAccA, float add rspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !24 !kernel_arg_runtime_aligned !25 !kernel_arg_exclusive_ptr !25 !sycl_kernel_omit_args !26 {entry: %0 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0 %1 = load i64, i64* %0, align 8 %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %1 %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0 %3 = load i64, i64* %2, align 8 %add.ptr.i41 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %3 %4 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #4 %conv.i.i.i.i.i.i.i = sext i32 %4 to i64 %5 = tail call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #4 %conv.i1.i.i.i.i.i.i = sext i32 %5 to i64 %mul.i.i.i.i.i.i = mul nsw i64 %conv.i1.i.i.i.i.i.i, %conv.i.i.i.i.i.i.i %6 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #4 %conv.i2.i.i.i.i.i.i = sext i32 %6 to i64 %add.i.i.i.i.i.i = add nsw i64 %mul.i.i.i.i.i.i, %conv.i2.i.i.i.i.i.i %7 = tail call i32* @llvm.nvvm.implicit.offset() #4 %8 = load i32, i32* %7, align 4, !tbaa !14 %conv.i3.i.i.i.i.i.i = zext i32 %8 to i64 %add4.i.i.i.i.i.i = add nsw i64 %add.i.i.i.i.i.i, %conv.i3.i.i.i.i.i.i %cmp.i.i.i = icmp ult i64 %add4.i.i.i.i.i.i, 2147483648 tail call void @llvm.assume(i1 %cmp.i.i.i) #4 br label %for.body.i for.body.i: ; preds = %for.body.i, %entry %j.015.i = phi i32 [ 0, %entry ], [ %inc.i.1, %for.body.i ] %conv.i = zext i32 %j.015.i to i64 %arrayidx.i.i42 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %conv.i %arrayidx.ascast.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i42 to float* %9 = load float, float* %arrayidx.ascast.i.i, align 4, !tbaa !18 %arrayidx.i3.i = getelementptr inbounds float, float addrspace(3)* %_arg_localAccA, i64 %conv.i %arrayidx.ascast.i4.i = addrspacecast float addrspace(3)* %arrayidx.i3.i to float* store float %9, float* %arrayidx.ascast.i4.i, align 4, !tbaa !18 %add.i = fadd float %9, 5.000000e-01 %arrayidx.i11.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %conv.i %arrayidx.ascast.i12.i = addrspacecast float addrspace(1)* %arrayidx.i11.i to float* store float %add.i, float* %arrayidx.ascast.i12.i, align 4, !tbaa !18 %inc.i = or i32 %j.015.i, 1 %conv.i.1 = zext i32 %inc.i to i64 %arrayidx.i.i42.1 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %conv.i.1 %arrayidx.ascast.i.i.1 = addrspacecast float addrspace(1)* %arrayidx.i.i42.1 to float* %10 = load float, float* %arrayidx.ascast.i.i.1, align 4, !tbaa !18 %arrayidx.i3.i.1 = getelementptr inbounds float, float addrspace(3)* %_arg_localAccA, i64 %conv.i.1 %arrayidx.ascast.i4.i.1 = addrspacecast float addrspace(3)* %arrayidx.i3.i.1 to float* store float %10, float* %arrayidx.ascast.i4.i.1, align 4, !tbaa !18 %add.i.1 = fadd float %10, 5.000000e-01 %arrayidx.i11.i.1 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %conv.i.1 %arrayidx.ascast.i12.i.1 = addrspacecast float addrspace(1)* %arrayidx.i11.i.1 to float* store float %add.i.1, float* %arrayidx.ascast.i12.i.1, align 4, !tbaa !18 %inc.i.1 = add nuw nsw i32 %j.015.i, 2 %exitcond.not.i.1 = icmp eq i32 %inc.i.1, 256 br i1 %exitcond.not.i.1, label %_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit, label %for.body.i, !llvm.loop !22 _ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit: ; preds = %for.body.i ret void ; uselistorder directives uselistorder float addrspace(3)* %_arg_localAccA, { 1, 0 } uselistorder i32 %j.015.i, { 1, 0, 2 } uselistorder i32 %inc.i.1, { 1, 0 } } ```

The device-side llvm ir code compiled by the extended hardware is as follows.

simple-add-mtp_372.ll ``` ; Function Attrs: convergent noinline norecurse define weak_odr dso_local void @_ZTSN2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EE(%"cl ass.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_NumWorkItems, float addrspace(101)* noundef align 4 %_arg_localAccA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA1, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA2, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_localAccA3, float addrspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA4, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA5, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC7, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC8, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !9 !kernel_arg_runtime_aligned !10 !kernel_arg_exclusive_ptr !10 {entry: %0 = getelementptr inbounds %"class.cl::sycl::range", %"class.cl::sycl::range"* %_arg_NumWorkItems, i64 0, i32 0, i32 0, i64 0 %1 = load i64, i64* %0, align 8 %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0 %3 = load i64, i64* %2, align 8 %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %3 %4 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0 %5 = load i64, i64* %4, align 8 %add.ptr.i44 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %5 %6 = tail call i32 @llvm.mlvm.read.mlu.sreg.taskidx() #5 %conv.i.i.i.i.i.i = sext i32 %6 to i64 %call.i.i.i.i.i.i = tail call i64 @_Z23__spirv_NumWorkgroups_xv() #6 %call1.i.i.i.i.i.i = tail call i64 @_Z23__spirv_WorkgroupSize_xv() #6 %call.i.i.i.i.i = tail call noundef i64 @_Z22__spirv_GlobalOffset_xv() #7 %cmp.i.i = icmp sgt i32 %6, -1 tail call void @llvm.assume(i1 %cmp.i.i) #5 %cmp.not.i = icmp ugt i64 %1, %conv.i.i.i.i.i.i br i1 %cmp.not.i, label %for.body.i.i, label %_ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idI Li1EEEE_EclES4_.exit for.body.i.i: ; preds = %entry, %for.body.i.i %indvars.iv.i.i = phi i64 [ %indvars.iv.next.i.i, %for.body.i.i ], [ 0, %entry ] %arrayidx.i.i6.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %indvars.iv.i.i %arrayidx.ascast.i.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i6.i to float* %7 = load float, float* %arrayidx.ascast.i.i.i, align 4, !tbaa !11 %arrayidx.i3.i.i = getelementptr inbounds float, float addrspace(101)* %_arg_localAccA, i64 %indvars.iv.i.i %arrayidx.ascast.i4.i.i = addrspacecast float addrspace(101)* %arrayidx.i3.i.i to float* store float %7, float* %arrayidx.ascast.i4.i.i, align 4, !tbaa !11 %add.i.i = fadd float %7, 5.000000e-01 %arrayidx.i11.i.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i44, i64 %indvars.iv.i.i %arrayidx.ascast.i12.i.i = addrspacecast float addrspace(1)* %arrayidx.i11.i.i to float* store float %add.i.i, float* %arrayidx.ascast.i12.i.i, align 4, !tbaa !11 %indvars.iv.next.i.i = add nuw nsw i64 %indvars.iv.i.i, 1 %exitcond.not.i.i = icmp eq i64 %indvars.iv.next.i.i, 256 br i1 %exitcond.not.i.i, label %_ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4 _.exit, label %for.body.i.i, !llvm.loop !15 _ZNK2cl4sycl6detail18RoundedRangeKernelINS0_4itemILi1ELb1EEELi1EZZ4mainENKUlRNS0_7handlerEE_clES6_EUlNS0_2idILi1EEEE_EclES4_.exit: ; preds = %for.body.i.i, % entry ret void ; uselistorder directives uselistorder label %for.body.i.i, { 1, 0 } uselistorder i64 %indvars.iv.next.i.i, { 1, 0 } } ; Function Attrs: inaccessiblememonly mustprogress nocallback nofree nosync nounwind willreturn declare void @llvm.assume(i1 noundef) #1 ; Function Attrs: convergent declare dso_local noundef i64 @_Z22__spirv_GlobalOffset_xv() local_unnamed_addr #2 ; Function Attrs: convergent noinline norecurse define weak_odr dso_local void @_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_EUlNS0_2idILi1EEEE_(float addrspace(101)* noundef align 4 %_arg_localAccA, %"class .cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA1, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_localAccA2, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_localAccA3, float addrspace(1)* noundef readonly align 4 %_arg_aA, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA4, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aA5, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aA6, float addrspace(1)* noundef align 4 %_arg_aC, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC7, %"class.cl::sycl::range"* noundef byval(%"class.cl::sycl::range") align 8 %_arg_aC8, %"class.cl::sycl::id"* noundef byval(%"class.cl::sycl::id") align 8 %_arg_aC9) local_unnamed_addr #0 comdat !kernel_arg_buffer_location !17 !kernel_arg_runtime_aligned !18 !kernel_arg_exclusive_ptr !18 {entry: %0 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aA6, i64 0, i32 0, i32 0, i64 0 %1 = load i64, i64* %0, align 8 %add.ptr.i = getelementptr inbounds float, float addrspace(1)* %_arg_aA, i64 %1 %2 = getelementptr inbounds %"class.cl::sycl::id", %"class.cl::sycl::id"* %_arg_aC9, i64 0, i32 0, i32 0, i64 0 %3 = load i64, i64* %2, align 8 %add.ptr.i41 = getelementptr inbounds float, float addrspace(1)* %_arg_aC, i64 %3 %4 = tail call i32 @llvm.mlvm.read.mlu.sreg.taskidx() #5 %call.i.i.i.i.i.i = tail call i64 @_Z23__spirv_NumWorkgroups_xv() #6 %call1.i.i.i.i.i.i = tail call i64 @_Z23__spirv_WorkgroupSize_xv() #6 %call.i.i.i.i.i = tail call noundef i64 @_Z22__spirv_GlobalOffset_xv() #7 %cmp.i.i = icmp sgt i32 %4, -1 tail call void @llvm.assume(i1 %cmp.i.i) #5 br label %for.body.i for.body.i: ; preds = %for.body.i, %entry %indvars.iv.i = phi i64 [ 0, %entry ], [ %indvars.iv.next.i, %for.body.i ] %arrayidx.i.i42 = getelementptr inbounds float, float addrspace(1)* %add.ptr.i, i64 %indvars.iv.i %arrayidx.ascast.i.i = addrspacecast float addrspace(1)* %arrayidx.i.i42 to float* %5 = load float, float* %arrayidx.ascast.i.i, align 4, !tbaa !11 %arrayidx.i3.i = getelementptr inbounds float, float addrspace(101)* %_arg_localAccA, i64 %indvars.iv.i %arrayidx.ascast.i4.i = addrspacecast float addrspace(101)* %arrayidx.i3.i to float* store float %5, float* %arrayidx.ascast.i4.i, align 4, !tbaa !11 %add.i = fadd float %5, 5.000000e-01 %arrayidx.i11.i = getelementptr inbounds float, float addrspace(1)* %add.ptr.i41, i64 %indvars.iv.i %arrayidx.ascast.i12.i = addrspacecast float addrspace(1)* %arrayidx.i11.i to float* store float %add.i, float* %arrayidx.ascast.i12.i, align 4, !tbaa !11 %indvars.iv.next.i = add nuw nsw i64 %indvars.iv.i, 1 %exitcond.not.i = icmp eq i64 %indvars.iv.next.i, 256 br i1 %exitcond.not.i, label %_ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit, label %for.body.i, !llvm.loop !15 _ZZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_ENKUlNS0_2idILi1EEEE_clES5_.exit: ; preds = %for.body.i ret void ; uselistorder directives uselistorder i64 %indvars.iv.next.i, { 1, 0 } } ```

It is found that the handler of the extended hardware does not have the address 1 address number. How to fix this problem? How are the variable parameters of address 1 address defined and used? @KornevNikita @sommerlukas @elizabethandrews reference to source code (clang/lib/Basic/Targets/NVPTX.h)NVPTXAddrSpaceMap

github-actions[bot] commented 5 months ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

Thanks!

wangzy0327 commented 5 months ago

How to develop the address space mapping for expanding new hardware? Can you give some specific suggestions and guidance? @asudarsa

github-actions[bot] commented 3 months ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

Thanks!

github-actions[bot] commented 1 month ago

Hi! There have been no updates for at least the last 60 days, though the issue has assignee(s).

@asudarsa, could you please take one of the following actions:

Thanks!