llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.85k stars 11.92k forks source link

[CUDA][HIP] comparison of device functions not allowed in host function #105825

Open yxsamliu opened 2 months ago

yxsamliu commented 2 months ago

HIP/CUDA has separate compilations for host and device. Instructions of host functions are generated by host compilation, during which the compiler has no access to device function pointers. The device functions seen by host code is just a placeholder address, not the real device function address. If the placeholder address is stored to a variable, then passed to kernel and called there, it won’t work. To avoid misuse, clang forbids using of device functions in host function.

However, if device function is used for comparison with each other, it should be fine, e.g.

__device__ float dfn1(float) { return 1;}
__device__ float dfn2(float) { return 2;}

template<float (*OP)(float)> 
__global__  void some_kernel(float *x, float y) {
  *x = OP(y) + 10 *y;
}

template<float (*OP)(float)>
void run_kernel(float* x) {
   constexpr float param = (OP == &dfn1) ? 1 : 0;
   some_kernel<OP><<<1,1>>>(x, param);
}

void run(float* x) {
    run_kernel<dfn1>(x);
}

However, currently clang diagnose the above code (https://godbolt.org/z/d4aW1oor4 ) whereas nvcc allows it (https://godbolt.org/z/YWYKeTr67).

Basically, nvcc only diagnose call of device functions in host function and allows other uses, while clang diagnose any ODR-use of device functions in host functions.

I think we may want to be consistent with nvcc regarding use of device functions in host functions.

@Artem-B

Artem-B commented 2 months ago

Basically, nvcc only diagnose call of device functions in host function and allows other uses, while clang diagnose any ODR-use of device functions in host functions.

What does "allow other uses" mean for something that physically does not exist on the host side? OK, for comparison we can return true if we're comparing device-side pointer to itself, and false for any comparison with a host pointer. But what are we supposed to do if someone wants to know the value of that pointer?

What NVCC appears to do is it creates a dummy instances of __device__ functions on the host side, that just call exit, and uses them for comparison. If you were to pass that pointer to the GPU side, it would likely be invalid there.

https://godbolt.org/z/fh1MME7Ev

I do not think it's a particularly sound approach, though it does maintain correctness for the operations on the pointers that do not cross host/device barrier and don't care about the correctness of actual values of the address.

IMO clang's approach is conceptually correct. That said, the example above is also somewhat sensible -- we may want to specialize some host code based on what's available on device.

I guess the question is -- what's the right thing to do here? I'm not sold on NVCC's approach. I think it goes too far. Allowing a subset of address comparisons may be OK, but it would have to be a special case with caveats. We can probably come up with a portable way to implement the desired specialization using existing clang. E.g. we could use an explicit __device__ variable to hold the address of GPU-side function and use that variable address for the host-side specialization.

E.g. https://godbolt.org/z/Ez7o66rv7

This works for clang now, but, curiously, nvcc's host-side compilation seems to have trouble with the device-side variable shadows.

We could implement such a host-side shadow for device functions referenced in the host code. It would let us have the cake (use a host-side representative of the device-side pointer) and eat it (correctly handle the cases when someone attempts to use the value of such pointer which is unknown). WDYT?