Compile-time definitions to detect used SYCL targets

al42and commented 2 years ago

Is your feature request related to a problem? Please describe

One or more targets can be passed to -fsycl-targets. For the program being compiled, it can be beneficial to know at compile time which targets were used. In my use case, different flavors of a kernel are used for different architectures (NVIDIA, Intel). If a certain architecture is not among the targets, one can skip compiling the corresponding flavor.

An additional benefit is the ability to filter out outright incompatible devices early (sycl::is_compatible is more robust, but does not appear to be working at the moment: #5561).

Describe the solution you would like

Have a compile-time macro for each target marking whether kernels are expected to be compiled for this target.

Describe alternatives you have considered

Add a constexpr aspect/flag to sycl::backend.

romanovvlad commented 2 years ago

@al42and Could you please provide an example of the code which uses such a compile-time macro?

al42and commented 2 years ago

Here is a specific discussion we had with @anton-v-gorshkov about our use case: https://gitlab.com/gromacs/gromacs/-/merge_requests/2248/diffs#note_743987749

The code there might be overly convoluted for historical reasons. But, in essence, something like this is attempted:

template<int subGroupSize>
void submitKernel(sycl::queue q, sycl::global_ptr<float> data, int size) {
    // Submit the kernel which uses sub-group functionality
    // Kernel is complex and takes a long time to compile
}

void doStuff(sycl::device dev, sycl::queue q, sycl::global_ptr<float> data, int size) {
    switch(getVendor(dev)) {
        case Vendor::Nvidia:
            #if HAVE_NVIDIA
            return submitKernel<32>(q, data, size);
            #else
            assert(false); // Don't instantiate the template for 32, don't waste time compiling it.
            #endif
        case Vendor::Intel:
            #if HAVE_INTEL
            return submitKernel<16>(q, data, size);
            #else
            assert(false);  // Don't instantiate the template for 16, don't waste time compiling it.
            #endif
    }
}

EDIT: Subgroup size is the most obvious example. There might be other differences, e.g. whether to manually prefetch some values.

EDIT2: As a workaround for faster compilation here, one can do an early return in the kernel (e.g., if (defined(__NVPTX__) && subGroupSize != 32)). But that does not help with other issues, like filtering out incompatible devices early.
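The early-return workaround from EDIT2 could be sketched roughly as below. This is a minimal illustration in plain C++, not the GROMACS code: flavorIsNeeded and kernelBody are hypothetical names, and the only compiler macro assumed is __NVPTX__, which is set during the NVPTX device pass. On the host and in other device passes every flavor is kept, which is exactly why this trick cannot help host code decide which targets were enabled.

```cpp
// Hypothetical helper: does the current compilation pass need this flavor?
// During the NVPTX device pass only the 32-wide flavor is useful; in every
// other pass (host included) we conservatively keep all flavors.
template<int subGroupSize>
constexpr bool flavorIsNeeded() {
#if defined(__NVPTX__)
    return subGroupSize == 32;
#else
    return true;
#endif
}

// Hypothetical stand-in for the body of an expensive kernel.
template<int subGroupSize>
int kernelBody(int value) {
    if constexpr (!flavorIsNeeded<subGroupSize>()) {
        // Unneeded flavor for this pass: compile a trivial stub instead.
        return 0;
    } else {
        // Placeholder for the real, slow-to-compile computation.
        return value * subGroupSize;
    }
}
```

With if constexpr inside a template, the discarded branch is not instantiated, so the expensive part is skipped for flavors the current pass does not need.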

elizabethandrews commented 2 years ago

@al42and just to make sure I understand the requirement: do you want these macros set during host compilation? For device compilation we have existing macros like __NVPTX__, etc.

al42and commented 2 years ago

> @al42and just to make sure I understand the requirement: do you want these macros set during host compilation? For device compilation we have existing macros like __NVPTX__, etc.

Yes, I specifically want to check in the host code which offload architectures are enabled.

elizabethandrews commented 2 years ago

@al42and can the type trait any_device_has<aspect> be used for this purpose? It is defined in the SYCL 2020 spec as follows:

The implementation also provides two traits that the application can use to query aspects at compilation time. The trait any_device_has inherits from std::true_type if the compilation environment supports any device which has the specified aspect, and it inherits from std::false_type if no device has the aspect. The trait all_devices_have inherits from std::true_type if all devices supported by the compilation environment have the specified aspect, and it inherits from std::false_type if any device does not have the aspect.

We are considering adding an extended aspect for each device type. For example, we might define aspects "aspect::ext_oneapi_intel_gpu" and "aspect::ext_oneapi_nvidia_gpu". The application could then look like this:

void doStuff(sycl::device dev, sycl::queue q, sycl::global_ptr<float> data, int size) {
    switch(getVendor(dev)) {
        case Vendor::Nvidia:
          if constexpr (sycl::any_device_has_v<sycl::aspect::ext_oneapi_nvidia_gpu>) {
            return submitKernel<32>(q, data, size);
          } else {
            assert(false); // Don't instantiate the template for 32, don't waste time compiling it.
          }
        case Vendor::Intel:
          if constexpr (sycl::any_device_has_v<sycl::aspect::ext_oneapi_intel_gpu>) {
            return submitKernel<16>(q, data, size);
          } else {
            assert(false);  // Don't instantiate the template for 16, don't waste time compiling it.
          }
    }
}
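
For readers who want to try the dispatch pattern without a SYCL toolchain, here is a self-contained sketch in plain C++. Everything in the standin namespace is a mock written for this illustration, not the real sycl API, and the two aspects are only the proposed names from the comment above. The mock pretends that only an Intel GPU target was enabled. Unlike the snippet above, each branch returns explicitly, since assert(false) compiles to nothing under NDEBUG and control would otherwise fall through to the next case.

```cpp
#include <type_traits>

namespace standin {  // mock for illustration only, NOT the real sycl namespace
enum class aspect { ext_oneapi_nvidia_gpu, ext_oneapi_intel_gpu };

// Default: no enabled target has the aspect.
template<aspect A> struct any_device_has : std::false_type {};
// Assumption for this sketch: only an Intel GPU target was enabled.
template<> struct any_device_has<aspect::ext_oneapi_intel_gpu> : std::true_type {};

template<aspect A>
inline constexpr bool any_device_has_v = any_device_has<A>::value;
}  // namespace standin

enum class Vendor { Nvidia, Intel };

// Placeholder for the expensive kernel; returns the flavor it was built for.
template<int subGroupSize>
int submitKernel() { return subGroupSize; }

// Returns the sub-group size used, or -1 if that flavor was not built.
inline int doStuff(Vendor vendor) {
    switch (vendor) {
        case Vendor::Nvidia:
            if constexpr (standin::any_device_has_v<standin::aspect::ext_oneapi_nvidia_gpu>) {
                return submitKernel<32>();
            } else {
                return -1;
            }
        case Vendor::Intel:
            if constexpr (standin::any_device_has_v<standin::aspect::ext_oneapi_intel_gpu>) {
                return submitKernel<16>();
            } else {
                return -1;
            }
    }
    return -1;  // unknown vendor
}
```

With the mock specialization above, doStuff(Vendor::Intel) dispatches to the 16-wide flavor, while doStuff(Vendor::Nvidia) reports that the flavor is unavailable.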

elizabethandrews commented 2 years ago

After discussions with the team, the consensus is that we will be implementing the macros as an extension.

al42and commented 2 years ago

@elizabethandrews, the solution with sycl::any_device_has is indeed more elegant and sycl-esque than macros. As far as I can tell, it solves my problem perfectly.

That said, macros are ok too.

elizabethandrews commented 2 years ago

I believe macros help avoid all compile-time overhead and offer more flexibility in some cases, so there is some sentiment to support them as well. Users can then choose whatever best suits their application.

al42and commented 7 months ago

Hi!

Any progress on this?

I see that the oneMKL project also has to manually parse the compiler flags in CMake to get the list of targets, and this is, to be honest, not a pretty solution.
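
Until such macros exist, one possible stopgap is to have the build system forward the -fsycl-targets value to the host code as a string and inspect it at compile time. The sketch below assumes a hypothetical macro APP_SYCL_TARGETS defined by the build system (for example via -D); the compiler itself does not provide it, and the default value here is only an example. The helper targetListHasPrefix is likewise a made-up name.

```cpp
#include <string_view>

// Assumption: the build system defines APP_SYCL_TARGETS with the same
// comma-separated value it passes to -fsycl-targets. Example default only.
#ifndef APP_SYCL_TARGETS
#define APP_SYCL_TARGETS "spir64,nvptx64-nvidia-cuda"
#endif

// Returns true if any entry of the comma-separated list starts with prefix.
constexpr bool targetListHasPrefix(std::string_view list, std::string_view prefix) {
    while (!list.empty()) {
        const auto comma = list.find(',');
        const std::string_view entry = list.substr(0, comma);
        if (entry.substr(0, prefix.size()) == prefix) {
            return true;
        }
        if (comma == std::string_view::npos) {
            break;  // that was the last entry
        }
        list.remove_prefix(comma + 1);
    }
    return false;
}

// Compile-time flags usable with `if constexpr` in host code.
constexpr bool haveNvidia = targetListHasPrefix(APP_SYCL_TARGETS, "nvptx64");
constexpr bool haveIntel  = targetListHasPrefix(APP_SYCL_TARGETS, "spir64");
```

This still relies on the build system duplicating the flag, so it only moves the parsing out of CMake rather than removing the need for compiler-provided definitions.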

intel/llvm, issue #5562