intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Deep access to local coordinates #1172

Closed paboyle closed 4 years ago

paboyle commented 4 years ago

Hello,

CUDA gives access to threadIdx.x and blockIdx deep in the application as thread-specific global variables, without passing them around from the top-level kernel as arguments.

This is semantically different from SYCL.

However, we noted internally that SYCL uses

__spirv_BuiltInGlobalInvocationId

There are two good reasons to expose and document accessors for invocationId.

i) Compatibility with CUDA semantics; otherwise I will have to globally rewrite 200k lines of code

ii) [even better] I am building my application with two levels of parallelism: a) loop level, b) vector level

On my SYCL prototype www.github.com/paboyle/GridBench I'm using compiler vector attributes to get good performance on both CPU and GPU.

Kernels that may be offloaded go through an internal "coalescedRead / coalescedWrite" interface.

When compiling for device (SYCL_DEVICE_ONLY), the API returns a scalar for "this" SIMD lane; when compiling for CPU, it returns the whole vector.

This way I write both vector SIMD and GPU SIMT code in a single source, which gives me SIMD vectorisation in host code and SIMT coalescing in device code.

That requires the coalescedRead function to know which SIMD lane it is to operate on, but only when the compile target is a GPU.

This is best done with implicit access to the SIMD lane rather than passing it down explicitly, as sketched below.
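To make the pattern concrete, here is a minimal sketch of the idea (the vec type and the SIMTlane helper are hypothetical stand-ins, not the actual GridBench code):

template <typename Scalar, int Nsimd>
struct vec { Scalar v[Nsimd]; };   // one SIMD vector with Nsimd lanes

int SIMTlane(int Nsimd);           // implicit lane query -- the subject of this issue

#ifdef SYCL_DEVICE_ONLY
// Device (SIMT): each work-item extracts only the lane it owns, so
// downstream code operates on a scalar and accesses coalesce.
template <typename Scalar, int Nsimd>
inline Scalar coalescedRead(const vec<Scalar, Nsimd> &x) {
  return x.v[SIMTlane(Nsimd)];
}
#else
// Host (SIMD): return the whole vector so the compiler can vectorise.
template <typename Scalar, int Nsimd>
inline vec<Scalar, Nsimd> coalescedRead(const vec<Scalar, Nsimd> &x) {
  return x;
}
#endif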

I gave a poster on this at the ECP conference in Houston in Feb.

AlexeySachkov commented 4 years ago

@Pennycook, @rolandschulz, @mkinsner: FYI

keryell commented 4 years ago

CUDA gives access to threadIdx.x and blockIdx deep in the application as thread-specific global variables, without passing them around from the top-level kernel as arguments.

This is semantically different from SYCL.

Yes, this is how the SYCL standard is designed, on purpose. SYCL is a portable, pure C++ executable DSL which can also be executed on any CPU with any plain C++ compiler. This is extremely powerful for debugging, for example running SYCL on a CPU with your normal development environment.

CUDA is a language similar to C++, with some magical extensions such as the one you mention. Unfortunately, this means it requires a specific compiler to implement them, and so you cannot execute CUDA directly on a CPU to debug it, or simply if you do not have an Nvidia GPU... :-(

However, we noted internally that SYCL uses

__spirv_BuiltInGlobalInvocationId

This is an implementation detail when you use SPIR-V. For another back-end it could be different. And actually, the one I work on uses intrinsic functions instead of intrinsic variables...

There are two good reasons to expose and document accessors for invocationId.

i) Compatibility with CUDA semantics; otherwise I will have to globally rewrite 200k lines of code

This is unfortunate...

ii) [even better] I am building my application with two levels of parallelism: a) loop level, b) vector level

You could also use SYCL hierarchical parallelism to address these two aspects.

On my SYCL prototype www.github.com/paboyle/GridBench I'm using compiler vector attributes to get good performance on both CPU and GPU.

If you are allowed to use compiler extensions, then you can probably define your own threadIdx.x and blockIdx as compiler extensions on top of SPIR-V intrinsics, which would work for this implementation when you use the SPIR-V back-end.

I gave a poster on this at the ECP conference in Houston in Feb.

Do you have a link to it?

paboyle commented 4 years ago

I can use compiler extensions, but I would prefer a guarantee that such an extension exists, even if it is compiler-dependent. I actually prefer an intrinsic function to an intrinsic variable on aesthetic grounds.

I'm trying to implicitly process vector types at the lowest level in the code, as one "extra" work-item per SIMD lane. This keeps the semantics the same in the rest of the code.

But SYCL (including hierarchical parallelism) forces me to pass variables down through many pre-existing layers of code.

Consequently I will define my own inline function int SIMTlane(void), under SYCL_DEVICE_ONLY or __CUDA_ARCH__, that wraps and hides:

a) SPIR-V intrinsics under Intel SYCL
b) threadIdx under CUDA
c) other implementations of SYCL, if necessary

... but I do very much wish there were at least a guarantee that such an intrinsic actually existed in each implementation of SYCL.
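A sketch of such a wrapper (illustrative only; the SPIR-V intrinsic is the implementation detail noted above and may change between compiler versions):

#if defined(SYCL_DEVICE_ONLY)
// Intel SYCL: SPIR-V intrinsic for the local invocation ID.
inline int SIMTlane(void) {
  return __spirv::initLocalInvocationId<3, cl::sycl::id<3>>()[2];
}
#elif defined(__CUDA_ARCH__)
// CUDA: the x thread index within the block plays the role of the SIMD lane.
inline int SIMTlane(void) { return threadIdx.x; }
#else
// Scalar host fallback: only one "lane".
inline int SIMTlane(void) { return 0; }
#endif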

paboyle commented 4 years ago

Poster:

www.ph.ed.ac.uk/~paboyle/ECP_2020.pdf

paboyle commented 4 years ago

Does the C++11

thread_local

specifier work with DPC++ kernels? If so, this could be used.

romanovvlad commented 4 years ago

thread_local is not allowed in device code. From the spec: "Variables with thread storage duration (thread_local storage class specifier) are not allowed to be odr-used in kernel code."

paboyle commented 4 years ago

thread_local is compliant C++ and, if not supported in device code, could at least be used to implement the host CPU version of a documented API call that returns the global work-item ID with the desired semantics.

It seems to me that you are presenting incomplete C++ support as a benefit, saying the semantics don't match GPUs when it suits, while making a virtue of SYCL's standard C++ nature.

Meanwhile, you do not provide the alternative that is semantically present in CUDA for the bits you leave out. This could be made an API call that works across both CPU and GPU: simply store the work-item ID in a thread_local variable accessed via an API function on the CPU, and give access via the same API function to a device special register on GPUs.
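A minimal sketch of what such an API call could look like (hypothetical names; the device branch maps to whatever built-in the back-end provides, e.g. the SPIR-V intrinsic noted earlier):

#include <cstddef>

thread_local std::size_t work_item_id;  // set by the CPU runtime per work-item

inline std::size_t get_global_work_item_id() {
#ifdef SYCL_DEVICE_ONLY
  // GPU: read the ID from the device built-in / special register.
  return __spirv::initGlobalInvocationId<3, cl::sycl::id<3>>()[2];
#else
  // CPU: read back the thread-local copy stored by the runtime.
  return work_item_id;
#endif
}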

keryell commented 4 years ago

Even inside the ISO C++ committee, thread_local is already a tough subject on CPUs, with a lot of problems.

The good news inside SYCL or CUDA kernels is that all the variables you declare normally are in the private address space, which means... thread_local from a SIMT perspective. :-)

Or do you need something else for your application?

paboyle commented 4 years ago

OK, then use pthread_setspecific / pthread_getspecific.

They have existed for decades.
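For illustration, the classic POSIX thread-specific-data pattern (a sketch with hypothetical names):

#include <pthread.h>
#include <cstdint>

static pthread_key_t id_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static void make_key() { pthread_key_create(&id_key, nullptr); }

// Stash the work-item ID for the calling thread.
void set_work_item_id(std::uintptr_t id) {
  pthread_once(&key_once, make_key);
  pthread_setspecific(id_key, reinterpret_cast<void *>(id));
}

// Query it from arbitrarily deep call sites.
std::uintptr_t get_work_item_id() {
  return reinterpret_cast<std::uintptr_t>(pthread_getspecific(id_key));
}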

paboyle commented 4 years ago

And thread_local is important because it can make a variable visible across multiple function scopes, which is different from an automatic local variable within a single function scope. You know that, so that was a nonsense reply.

keryell commented 4 years ago

I am unsure we are talking about the same thing.

bjoo commented 4 years ago

Hi, I brought this issue up in a discussion with @Pennycook and Jeongnim today, and I am posting this at @Pennycook's suggestion. I would like to add my voice to @paboyle's on this. Passing ID information down to lower functions is not always easy; the case of nested operator overloads that @paboyle noted is a good illustration. I anticipate I will have a similar need to @paboyle's in the near-to-intermediate future.

Once a parallel region has been launched, the thread, work-group and sub-group information must be available whatever the back-end (e.g. the SPIR-V intrinsic call @paboyle found in the case of the SPIR-V back-end, or threadIdx.x in a CUDA/HIP back-end). This information could be exposed in a standard way as a DPC++ extension to SYCL, although it may need compiler support, which would go against the SYCL approach that it has to be implementable as a standard C++ library, as earlier commenters, and @Pennycook on the call, have noted.

There are also other viewpoints to consider: would access to the ID information allow users to break things accidentally? My personal feeling is that if adding these calls extends current behaviour, existing SYCL/DPC++ code oughtn't to suffer. New code may, but the older ways of working would presumably still be available.

Another viewpoint is: would using these features on one architecture result in a lack of portability to other systems for any reason? My answer would be: let the user of the features beware.

On the point of portability: in the end, not all users are using SYCL/DPC++ as a way to achieve performance portability. Users may already have other existing back-ends targeting platforms where the programming model of choice is not DPC++/SYCL (e.g. CUDA back-ends for @paboyle, or for Kokkos and RAJA), and the use of SYCL/DPC++ may just be to develop yet another back-end for specific architectures where it is the primary available programming model. Looked at from this point of view, it makes sense to use documented extensions which are not standard SYCL at the moment, but which are needed for efficiency and/or a better software design. If these extensions make the various programming models used in the back-ends look similar, that helps the portability of the underlying software design; a case in point is the USM extension, which allows direct control of memory management and is similar to how that is done in CUDA and HIP. The fact that compiler support is needed to implement the extension may not be a primary issue for the end user, who often has to use a custom or customized compiler anyway.

So, to cut my rambling short: a +1 (up-vote) on thread-ID query functions from me. I would be happy to follow this discussion if it turned to how to implement these features (i.e. what the calls would look like, what the side effects could be, etc.).

paboyle commented 4 years ago

Hi,

I've committed a first cut of a branch of Grid that compiles through dpcpp:

https://github.com/paboyle/Grid/tree/sycl

This requires deep access to one of the local coordinates, obtained in:

https://github.com/paboyle/Grid/blob/sycl/Grid/tensors/Tensor_SIMT.h

Line 85:

#ifdef GRID_SYCL
accelerator_inline int SIMTlane(int Nsimd) { return __spirv::initLocalInvocationId<3, cl::sycl::id<3>>()[2]; }
#endif

Please document a way to access LocalInvocationId deep in the code.

rolandschulz commented 4 years ago

Yes, this is how the SYCL standard is designed, on purpose. SYCL is a portable, pure C++ executable DSL which can also be executed on any CPU with any plain C++ compiler. This is extremely powerful for debugging, for example running SYCL on a CPU with your normal development environment.

I don't think this is a reason not to add this feature. The only way to implement SYCL in pure C++ (no OpenMP, pthreads, ...) is to implement it on top of std::thread, and as @paboyle has written above, for std::thread it is possible to implement this feature on top of thread_local. You simply store the ID in a thread_local variable prior to starting the lambda for a specific ID. This works whether you have one thread per ID or fewer threads than IDs with a loop (the latter is only valid for range, but not nd_range). AFAIK thread_local works great for this use case.
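A sketch of that strategy (illustrative only, not the actual DPC++ host implementation):

#include <cstddef>
#include <thread>
#include <vector>

// ID of the work-item currently executed by this OS thread; readable
// from arbitrarily deep call sites inside the kernel lambda.
thread_local std::size_t this_id;

template <typename F>
void parallel_for_range(std::size_t n, std::size_t nthreads, F f) {
  std::vector<std::thread> pool;
  for (std::size_t t = 0; t < nthreads; ++t)
    pool.emplace_back([=] {
      // Fewer threads than IDs: each thread loops over a strided ID set
      // (valid for range, but not for nd_range with barriers).
      for (std::size_t i = t; i < n; i += nthreads) {
        this_id = i;  // store the ID before invoking the user lambda
        f(i);
      }
    });
  for (auto &th : pool) th.join();
}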

Can you clarify why such a use would be controversial? My understanding is that thread_local is problematic in the standard because it is unclear for which threads it is available; you can't use it for e.g. TBB tasklets. But I'm not aware of any spec-compliant implementation of std::thread for which thread_local causes any problems. The only realistic way to implement std::thread is to map it to OS threads, in which case thread_local is easy to do.

If you do implement it as a pure library on top of some other language extension or library (OpenMP, pthreads, TBB, ...), I'm not aware of any runtime which doesn't provide some mechanism to store or query an ID for each thread/tasklet/work-item, or whatever the concept of a worker is called in each of them. Therefore, in all these cases it is also possible to implement this feature.

If this is a much requested feature, clearly improves usability, and can be implemented, why shouldn't we add it to SYCL?

keryell commented 4 years ago

I read the discussion thread again and I think I was derailed by the thread_local discussion. In the same way that we have the std::this_thread and sycl::this_module namespaces, we could have a sycl::this_work_item namespace with some work-item-related introspection functions, such as the id or item or... Would it be possible to have an implementation which is efficient on non-GPU devices (CPU, FPGA) and which does not slow down the case where this namespace is not used? Should we have a special parallel_for for this? I would not like to repeat the current problem of barriers and nd_range on CPU...
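A hypothetical shape for such a namespace (illustrative only; not an actual SYCL API, and the names are made up):

#include <CL/sycl.hpp>

namespace sycl {
namespace this_work_item {
// Introspection for the calling work-item; free functions so that
// deeply nested code needs no handle passed down from parallel_for.
template <int Dims> id<Dims> get_id();
template <int Dims> item<Dims> get_item();
template <int Dims> nd_item<Dims> get_nd_item(); // nd_range launches only
} // namespace this_work_item
} // namespace sycl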

rolandschulz commented 4 years ago

In the same way that we have the std::this_thread and sycl::this_module namespaces, we could have a sycl::this_work_item namespace with some work-item-related introspection functions, such as the id or item or...

exactly

Would it be possible to have an implementation which is efficient on non-GPU devices (CPU, FPGA) and which does not slow down the case where this namespace is not used? Should we have a special parallel_for for this?

I think the only possible slowdown I'm aware of is that some implementations might want to store the ID info in some way as part of parallel_for, to speed up the ID query functions. But storing a single number should be fast enough that I don't think we need a special parallel_for for that.

Pennycook commented 4 years ago

I think the only possible slowdown I'm aware of is that some implementations might want to store the ID info in some way as part of parallel_for, to speed up the ID query functions. But storing a single number should be fast enough that I don't think we need a special parallel_for for that.

The fact that an implementation has to do something different here is reason enough to consider a different form of parallel_for, or at least an additional argument allowing the user to declare whether they're using this feature or not.

I've heard anecdotal evidence from some of the Kokkos developers that minor startup costs like these become significant at scale, and they've found that giving users the ability to declare when they're not using certain features is critical for enabling high performance.

paboyle commented 4 years ago

If you want my take:

http://david-grs.github.io/tls_performance_overhead_cost_linux/ https://www.akkadia.org/drepper/tls.pdf

An interesting blog post on how TLS is implemented - I had assumed it fell back to pthread_setspecific, but it looks better than that. The x86-64 ABI appears to reserve an integer register for a pointer to the thread descriptor (struct pthread / thread control block), and it's a second pointer indirection to get to the TLS region.

The statically linked case in particular looks OK, but subject to benchmarking of course.

It looks like the overhead is a one-deep pointer chase (dereference the TLS pointer from the TCB pointer in the predefined ABI register), then storing the work item at a fixed linker/ELF-determined offset relative to the TLS base pointer.

Probably not a big overhead unless dealing with very small kernels, and if SYCL uses a thread pool, these should live in the caches after a worker thread has done its first work item.

Pennycook commented 4 years ago

Sorry, I should have been clearer -- I'm deliberately not trying to focus on any one implementation here, but to think more generally about possible implementations for a wider set of devices.

Today, there is a semantic difference between parallel_for(range) and parallel_for(nd_range): a developer cannot call a barrier in the former, but can in the latter. Implementations can use that for optimizations. Such optimizations would be device-specific, but one can imagine what they might look like (e.g. different scheduling of work-items to hardware resources, skipping initialization of barrier variables and/or hardware, skipping initialization of TLS).

If a developer can construct something like an nd_item anywhere in their code, they can now call a barrier anywhere in their code. If a developer writes the following and it doesn't work, who is to blame?

// parallel_for(range) so barrier isn't supported
cgh.parallel_for(range<1>(N), [=](id<1>) {
  nd_item<1> it = this_nd_item<1>();
  it.barrier(); // user called a barrier anyway
});

If we want to enable implementations to optimize for certain cases and want users to have access to these global queries, there needs to be a clear contract between the implementation and user. That's all I'm saying.

Pennycook commented 4 years ago

@rarutyun has been working on an extension that adds these functions: the extension specification is here, and the implementation is being reviewed in https://github.com/intel/llvm/pull/2351.

bader commented 4 years ago

Addressed by #2351.