Possible translation for OpenACC loop seq

intel / intel-application-migration-tool-for-openacc-to-openmp

OpenACC* to OpenMP* API assisting migration tool

BSD 3-Clause "New" or "Revised" License

Possible translation for OpenACC loop seq #24

Closed Lyphion closed 2 months ago

Lyphion commented 5 months ago

OpenMP currently has no direct equivalent of the OpenACC loop seq clause, so the tool has no direct translation for it. One possible translation is the loop construct with a bind(thread) clause. According to this paper and my own tests, the following code snippets produce correct results with comparable performance.

OpenACC:

!$acc parallel
!$acc loop seq
do j = 1, n
!$acc loop
  do i = 1, n
    b(i) = b(i) / j + a(i,j)
  end do
end do
!$acc end parallel

OpenMP:

!$omp target teams
!$omp loop bind(thread)
do j = 1, n
!$omp loop
  do i = 1, n
    b(i) = b(i) / j + a(i,j)
  end do
end do
!$omp end target teams

To keep the behavior transparent, this translation could be placed behind a feature flag.

Lyphion commented 5 months ago

After further investigation, it turns out that the correctness of the translation depends on the compiler and the hardware used. With the Nvidia tools and hardware, the translation is correct. With Intel, the result doesn't match the expected one. For that reason, the possible translation should go into the experimental section.

hservatg commented 5 months ago

Hey @Lyphion -- would you mind sharing which Intel compiler you tried? Thanks

I'm a bit swamped these days -- but I'll try to work on this when I have some time.

Lyphion commented 5 months ago

All my tests are done with Fortran.

ifx 2024.1.2 or 2024.2 for Intel (the old Fortran Compiler Classic doesn't support my hardware)
nvfortran 24.3 for Nvidia

This was just an idea. If you like it but don't have much time, I could also draft an implementation.

hservatg commented 5 months ago

Hello,

I'm not sure about this proposal. According to the OpenACC spec for loop construct / seq:

2.9.5 seq clause
The seq clause specifies that the associated loop or loops are to be executed sequentially by the accelerator. This clause will override any automatic parallelization or vectorization.

However, !$omp loop bind(thread) would parallelize the loop across the threads, and that would not honor the OpenACC semantics of the original code.

The example you posted works because the parallel region does not spawn threads (or workers in OpenACC jargon). However, what if threads/workers are spawned? I'm not sure the suggested translation would still be valid in that case.

Lyphion commented 5 months ago

I know that this is more of a shortcut or hack. As I already mentioned, it doesn't work on all platforms for that reason. But in some instances it really helps with performance, and with the Nvidia compiler it prints the same debug log when compiling. Converting an outer sequential loop into an OpenMP construct would require spawning a new kernel on each iteration, which hurts performance.

Thanks for investigating my idea. The OpenMP and OpenACC documentation is a bit confusing and open to interpretation in some parts.

If you are skeptical about it, we can leave it as it is and I'll refactor the code on my side without tool support.

hservatg commented 4 months ago

I've been thinking about the topic and discussing it with some colleagues. I think the appropriate solution would be to translate !$acc loop seq into a no-op (currently it is translated as !$omp loop, which is wrong). Basically, !$acc loop seq prevents a loop from being parallelized by the OpenACC compiler -- so it is run serially by a given thread.

Sorry if this does not align with your expectations, but this should be the most semantically equivalent translation.
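
For illustration, here is a minimal sketch of what the no-op translation would look like when applied by hand to the example from the first comment. It is derived from the snippets in this thread, not taken from the tool's output. The program name, the test data, the map clauses, the private(j) clause and the self-check against a serial reference are all assumptions added to make it a complete program. Whether the offloaded version gives correct results on a particular compiler and device is not something this sketch establishes.

```fortran
program loop_seq_noop
  implicit none
  integer, parameter :: n = 8
  real :: a(n, n), b(n), ref(n)
  integer :: i, j

  ! Arbitrary test data (assumption, not from the thread).
  do j = 1, n
    do i = 1, n
      a(i, j) = real(i + j)
    end do
  end do
  b = 1.0
  ref = 1.0

  ! Serial reference result for the self-check below.
  do j = 1, n
    do i = 1, n
      ref(i) = ref(i) / j + a(i, j)
    end do
  end do

  ! No-op translation: the outer loop, formerly "!$acc loop seq",
  ! carries no OpenMP directive at all. Only the inner loop keeps
  ! "!$omp loop". The map and private(j) clauses are defensive
  ! additions (assumptions), not part of the snippets in the thread.
  !$omp target teams map(to: a) map(tofrom: b) private(j)
  do j = 1, n
    ! Statements placed here are outside any loop directive, so each
    ! team executes them on its own.
    !$omp loop
    do i = 1, n
      b(i) = b(i) / j + a(i, j)
    end do
  end do
  !$omp end target teams

  ! Compare against the serial reference with a relative tolerance.
  if (maxval(abs(b - ref)) > 1.0e-5 * maxval(abs(ref))) error stop "mismatch"
  print *, "ok"
end program loop_seq_noop
```

Compiled without OpenMP support, the directives are plain comments and the program runs serially, so the check passes trivially. It only becomes meaningful when built with offloading enabled for the target device.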

Lyphion commented 4 months ago

I totally agree with you about the solution. For my own testing I also tried translating it into a no-op, and it worked well enough for me. The user must keep in mind that all instructions between the outer sequential loop (!$acc loop seq) and an inner parallel one are most likely run by all threads, so nothing should be calculated or saved there.

I'd like to thank you again for checking and researching. Your tool and feedback really helped me.

hservatg commented 2 months ago

I finally implemented the "no translation" (no-op) for !$acc loop seq. Sorry if that does not match your initial expectations, but I think this is the way to go.