Lyphion closed this 2 months ago
After further investigation, it turns out that the correctness of the translation depends on the compiler and the hardware used. With Nvidia tools and hardware the translation is correct. With Intel the result doesn't match the expected one. For that reason, the proposed translation should go into the experimental section.
Hey @Lyphion -- do you mind sharing which Intel compiler you tried? Thanks
I'm a bit swamped these days -- but I'll try to work on this when I have some time.
All my tests are done with Fortran.
This was just an idea. If you like it but don't have much time, I could also draft an implementation.
Hello,
I'm not sure about this proposal. According to the OpenACC spec for loop construct / seq:
2153 2.9.5 seq clause
2154 The seq clause specifies that the associated loop or loops are to be executed sequentially by the
2155 accelerator. This clause will override any automatic parallelization or vectorization.
However, a !$omp loop bind(thread) would parallelize the loop construct over the threads, and that would not honor the OpenACC semantics of the original code.
The example you posted works because the parallel region does not spawn threads (or workers in OpenACC jargon). However, what if threads/workers are spawned? Not sure that the translation using your suggestion would be valid.
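To make the concern concrete, here is a minimal Fortran sketch, not taken from the original post. The program name, arrays, and sizes are hypothetical. It uses a loop with a loop-carried dependency, which is the case where seq matters most. The OpenMP loop construct states that iterations may execute concurrently, so the second loop is not guaranteed to give the serial result, even if a particular compiler happens to produce it.

```fortran
program seq_dependency
  implicit none
  integer, parameter :: n = 8   ! hypothetical problem size
  integer :: i
  real :: a(n), b(n)

  a = 1.0
  b = 1.0

  ! OpenACC original: seq guarantees in-order execution, so the
  ! loop-carried dependency on a(i-1) is respected.
  !$acc parallel num_gangs(1) copy(a)
  !$acc loop seq
  do i = 2, n
     a(i) = a(i) + a(i-1)
  end do
  !$acc end parallel

  ! Questioned translation: the loop construct tells the compiler that
  ! iterations may run concurrently, which this dependency does not allow.
  ! The outcome here depends on the compiler and is not guaranteed.
  !$omp target map(tofrom: b)
  !$omp loop bind(thread)
  do i = 2, n
     b(i) = b(i) + b(i-1)
  end do
  !$omp end target

  print *, 'seq result:  ', a
  print *, 'loop result: ', b
end program seq_dependency
```

Compiled without directive support, both loops run serially and print the same running sum. Any difference can only show up once the directives are honored.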
I know that this is more of a shortcut or hack. As I already mentioned, it doesn't work on all platforms for that reason. But in some instances it really helps with performance, and in the case of the Nvidia compiler it prints the same debug log when compiling. Converting an outer sequential loop into an OpenMP construct would require spawning a new kernel on each iteration, which hurts performance.
Thanks for investigating my idea. The OpenMP and OpenACC documentation is a bit confusing and open to interpretation in some parts.
If you are skeptical about it, we can leave it as it is and I'll refactor my code on my side without tool support.
I've been thinking about the topic and discussing it with some colleagues. I think that the appropriate solution would be to translate the !$acc loop seq into a no-op (currently it is translated as !$omp loop, which is wrong). Basically, !$acc loop seq prevents a loop from being parallelized by the OpenACC compiler -- so it shall be run serially by a given thread.
Sorry if this does not align with your expectations, but this should be the most semantically equivalent translation.
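As an illustration of the no-op translation, here is a small Fortran sketch with an outer sequential loop and an inner parallel loop inside a single compute region. All names and sizes are hypothetical. The surrounding OpenMP constructs (target parallel and do) are an assumption for this example and not necessarily what the tool emits. The only point is that the outer loop carries no OpenMP directive at all.

```fortran
program seq_as_noop
  implicit none
  integer, parameter :: n = 1000, nsteps = 10   ! hypothetical sizes
  integer :: i, step
  real :: u(n), v(n)

  u = 0.0
  v = 0.0

  ! OpenACC original: the outer loop stays sequential inside one
  ! compute region, the inner loop is parallelized.
  !$acc parallel copy(u)
  !$acc loop seq
  do step = 1, nsteps
     !$acc loop gang vector
     do i = 1, n
        u(i) = u(i) + 1.0
     end do
  end do
  !$acc end parallel

  ! OpenMP sketch: "loop seq" becomes a no-op, so the outer loop has no
  ! directive. Every thread runs the outer loop itself, and only the
  ! inner loop is shared among the threads.
  !$omp target parallel map(tofrom: v) private(step)
  do step = 1, nsteps
     !$omp do
     do i = 1, n
        v(i) = v(i) + 1.0
     end do
     !$omp end do
  end do
  !$omp end target parallel

  print *, 'results match: ', all(u == v)
end program seq_as_noop
```

Because the outer loop stays inside a single target region, no new kernel is launched per iteration. The barrier at the end of the worksharing loop keeps the steps in order.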
I totally agree with you about the solution. In my own testing I also tried translating it into a no-op, and it works well enough for me. The user must keep in mind that all instructions between the outer sequential loop (!$acc loop seq) and an inner parallel one are most likely run by all threads, so nothing should be calculated or saved there.
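The caveat about statements between the two loops can be sketched as follows. This is a hypothetical Fortran example: the variables t and dt and the use of single are assumptions for illustration, not something the tool generates. An unguarded update of shared data at that spot would be executed by every thread, so the sketch restricts it to one thread.

```fortran
program between_loops
  implicit none
  integer, parameter :: n = 1000, nsteps = 4   ! hypothetical sizes
  real, parameter :: dt = 0.5
  integer :: i, step
  real :: t, u(n)

  t = 0.0
  u = 0.0

  !$omp target parallel map(tofrom: u, t) private(step)
  do step = 1, nsteps
     ! This spot is reached by every thread. Without the guard, all
     ! threads would update the shared variable t at the same time.
     !$omp single
     t = t + dt
     !$omp end single
     ! The implicit barrier after single makes the new t visible to all.

     !$omp do
     do i = 1, n
        u(i) = u(i) + t
     end do
     !$omp end do
  end do
  !$omp end target parallel

  print *, 'final t = ', t, ' u(1) = ', u(1)
end program between_loops
```

Where possible, it is simpler to keep such updates out of the region entirely or to compute the value from the loop index, so that each thread holds its own private copy.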
I'd like to thank you again for checking and researching. Your tool and feedback really helped me.
I finally implemented the "no translation" for acc loop seq. Sorry if that does not match your initial expectations, but I think this is the way to go.
Currently OpenMP doesn't support the OpenACC loop seq construct, and no direct translation is present/possible. A possible translation could be to use the bind(thread) clause instead. According to this paper and my own tests, the following code snippets produce correct results with comparable performance.
OpenACC:
OpenMP:
For better transparency, a feature flag would be useful and appropriate.