parallel-runtimes / lomp

Little OpenMP Library
Apache License 2.0
153 stars 17 forks source link

Implement missing entrypoints #23

Open JimCownie opened 3 years ago

JimCownie commented 3 years ago

Describe the bug The LOMP library is missing a non-trivial number of entrypoints which are in the LLVM runtime. Some of these are for backward compatibility, so may be ignored, and some for features which LOMP does not (yet) support (such as changing the team size dynamically at runtime, or nested parallelism). However it is useful to know what is missing!

To Reproduce Use shell tools (nm, grep, cut, [missing_symbols.txt](https://github.com/parallel-runtimes/lomp/files/6916609/missing_symbols.txt) sort, uniq -u) to get a list of entrypoints which are only in one library. Something like

$ nm ~/lomp/build/src/libomp.dylib | grep __kmpc | cut -f 3 -d ' ' | sort > lomp_symbols.txt
$ nm ~/software/clang-12.0.0/arm64/lib/libomp.dylib | grep __kmpc | cut -f 3 -d ' ' | sort > llvm_symbols.txt
$  cat lomp_symbols.txt llvm_symbols.txt | sort | uniq -u

The attached file shows the output for that set of commands. Some of the entries are likely errors in what the LLVM runtime is exporting, since they are C++ mangled names, similarly the complex10 ops (and atomic_{10,20}) represent support for X87 80b floats which is unnecessary on AArch64, and likely everywhere, in reality).

Expected behaviour LOMP should implement more of the LLVM interface.

Desktop (please complete the following information):

JimCownie commented 3 years ago

missing_symbols.txt

mjklemm commented 3 years ago

Some of these entrypoints should be available when the library is set to have the Intel-style entrypoints that clang does not use. -DLOMP_ICC_SUPPORT=[on|off] would do that.

Looking through the list, I think most of the ones that involve atomics can be auto-generated.

JimCownie commented 3 years ago

Yes, a combination of macros/templates should be able to do most of the atomic operations, though there are 3 different cases to consider: The type and operation are supported by std::atomic => use that. The operation is max/min, in which case a test and test&set style approach with CAS is required. (Check whether the value will cause a change before entering a CAS loop). The type or operation are not supported by std::atomic so a straightforward CAS implementation is required. (One could use the same style as for max/min, since some ops here may not need updates either, [logical bit ops, for instance, or FP add of numbers a long way apart], but it’s probably better just to do the obvious thing).

On 9 Aug 2021, at 06:51, Michael Klemm @.***> wrote:

Some of these entrypoints should be available when the library is set to have the Intel-style entrypoints that clang does not use. -DLOMP_ICC_SUPPORT=[on|off] would do that.

Looking through the list, I think most of the ones that involve atomics can be auto-generated.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/parallel-runtimes/lomp/issues/23#issuecomment-894966858, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7TVHPH64WL2IC366365DT35UERANCNFSM5BMSROBA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

-- Jim James Cownie @.***> Mob: +44 780 637 7146

mjklemm commented 3 years ago

Yes, a combination of macros/templates should be able to do most of the atomic operations, though there are 3 different cases to consider: The type and operation are supported by std::atomic => use that.

Well, that would mean that we'd need to construct a std::atomic object around an address that is passed into the __kmpc entrypoint. I might have missed it, but the std::atomic class does not have a constructor that would let us bind the atomic object to the memory address coming from the user code.

So, I guess that we cannot simply use std::atomic, but rather will have to roll our own CAS-style atomic operation.

The operation is max/min, in which case a test and test&set style approach with CAS is required. (Check whether the value will cause a change before entering a CAS loop). The type or operation are not supported by std::atomic so a straightforward CAS implementation is required. (One could use the same style as for max/min, since some ops here may not need updates either, [logical bit ops, for instance, or FP add of numbers a long way apart], but it’s probably better just to do the obvious thing).

I think I'd go with the easy way for now and then later specialize the operations that we can specialize for different architectures.

JimCownie commented 3 years ago

We are already making use of std::atomic for some operations, and CAS for others. See

define expandInlineBinaryOp(type, typetag, op, optag, reversed) \

void __kmpcatomic##typetag##_##optag(ident_t , int , type target, \ type operand) { \ std::atomic t = (std::atomic )target; \ t op## = operand; \ } and

define expandCasBinaryOp(type, typetag, mutator, optag, reversed)

(which is too long to quote here!)

Similarly, we’re already optimising the max/min via expandCasCheckedOp

So adding the missing functions shouldn't to be a big job. It’s probably just that some of the types have got missed out in the FOREACH_ macros in atomics.cc http://atomics.cc/.

On 10 Aug 2021, at 10:05, Michael Klemm @.***> wrote:

Yes, a combination of macros/templates should be able to do most of the atomic operations, though there are 3 different cases to consider: The type and operation are supported by std::atomic => use that.

Well, that would mean that we'd need to construct a std::atomic object around an address that is passed into the __kmpc entrypoint. I might have missed it, but the std::atomic class does not have a constructor that would let us bind the atomic object to the memory address coming from the user code.

So, I guess that we cannot simply use std::atomic, but rather will have to roll our own CAS-style atomic operation.

The operation is max/min, in which case a test and test&set style approach with CAS is required. (Check whether the value will cause a change before entering a CAS loop). The type or operation are not supported by std::atomic so a straightforward CAS implementation is required. (One could use the same style as for max/min, since some ops here may not need updates either, [logical bit ops, for instance, or FP add of numbers a long way apart], but it’s probably better just to do the obvious thing).

I think I'd go with the easy way for now and then later specialize the operations that we can specialize for different architectures.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/parallel-runtimes/lomp/issues/23#issuecomment-895860320, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7TVCSWPNXAZK6FSDR6FDT4DTW3ANCNFSM5BMSROBA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

-- Jim James Cownie @.***> Mob: +44 780 637 7146

mjklemm commented 3 years ago

I think that the way this code works is broken. The C++ spec says:

Note: The representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. —end note

So, it's not guaranteed that sizeof(std::atomic<T>) == sizeof(T), which is needed for the above code to work.

JimCownie commented 3 years ago

If we want to be paranoid about that we can easily enough put in an assertion to check…

Without that property it seems to me that we can't use std::atomic at all, for any of the reduction code. Even the CAS based reductions are using std::atomic on a value which has been allocated by the compiler elsewhere, so that can’t be relied on. We’d have to use our own asm code, which is a big portability issue.

So my preference is definitely to check, either with assertions in the code (which a decent compiler should be able to remove at compile time) :-

include

include

bool checkIntSize() { assert(sizeof(int) == sizeof(std::atomic)); return true; } => checkIntSize(): mov eax, 1 ret

Or, add a test that just checks them all directly.

I think the assertions are fine, so will add them.

On 10 Aug 2021, at 11:42, Michael Klemm @.***> wrote:

I think that the way this code works is broken. The C++ spec says:

Note: The representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. —end note

So, it's not guaranteed that sizeof(std::atomic) == sizeof(T), which is needed for the above code to work.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/parallel-runtimes/lomp/issues/23#issuecomment-895923671, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALY7TVHTQJ5OSSBG5KNVREDT4D7BBANCNFSM5BMSROBA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email.

-- Jim James Cownie @.***> Mob: +44 780 637 7146