parallel-runtimes / lomp

Little OpenMP Library
Apache License 2.0

WIP: Implement a serial version of the LOMP library #27

Closed mjklemm closed 2 years ago

mjklemm commented 3 years ago

@JimCownie This is not yet ready to be merged. I'm adding the PR, so that you can have a look at it.

JimCownie commented 3 years ago

Looks reasonable so far. It might be useful to change every use of thread_local to be inside a suitably named macro, so that in the single-threaded library case only the macro needs to be changed. (There it would expand to nothing...)

mjklemm commented 3 years ago

I have found one more instance of thread_local. We only have about three places in the library right now, so I think we can go without a macro that maps thread_local to nothing for now.

mjklemm commented 3 years ago

I guess a more important question would be if we should do further (static) optimizations by turning locks and barriers off completely.

JimCownie commented 3 years ago

> should we do further (static) optimizations by turning locks and barriers off completely.

That somewhat depends on what the intended use case is. If it's to allow OpenMP-compiled, dynamically linked code to exploit a single core more effectively, then such optimisations seem entirely reasonable. However, if it's to support profiling/simulation of a kernel, then removing all of the atomic operations may be misleading, and may encourage the "no need to optimize atomics; they never happen" mindset that comes from single-threaded simulation.

mjklemm commented 3 years ago

Indeed. We can always make that a configuration option. But that would make the test space too large to cover all the different configuration combinations.

JimCownie commented 3 years ago

I think leaving the atomics and so on in there is likely less work, and would generate a saner instruction trace. So I'd do that for now... (though it does also lead to more uninteresting code, such as memory allocation and lock initialisation, having to be executed).

mjklemm commented 3 years ago

Yes, but that's also reasonable overhead that you might want to see. So, let's leave the implementation as it is now and see if it's useful for the intended usage.