lcompilers / lpython

Python compiler
https://lpython.org/

Design of specific backends (GPU, OpenMP, Coarrays, etc.) #1996

Open · certik opened 1 year ago

certik commented 1 year ago

So far, most of the code that our backends (C, LLVM, WASM) generate does not depend on any third-party API; it just uses native operations (such as arithmetic) of the given platform, sometimes libc, and sometimes calls into our own runtime library.

We now have to figure out how to target backends that make heavy use of a custom third-party API (typically C) to do all the operations. Examples of such backends:

  • SymEngine symbolic backend
  • GPU and other accelerator backends
  • OpenMP
  • Coarrays
  • pthreads

The two approaches are:

  • We represent the operations in ASR, either with backend-specific nodes or with higher-level operations (a parallel do concurrent). Each backend then has to implement translating the operation to specific API calls (say OpenMP or SymEngine). Each backend has to reimplement it.
  • We do this translation as an ASR->ASR pass. The input is, say, do concurrent, and the output is ASR with specific calls to OpenMP (if OpenMP is used) or to a GPU API (if GPU offloading is used). All backends work with it.

We can use a combination of the two approaches, but the second approach is preferable: we can see what the code looks like after the transformation (of "do concurrent" into OpenMP or CUDA) and optionally apply more ASR->ASR passes that further optimize the code; we can use our verify() to check correctness; and all backends will work with no special support. A hand-written sketch of such lowered output is shown below.
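For concreteness, here is a hand-written sketch of the kind of code the second approach aims to produce (this is not actual compiler output; the outlined function chunk_body, the closure struct thread_arg_t, and the static chunking scheme are all illustrative assumptions). A parallel loop such as do concurrent (i = 1:n) a(i) = 2*b(i) is outlined into a function and replaced with explicit pthreads calls, one of the APIs from the list above:

```cpp
// Hypothetical lowered form of `do concurrent (i = 1:n) a(i) = 2*b(i)`,
// written by hand to illustrate the target of the ASR->ASR pass.
#include <pthread.h>
#include <cstdio>

constexpr int NTHREADS = 4;

// Closure record the pass would synthesize for the outlined loop body.
struct thread_arg_t {
    double *a, *b;
    int begin, end;  // half-open chunk [begin, end)
};

// The loop body, outlined into a function with pthread's calling convention.
void *chunk_body(void *p) {
    auto *arg = static_cast<thread_arg_t *>(p);
    for (int i = arg->begin; i < arg->end; i++)
        arg->a[i] = 2 * arg->b[i];
    return nullptr;
}

int main() {
    const int n = 1000;
    static double a[n], b[n];
    for (int i = 0; i < n; i++) b[i] = i;

    // These explicit calls are what the pass would insert in place of the loop.
    pthread_t threads[NTHREADS];
    thread_arg_t args[NTHREADS];
    int chunk = n / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        args[t] = {a, b, t * chunk, t == NTHREADS - 1 ? n : (t + 1) * chunk};
        pthread_create(&threads[t], nullptr, chunk_body, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], nullptr);

    printf("a[10] = %f\n", a[10]);  // expect 20.0
    return 0;
}
```

Once the program is in this form, any backend that can emit plain function calls (C, LLVM, or WASM with a threading runtime) can translate it with no special support, which is exactly the property argued for above.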

rebcabin commented 1 year ago

Yes: delaying commitments to API details to the last possible phase preserves options in the earlier phases. Once those commitments are made, all downstream phases are stuck with them :)


certik commented 1 year ago

I think we can always write an ASR->ASR pass that implements some feature using some low-level API. We don't have to use it, but when we do, the backends just work. The downside is possibly slower compilation. Later, we can "fuse" a pass with the backend (= implement it in the backend directly), as long as that can be done in a maintainable way. The ASR passes thus act as the quickest and cleanest way to get in all the features we need. Later, we can decide not to run some of them, and instead do the work directly in the backend, as a compilation-speed optimization. A toy sketch of such a pass follows.
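To make that concrete, here is a self-contained toy sketch of such a pass. This is not the real ASR API: the Stmt/ParallelLoop/RuntimeCall node names, the shape of the rewriter, and the emitted pthread calls are all made up for illustration. The pass walks a statement list and replaces each parallel-loop node with explicit runtime-call nodes that every backend can emit directly:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy statement tree standing in for ASR (illustrative only).
struct Stmt {
    virtual ~Stmt() = default;
    virtual void dump() const = 0;
};

// A call into a low-level runtime API, e.g. pthread_create.
struct RuntimeCall : Stmt {
    std::string fn;
    explicit RuntimeCall(std::string f) : fn(std::move(f)) {}
    void dump() const override { std::cout << "call " << fn << "()\n"; }
};

// Stands in for a `do concurrent` node.
struct ParallelLoop : Stmt {
    std::string body;
    explicit ParallelLoop(std::string b) : body(std::move(b)) {}
    void dump() const override { std::cout << "do concurrent { " << body << " }\n"; }
};

// The "ASR->ASR" pass: replace each ParallelLoop with explicit API calls;
// every other statement passes through unchanged.
std::vector<std::unique_ptr<Stmt>>
lower_parallel_loops(std::vector<std::unique_ptr<Stmt>> stmts) {
    std::vector<std::unique_ptr<Stmt>> out;
    for (auto &s : stmts) {
        if (dynamic_cast<ParallelLoop *>(s.get())) {
            out.push_back(std::make_unique<RuntimeCall>("pthread_create"));
            out.push_back(std::make_unique<RuntimeCall>("pthread_join"));
        } else {
            out.push_back(std::move(s));
        }
    }
    return out;
}

int main() {
    std::vector<std::unique_ptr<Stmt>> prog;
    prog.push_back(std::make_unique<ParallelLoop>("a(i) = 2*b(i)"));
    prog = lower_parallel_loops(std::move(prog));
    for (const auto &s : prog) s->dump();  // prints the two lowered calls
}
```

Running such passes eagerly keeps every backend trivial; "fusing" a pass into a backend later would mean matching the ParallelLoop node in the backend itself and emitting the same calls directly, trading maintainability for compilation speed.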