lcompilers / lpython

Python compiler
https://lpython.org/

Design of specific backends (GPU, OpenMP, Coarrays, etc.) #1996

Open · certik opened 1 year ago

certik commented 1 year ago

So far, most of the code that our backends (C, LLVM, WASM) generate does not depend on any third-party API; it just uses native operations (such as arithmetic) of the given platform, sometimes libc, and sometimes calls into our own runtime library.

We now have to figure out how to target backends that make heavy use of a custom third-party API (typically C) to do all the operations. Examples of such backends:

  • SymEngine symbolic backend
  • GPU and other accelerator backends
  • OpenMP
  • Coarrays
  • pthreads

The two approaches are:

  • We represent the operations in ASR, either with backend-specific nodes or with higher-level operations (a parallel do concurrent). Each backend then has to implement translating the operation to specific API calls (say OpenMP or SymEngine). Each backend has to reimplement it.
  • We do this translation as an ASR->ASR pass. The input is, say, do concurrent, and the output is ASR with specific calls to OpenMP (if OpenMP is used) or to a GPU API (if GPU offloading is used). All backends work with it.

We can use a combination of the two approaches, but the second approach is preferable: we can see what the code looks like after the transformation (of "do concurrent" into OpenMP or CUDA) and optionally apply more ASR->ASR passes that further optimize the code; we can use our verify() to check correctness; and all backends will work with no special support. A hand-written sketch of such lowered output is shown below.
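For concreteness, here is a hand-written sketch of the kind of code the second approach aims to produce (this is not actual compiler output; the outlined function chunk_body, the closure struct thread_arg_t, and the static chunking scheme are all illustrative assumptions). A parallel loop such as do concurrent (i = 1:n) a(i) = 2*b(i) is outlined into a function and replaced with explicit pthreads calls, one of the APIs from the list above:

```cpp
// Hypothetical lowered form of `do concurrent (i = 1:n) a(i) = 2*b(i)`,
// written by hand to illustrate the target of the ASR->ASR pass.
#include <pthread.h>
#include <cstdio>

constexpr int NTHREADS = 4;

// Closure record the pass would synthesize for the outlined loop body.
struct thread_arg_t {
    double *a, *b;
    int begin, end;  // half-open chunk [begin, end)
};

// The loop body, outlined into a function with pthread's calling convention.
void *chunk_body(void *p) {
    auto *arg = static_cast<thread_arg_t *>(p);
    for (int i = arg->begin; i < arg->end; i++)
        arg->a[i] = 2 * arg->b[i];
    return nullptr;
}

int main() {
    const int n = 1000;
    static double a[n], b[n];
    for (int i = 0; i < n; i++) b[i] = i;

    // These explicit calls are what the pass would insert in place of the loop.
    pthread_t threads[NTHREADS];
    thread_arg_t args[NTHREADS];
    int chunk = n / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        args[t] = {a, b, t * chunk, t == NTHREADS - 1 ? n : (t + 1) * chunk};
        pthread_create(&threads[t], nullptr, chunk_body, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], nullptr);

    printf("a[10] = %f\n", a[10]);  // expect 20.0
    return 0;
}
```

Once the program is in this form, any backend that can emit plain function calls (C, LLVM, or WASM with a threading runtime) can translate it with no special support, which is exactly the property argued for above.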

rebcabin commented 1 year ago

Yes: delaying commitments to API details to the last possible phase preserves options in the earlier phases. Once those commitments are made, all downstream phases are stuck with them :)


certik commented 1 year ago

I think we can always write an ASR->ASR pass that implements some feature using some low-level API. We don't have to use it, but when we do, the backends just work. The downside is possibly slower compilation. Later, we can "fuse" a pass with the backend (= implement it in the backend directly), as long as that can be done in a maintainable way. The ASR passes thus act as the quickest and cleanest way to get in all the features we need. Later, we can decide not to run some of them, and instead do the work directly in the backend, as a compilation-speed optimization. A toy sketch of such a pass follows.
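To make that concrete, here is a self-contained toy sketch of such a pass. This is not the real ASR API: the Stmt/ParallelLoop/RuntimeCall node names, the shape of the rewriter, and the emitted pthread calls are all made up for illustration. The pass walks a statement list and replaces each parallel-loop node with explicit runtime-call nodes that every backend can emit directly:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy statement tree standing in for ASR (illustrative only).
struct Stmt {
    virtual ~Stmt() = default;
    virtual void dump() const = 0;
};

// A call into a low-level runtime API, e.g. pthread_create.
struct RuntimeCall : Stmt {
    std::string fn;
    explicit RuntimeCall(std::string f) : fn(std::move(f)) {}
    void dump() const override { std::cout << "call " << fn << "()\n"; }
};

// Stands in for a `do concurrent` node.
struct ParallelLoop : Stmt {
    std::string body;
    explicit ParallelLoop(std::string b) : body(std::move(b)) {}
    void dump() const override { std::cout << "do concurrent { " << body << " }\n"; }
};

// The "ASR->ASR" pass: replace each ParallelLoop with explicit API calls;
// every other statement passes through unchanged.
std::vector<std::unique_ptr<Stmt>>
lower_parallel_loops(std::vector<std::unique_ptr<Stmt>> stmts) {
    std::vector<std::unique_ptr<Stmt>> out;
    for (auto &s : stmts) {
        if (dynamic_cast<ParallelLoop *>(s.get())) {
            out.push_back(std::make_unique<RuntimeCall>("pthread_create"));
            out.push_back(std::make_unique<RuntimeCall>("pthread_join"));
        } else {
            out.push_back(std::move(s));
        }
    }
    return out;
}

int main() {
    std::vector<std::unique_ptr<Stmt>> prog;
    prog.push_back(std::make_unique<ParallelLoop>("a(i) = 2*b(i)"));
    prog = lower_parallel_loops(std::move(prog));
    for (const auto &s : prog) s->dump();  // prints the two lowered calls
}
```

Running such passes eagerly keeps every backend trivial; "fusing" a pass into a backend later would mean matching the ParallelLoop node in the backend itself and emitting the same calls directly, trading maintainability for compilation speed.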