idris-lang / Idris2


Structured concurrency primitives #2527

Open iacore opened 2 years ago

iacore commented 2 years ago

Summary

Since Idris2 models effects as monads, it's possible to evaluate independent terms independently.

Ideally, the OS should provide the means to do this easily. It is theoretically possible to add hardware support for massive threading too, but we don't have that yet.

Motivation

The desire to make the compiler decide when to run code in parallel.

The proposal

Let's say you want to run main fast

foo : Nat -> Nat -- expensive computation

main : IO ()
main = do
    let a = foo 0
    let b = foo 1
    let c = foo a
    let d = foo b
    putStrLn $ show $ c == d

If the backend has support for threading, running main sequentially is a waste of time.

Modern C compilers are good at rewriting code to produce fewer instructions. I think it's reasonable to expect the compiler to parallelize code too.
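For comparison, here is roughly what a hand-parallelized version could look like with today's primitives. This is only a sketch, assuming the Chez backend where fork and System.Concurrency's Channel are available; par2 and compute are made-up helper names, and foo gets a placeholder body so the sketch compiles. Note that because Idris is strict, the pure computation has to be explicitly delayed so that it runs on the forked thread rather than at the call site.

import System.Concurrency

-- Hypothetical helper: run two IO actions on separate threads and wait
-- for both results.
par2 : IO a -> IO b -> IO (a, b)
par2 x y = do
  chX <- makeChannel
  chY <- makeChannel
  _ <- fork (x >>= channelPut chX)
  _ <- fork (y >>= channelPut chY)
  resX <- channelGet chX
  resY <- channelGet chY
  pure (resX, resY)

-- Hypothetical helper: delay a pure computation so it is evaluated only
-- when the resulting action is run, i.e. on whichever thread runs it.
compute : Lazy a -> IO a
compute x = pure () >>= \_ => pure x

foo : Nat -> Nat -- expensive computation
foo n = n + 1    -- placeholder body, just so the sketch compiles

main : IO ()
main = do
  (a, b) <- par2 (compute (foo 0)) (compute (foo 1))  -- independent
  (c, d) <- par2 (compute (foo a)) (compute (foo b))  -- independent
  putStrLn $ show $ c == d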

Actually, somebody has already made a massively parallel runtime for functional languages: https://github.com/Kindelia/HVM

Technical implementation

Use HVM

From HVM's readme, it looks like it does not have FFI.

For it to work, we need to

  1. Add FFI to HVM
  2. Make codegen for HVM

Existing Backends

A simple implementation would let the backend implement the following functions.

run_in_parallel : (Lazy a, Lazy b) -> (a, b)

||| This Monad can be forked
interface Forkable (m : Type -> Type) where
  fork : (m a, m b) -> m (a, b)

Forkable IO where -- backend-specific impl. here

According to the type signature, both results are required, so if one thread dies unexpectedly, we should kill all the threads immediately.
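A possible library-level implementation for IO, sketched on top of the existing fork and Channel primitives on the Chez backend; the method is called fork2 below only to avoid clashing with Prelude's fork. It waits for both results but does not implement the kill-on-failure semantics described above.

import System.Concurrency

interface Forkable (m : Type -> Type) where
  fork2 : (m a, m b) -> m (a, b)

Forkable IO where
  fork2 (x, y) = do
    ch <- makeChannel
    _ <- fork (x >>= channelPut ch)  -- run the first action on a new thread
    resY <- y                        -- run the second action on this thread
    resX <- channelGet ch            -- wait for the forked result
    pure (resX, resY)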

Then the backend should use a user-space scheduler (like in Erlang or Pony) to spread the work across a limited number of OS threads.

We still have to decide whether to fork at a given junction. This probably requires calculating the approximate runtime of a function.

New Pragmas

Sometimes the compiler can get the decision wrong, so we need pragmas for manual control.

Suggestions:

%parallel_affinity never
%parallel_affinity always

foo : () -- do something here
-- set the fork overhead to the runtime of `foo` when deciding whether to fork
%fork_overhead foo
%fork_overhead 0
%fork_overhead Infinity

Alternatives considered

Conclusion

Since the Idris language doesn't have strict semantics for sequential execution (unlike C), it is possible to let the compiler and backend parallelize code automatically. I think we should make it so for Idris.

buzden commented 2 years ago

Why do you think these things should be primitives of the language? I mean, both concurrency (without parallelism) and parallelism (even without concurrency) are very complex things, and I feel they should be managed by separate libraries (actually, pretty complex ones, if we look at such things in the rest of the FP world, say, Haskell or Scala).

There is no unique right way to do this, just as there is no unique set of concurrent or parallel combination functions (say, you consider a function like m a -> m b -> m (a, b); why don't you consider its variant m a -> m b -> m (Either a b), known as race?). There is a bunch of research on how concurrent or parallel computations can exchange data or synchronise predictably, and implementing this would also require making contentious decisions, which is okay for a library but seems wrong for the compiler itself.
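For illustration, a race of that shape can already be sketched as a library on top of the existing fork and Channel primitives; this naive version never cancels the losing thread, which is exactly one of those contentious decisions a library would have to settle.

import System.Concurrency

-- Naive library-level race: return whichever action finishes first.
-- The losing thread keeps running; a real library must decide whether
-- and how to cancel it.
race : IO a -> IO b -> IO (Either a b)
race x y = do
  ch <- makeChannel
  _ <- fork (x >>= channelPut ch . Left)
  _ <- fork (y >>= channelPut ch . Right)
  channelGet ch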

Moreover, the compiler already provides FFI facilities for sending anything backend-specific there. Also, do-syntax (probably with the addition of some macro facilities) is flexible enough to let libraries reorder computations.

So, if you think these facilities are not sufficient to implement concurrency or parallelism libraries, then let's improve them on the compiler side. That is definitely a responsibility of the compiler. But I think this particular (controversial) set of primitives, at the moment, is not. What do you think?

iacore commented 2 years ago

We can use a library if we want manual control over which concurrency style to use where.

This issue is focused on providing structured concurrency, where a spawned task cannot be interrupted and must be waited on to get its result. As I said, to do this automatically, we need the backend to measure or approximate the runtime of different functions.
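The "spawn, then wait for the result" shape itself can already be sketched on top of today's fork and Channel primitives; what this issue asks for is the automatic part, plus some guarantee that every task really is awaited (Task, spawn and await below are hypothetical names).

import System.Concurrency

-- Hypothetical handle for a spawned task: the only way to observe the
-- result is to await it, which blocks until the task has finished.
data Task : Type -> Type where
  MkTask : Channel a -> Task a

spawn : IO a -> IO (Task a)
spawn act = do
  ch <- makeChannel
  _ <- fork (act >>= channelPut ch)
  pure (MkTask ch)

await : Task a -> IO a
await (MkTask ch) = channelGet ch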

One key point of this issue is to make existing programs run faster automatically. In my experience, we humans seldom understand computer performance without measuring. Ideally, the compiler sits between the OS and a running program and optimizes it through experiment and performance measurement.

Then, the compiler can choose when and where to fork.

It may be possible that in the future we will have L1/L2-style hardware-accelerated algorithm caches. Today, we can manually add custom instructions to a softcore CPU, and we have accurate causal profilers. It's not a dream to let the OS decide. The catch is that we cannot feed the OS linear instructions, but rather a control-flow graph (marking independent code paths).

The example Idris code in the proposal would be compiled to something like this:

IO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~x~~~~~
                                               ^
0 -> foo >-> foo >-> | (==) | >-> show >-> printStrLn
1 -> foo >-> foo >-> |      |

Example with independent effects (maybe to control 2 different peripherals):

effect_a ~x~~~~~~~~~~~~~~~~~~~~~x~~
          ^                     ^  
        |do_a| >-> indep. >-> |do_a|
        |    | >-> do_b >---> |    |
                    v           
effect_b ~~~~~~~~~~~x~~~~~~~~~~~

("waiting for a task" can be translated cleanly using rendezvous/blocking-style RPC like in seL4; the above program need two threads. Think how you would write this using a library for concurrency, compared to the OS handling this transparently.)

So why don't we have this feature today? I think it's influenced by the idea that a computer runs a program in sequence; I don't know of any "fork" instruction in a mainstream ISA (like RISC-V). Even if we did have this feature, programs written in C would be impossible to parallelize, since a(b, c) means b -> c -> a, and side effects don't like to be interleaved.

iacore commented 2 years ago

More about independent effects

Here are some ways to represent causality explicitly when multiple effects are present at once.

Causality via entangled effects

The concept of "entangled" effects is important when deciding effect independence. It is not yet in Idris.

In the diagrams below, E means the two effects are entangled (not independent).

Algorithm 1:

sys ~~~~~~E~~x~~~~~~~~~~~~~x~~~~
 console  \~~|~~~~~x~~~~~~~|~~~~
             ^     ^       ^
           open >--|---> close
     "Hello" >-> print

Algorithm 2:

sys ~~~~~~~~~x~~~~~~~~~~~~~x~~~~
console ~~~~~|~~~~~x~~~~~~~|~~~~
             ^     ^       ^
           open >--|---> close
     "Hello" >-> print

In Algorithm 1, the handler of the console effect may produce the sys effect, and the order of open, print, close is fixed. Their runtimes may still overlap in time, though.

See the frame-time diagram below. Note that even though the functions open and print overlap in time, print still happens strictly after open in terms of causality. The implementation (in the handler) of open may process its input before the syscall and process its output after it. The compiler should understand this and split open into three parts (pure-effectful-pure) for optimization.

sys(call)      [ ][ ][ ]
                |  |  |
handler       [   ]|[   ]
of `console`    |  |  |
                |  |  |
algorithm    [open]|[close]
                [print]

In Algorithm 2, open and close may run on one thread, and print may run on another. The drawback is that the handler of console must produce a "different" sys effect. Different instances of the same effect type have no causal ordering between them, so the backend is free to parallelize this. Needing two copies of the same type of effect isn't a problem for "making syscalls on Linux", but some effects cannot be forked (see Forkable above).

I hope it is clearer now why I would like the OS and the compiler to be integrated enough to understand concurrency.

Recap

Using notation from Koka (though Koka doesn't have the concept of "entangled" effects), where sys means the effect of making syscalls and console means writing to the console:

<sys> means 1 thread making syscalls (1 logical thread, but the backend may use all CPU cores)
<sys, sys> means 2 threads making syscalls
<sys, console> means 1 thread making syscalls and 1 thread writing to the console
<s: sys, Entangled s console> means 1 thread making syscalls and writing to the console (sequentially with the syscalls it made)

Causality via phantom value

This is another way to represent causality explicitly. It is like a phantom lifetime marker in Rust, like a semaphore but at the language level, or like atomic orderings (acquire/release) but more macroscopic.

Normally, if function A's output is function B's input, then B happens after A.

However, sometimes we need a function with no input to happen after a function with no output. Or maybe we just want one to happen after another.

The >....> arrow does not carry data.
It only influences causality analysis.

sys ~~~~~~~~~x~~~~~~~~~~~~~~~~~~~~~x~~~~
console ~~~~~|~~~~~~~~~~~~~x~~~~~~~|~~~~
             ^             |       ^
        | open | >---------|---> close
        |      | >...      |           
                    .      ^
                    ..> |print|
             Hello" <-> |     |

This "phantom value" is already present in Idris2 as linear type, but there is no way to stack two effects in parallel automatically.

Causality via order in source code :rofl:

This is what C uses: if an expression is to the left of another, it (logically) happens first.

To run code in parallel, you need to use a library like libpthread or the C++ Actor Framework. This is not "automatic".

ProofOfKeags commented 2 years ago

While parallelism can be done automatically, I'm fairly certain that anything in IO cannot be automatically converted to a concurrent system, as there is no way to analyze the independence of effects as you describe.