microsoft / verona

Research programming language for concurrent ownership
https://microsoft.github.io/verona/
MIT License
3.58k stars 165 forks

Provide more information about Verona's plans for parallelism #82

Closed nepp2 closed 4 years ago

nepp2 commented 4 years ago

I know Verona is designed for concurrency, but I'm trying to understand what Verona's ambitions are regarding parallelism. The only reference I could find to this in the repository was a link to this example: https://github.com/microsoft/verona/blob/master/testsuite/demo/run-pass/fib.verona

This is an interesting example, but I'm not sure it answers my main question: does Verona intend to tackle high-performance data parallelism? For example, supporting operations like parallel iteration to accelerate a simple loop over a flat array of numerical data.

Verona's strategy is apparently to provide one extremely powerful, safe model of concurrency, instead of Rust's approach of hiding unsafe code behind safe interfaces so that things like concurrency can be implemented as libraries. Although Verona seems to have many of the same abstractions that Rust library authors use to build libraries like Rayon, I believe these libraries generally require some unsafe code at their core.

Are Verona's concurrency primitives expected to be sufficient to implement a high-performance data parallelism library like Rayon?

Thanks!

davidchisnall commented 4 years ago

The Verona region model does provide some interesting aliasing information that makes some of these things easier, but it is not sufficient for sub-object parallelism. We are interested in adding a synchronous (probably lexically scoped) fork-join parallel model over arrays, but this currently hasn't gone any further than a whiteboard.

There is some interesting composition with the region abstraction. For example, if you have a parallel loop iterating over an array of isos (unique pointers to region sentinel objects), then the Verona type system guarantees that it is safe to operate on every array element independently. You can do this without anything special by pulling the child regions out of the array (updating the pointers) and passing them off for concurrent execution in a when clause.
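
Verona syntax is still in flux, so here is a rough Rust analogue of that pattern, with `Box<Vec<i32>>` standing in for an iso pointer to a child region and `thread::spawn` standing in for a when clause (all names here are illustrative, not Verona API):

```rust
use std::thread;

// Hypothetical analogue: moving each Box out of the outer collection
// transfers sole ownership of that "region" to one thread, so every
// element can be mutated concurrently without locks.
fn process_independently(parts: Vec<Box<Vec<i32>>>) -> Vec<Box<Vec<i32>>> {
    let handles: Vec<_> = parts
        .into_iter()
        .map(|mut part| {
            thread::spawn(move || {
                // Sole owner of this "region": safe to mutate freely.
                for x in part.iter_mut() {
                    *x *= 2;
                }
                part
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The key property is the same one the region model gives statically: once an element has been moved out, nothing else can reach it, so no synchronisation is needed.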

One of the open questions (for me, at least) is to what extent that abstraction is already sufficient for offloading phases of the computation to GPUs or similar accelerators (some of that will come down to design patterns). The Verona asynchronous model should be very good for hiding latency when offloading things to accelerators.

nepp2 commented 4 years ago

It sounds like there would be far too much overhead if every element of this array was a unique pointer to a float, for example. But I suppose you could theoretically implement a parallel array internally as something like array[iso array[T]] (I don't know the real syntax, but hopefully this is clear). Then you would tune the size of the innermost array depending on the size of T and the cost of each iteration. Is that the sort of thing you have in mind?
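
The chunked layout described above can be sketched in Rust using disjoint mutable slices in place of iso chunks (the function name and chunk size are illustrative): `chunks_mut` guarantees the slices don't overlap, so each chunk can be handed to its own thread, and the chunk size plays the role of the tunable innermost-array size.

```rust
use std::thread;

// Sketch of "array of iso chunks": one flat buffer split into disjoint
// chunks, each processed by a separate scoped thread. No element is ever
// visible to two threads at once.
fn parallel_scale(data: &mut [f32], chunk_size: usize, factor: f32) {
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_size) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= factor;
                }
            });
        }
    });
}
```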

Also, although being able to offload to a GPU is great, my understanding is that there are still many parallel tasks best handled by a CPU threadpool, which I assume Verona already uses for processing when clauses. If that's true, perhaps fast parallel iteration could already be prototyped.

davidchisnall commented 4 years ago

If it's a float (or any other primitive type), then it's fairly simple to infer aliasing information. This kind of parallel execution typically becomes more complex when you have complex object graphs that you want to operate on in parallel. For anything that uses core-level parallelism on the CPU, we have some fast paths in the runtime that make this kind of pattern very quick using the when syntax.

That typically isn't sufficient to get the best performance out of a modern CPU, though: you also need to use the vector instructions. Parallel iteration over an array of primitives maps trivially to a vector, but with recent ISAs that support vector scatter/gather you can also do this with arrays of pointers, provided you can prove that the things pointed to by the array elements don't alias. The region abstraction gives this guarantee for free if the array elements are iso (nothing reachable from n[i] can alias anything reachable from n[j] if n is Array[T & iso] and i != j). This makes auto-vectorisation a lot easier.

I am somewhat less interested in autovectorisation than I am in providing tools that allow programmers to deterministically and safely apply vectorisation (see: autovectorization is not a programming model). Autovectorisation works (in very high-level terms, with a load of oversimplifications) by running some analyses to determine whether computations are independent, then checking whether the instruction mix required to keep them in vector registers is likely to beat the scalar version, and then applying a transform to make them run in adjacent vector lanes. The big problem with this approach is that the analyses typically depend on heuristics in the compiler and vary from release to release, so there's no guarantee that a loop vectorised by compiler release N will still be vectorised by release N+1. This isn't necessarily worse than any other non-reproducible compiler output problem, except that vectorisation can give a 4-8x speedup on a critical loop, so that unpredictable compiler behaviour may be the difference between your program being fast and your program being unusable.

Sorry, that reads like a long digression, but there was a point: autovectorisation depends on being able to prove non-aliasing guarantees and on a cost model. If we can express the non-aliasing guarantees in the source-level type system (as Fortran does), then it is always possible for the compiler to vectorise code. That leaves only the cost model to determine whether you should vectorise. You can always fall back to letting the compiler try to figure it out if you don't care, but if you do care then you are in a position to take explicit control. OpenMP does this for C/C++/Fortran, but without any integration in the type system: #pragma omp simd says 'I, the programmer, assert that nothing in this loop aliases in a way that would prevent vectorisation', and if the programmer is incorrect then exciting undefined behaviour results.

The open question is what is the minimum that we can do in the language to make it possible to safely express all of this in libraries (without needing unsafe code).

nepp2 commented 4 years ago

Thanks for the detailed response! Is Verona's type system not already able to express a guarantee that two pointers don't alias? Presumably you just need a primitive for splitting up array slices, like fn split_at_index<T>(slice : &mut[T], i : usize) -> (&mut[T], &mut[T]).

Or does the when clause need full ownership instead of a Rust-style &mut? I suppose that would make it harder to split the array up, because now you are trying to divide up the ownership of a single allocation.

> If we can express the non-aliasing guarantees in the source-level type system (as Fortran does), then it is always possible for the compiler to vectorise code.

Are there not some other requirements too? Presumably if I annotate a loop to say that I want it to be executed with SIMD, and all of the data involved has non-aliasing guarantees, I could still do lots of things that would prevent the use of any SIMD instructions? Looking at Rust's SIMD libraries, it does seem to be a bit tricky even when the aliasing is resolved.

davidchisnall commented 4 years ago

The Verona type system handles ownership at a region level, not an object level. This is how you can have cyclic data structures (or simply DAGs) without requiring unsafe code. Rust handles ownership at an object level, so the only way to express a DAG is via unsafe code (hidden, for example, inside the implementation of the Rc type).

In Verona, your mut parameters would need some region qualification. If you have an array of primitives, then any disjoint index sets are guaranteed non-aliasing. There's some great work by some folks at TJ Watson on adding index sets as a primitive in Java that I'd like to borrow from for this. If, however, you have an array of mut references, then they must (by the Verona type system) all refer to objects in the same region. Although two disjoint slices in the array don't alias, the objects that they refer to might. This means that you can't vectorise a loop that operates on such an array, because writes through n[i] may be visible to reads through n[j].
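
The hazard with an array of same-region references can be shown with a Rust analogue (an assumed stand-in, not Verona semantics), using shared `Rc<RefCell<...>>` handles in place of mut references into one region: two distinct indices can still name the same object, so element-wise operations are not independent.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Two distinct array slots holding handles to the *same* object:
// a write through a[0] is observable through a[1], so a loop over
// this array cannot safely be vectorised or run in parallel.
fn aliasing_demo() -> i32 {
    let shared = Rc::new(RefCell::new(0));
    let a = vec![Rc::clone(&shared), Rc::clone(&shared)];
    *a[0].borrow_mut() = 42;
    let observed = *a[1].borrow(); // sees the write made via a[0]
    observed
}
```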

If, however, you have an array of isos, then you have much stronger guarantees. These are references to the sentinel objects of their regions (the sentinel dominates all live objects in the region and is the only object in the region that can have a pointer from outside the region pointing to it). Because regions form a tree structure, anything reachable from n[i] is not reachable (at arbitrary levels of indirection) from n[j][1]. This means that it is safe to run any operation on n[j] in any order (including in parallel / interleaved) with operations on n[i], except for the scheduling of additional when blocks (which may define an ordering of externally visible operations).
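
A Rust analogue of that stronger guarantee (again illustrative, with `Box<Node>` standing in for an iso): because each `Box` is a unique pointer and ownership forms a tree, nothing reachable from parts[i] can overlap with anything reachable from parts[j], at any depth, so whole subtrees can be processed in parallel with no further checks.

```rust
use std::thread;

// Box<Node> as an "iso": unique ownership of the whole subtree means
// the object graphs under two distinct elements cannot alias.
struct Node {
    value: i32,
    children: Vec<Box<Node>>,
}

fn sum(n: &Node) -> i32 {
    n.value + n.children.iter().map(|c| sum(c)).sum::<i32>()
}

// Each subtree is moved to its own thread; the type system rules out
// any cross-subtree sharing, so no synchronisation is required.
fn parallel_sums(parts: Vec<Box<Node>>) -> Vec<i32> {
    let handles: Vec<_> = parts
        .into_iter()
        .map(|n| thread::spawn(move || sum(&n)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```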

The latter property is quite interesting because it is not limited to arrays: any arbitrarily complex data structure that can present iso pointers has this property. A locally_ordered qualifier on function calls, guaranteeing (in a statically checkable way) that they don't schedule any asynchronous work that would make execution order in the current thread visible to other executions, may be enough to allow any such loop to be parallelised or vectorised (@sylvanc can probably tell me if I'm missing something), and it could be completely inferred within a module. It's not clear to me how much of a burden this would impose on programmers, because does-not-use-when now becomes part of the API guarantee, so if a function is modified to do some work asynchronously then the compilation of all its callers may need to change.

I don't believe that this is something that we'll want to do for version 1 of Verona (unless someone is really enthusiastic about targeting GPUs or vector units early), but it's something that we want to keep in mind so that we don't make it unduly difficult to add later.

[1] More or less. You can have a weak pointer, but this can be turned into a strong pointer only if you can also present the region that contains the object, and this can't be done by following the path. You can have a (weak or strong) pointer to a cown that contains the parent, but this can be used only to schedule additional behaviours via a when clause.

nepp2 commented 4 years ago

Again, thank you for the thorough, insightful answer. I have a much clearer picture of where things stand at the moment, and which avenues might be worth exploring.