keean / zenscript

A trait based language that compiles to JavaScript
MIT License

Parallelism #41

Open shelby3 opened 5 years ago

shelby3 commented 5 years ago

Unfortunately this discussion started in the Error handling thread.

And the discussion derived from a post I made about C no longer being a low-level language, because of its mismatch with the reality that all scaling must now come from parallelism.

I hope that discussion will continue here instead.

How can we ever find any of our thousands of comments of past discussion if they are buried in unrelated threads?

shelby3 commented 5 years ago

@keean wrote:

Actually you have that backwards, there is more information about parallelism available at runtime. This is why VLIW processors failed to outperform x86.

VLIW was based on the idea that the compiler could do more to extract parallelism from the code than the CPU, so they made a highly parallel CPU with many execution units that could run multiple instructions in parallel.

You mentioned this VLIW comparison and then VLSI/VHDL parallelism in our past discussion in the Concurrency thread. And ironically you made the same conflation of “low and high-level parallelism” which I will be correcting you on again herein.

I agree there’s more information at runtime relative to a compile-time approach that is parallelizing the same algorithm.

But your thought process is incorrectly applied to what I wrote. You conflated low-level runtime with high-level compile-time. You must compare apples to apples. Your point is true for low-level runtime versus low-level compile-time.

I did not write “more information”. I wrote “more semantic information”. The distinction apparently means nothing to you but it should.

The “more information” at runtime is limited in its character by the semantics expressed in the code. The higher-level semantic information enables the programmer (not the compiler!) to make changes in the algorithm which are impossible for the low-level speculative pipelining parallelism to achieve, because the low-level transformations cannot change the high-level algorithm. A high-level algorithm designed for parallelism can obtain orders-of-magnitude gains in performance; whereas the low-level speculative pipelining achieves much less than an order-of-magnitude increase, and that increase is not scaling with Moore’s law over time because the CPU designers have hit a limitation of nature which can’t scale. Advocating the futility of fighting against nature is inane, idiotic, myopic, etc.

In addition to my slamdunk point above, your point about runtime information can also be applied by a compiler+runtime at a higher semantic level, which the low-level pipelining can’t do. For example, with pure functions a map operation can parallelize the operation over all elements, but the decision about whether it should or not depends on the ratio of the overhead of parallelization to the runtime cost of the operation. So the runtime could measure this ratio and decide whether to parallelize the map. A compiler could estimate it, but for cases near the margin a runtime could measure it.
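To make that concrete, here is a minimal sketch in Go (illustrative only; the 1 µs threshold and the single-element sampling are my assumptions, not a tuned heuristic) of a map that measures the cost of one element at runtime and only spawns goroutines when the work appears to dwarf the spawn overhead:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// adaptiveMap applies f to every element of in. It samples the cost of one
// element at runtime and only parallelizes when that cost appears to exceed
// the overhead of spawning goroutines (the threshold below is a guess).
func adaptiveMap(in []int, f func(int) int) []int {
	out := make([]int, len(in))
	if len(in) == 0 {
		return out
	}

	start := time.Now()
	out[0] = f(in[0]) // sample the per-element cost
	perElem := time.Since(start)

	if perElem < time.Microsecond { // too cheap: stay serial
		for i := 1; i < len(in); i++ {
			out[i] = f(in[i])
		}
		return out
	}

	var wg sync.WaitGroup // expensive enough: parallelize the rest
	for i := 1; i < len(in); i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = f(in[i])
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	double := func(x int) int { return x * 2 }
	fmt.Println(adaptiveMap([]int{1, 2, 3, 4}, double))
}
```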

Another tangential point is that when you claim it’s too difficult for programmers to extract parallelism, you are ignoring the points made in the ACM article that extant PLs lack the proper designs to extract the parallelism that they plausibly could without too much programmer intervention, such as the aforementioned pure function example. But my point has also been that ingenious programmers will find even more clever algorithms for parallelism.

That ACM article argued that C is no longer a low-level language because, lacking these facilities for the compiler to automatically extract higher-level parallelism from the code (parallelism which the low-level speculative pipelining cannot extract), CPU designers are forced to create virtual, microcoded pipelining hardware inside the CPU which C does not even expose to the programmer. IOW, parallelism is the only way to continue scaling per Moore’s law, and C encourages CPU designers to instead not scale and to waste most of the transistors on squeezing diminishing incremental improvements out of serial performance. That is a recipe for disaster, with the future of computing becoming stagnant and technology which relies on computing failing to improve. Thus it behooves us as PL designers to address this issue.

Essentially they stripped the speculative out-of-order hardware out of the CPU and used that space for extra execution units. This relied on the compiler to produce the parallel instruction bundles. What they found was it could only outperform the x86 on very specific scientific computation tasks, and failed to be faster for general computation.

Because they were running serial algorithms. Nowhere have I claimed that running serial algorithms on parallel hardware will be faster than running serial algorithms on speculative pipelining. Thus you shouldn’t write this irrelevant information which doesn’t even apply to my point.

My point was about the need to parallelize our performance critical code paths, otherwise our programs will not scale. This is a statement of fact, which none of your obfuscating BS has refuted.

I go with evidence over speculation every time.

Me too: I will also choose speculative parallelism because of the evidence, but at a higher semantic level, which is for example how 3-SAT is made practical for some working sets.

And you will go with evidence? Then why do you ignore the evidence that future improvements in serial speedups are dying? And why am I repeating this for the 4th or 5th time?

I simply can’t believe you’re this slow-minded, because you’ve demonstrated to me in many instances that you have a sharp intellect. Why the cognitive dissonance here? Is it because you lack divergent thinking? Is it some obstinacy issue? Are you preoccupied? Are you being lazy and not really thinking it out, relying on past conclusions you had made instead of reanalyzing them in light of my points?

keean commented 5 years ago

@shelby3

3-SAT is made practical for some working sets.

And 3-SAT is not general computation; it is a very specific algorithm for a class of problems, and it looks very different from general computation.

The general part of most programs is an event loop that queues sequential processes that necessarily follow one another, or are connected by streams in the pipelined case.

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I agree that a new language has to handle parallelism. There are two ways I would do this: one is vectorisation, so high-level operations like matrix multiplication can use SSE vector instructions for performance; the other is 'goroutine'-like parallel tasks communicating by streamed channels.
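For the second style, a minimal Go sketch (illustrative; the stage functions are made up) of parallel tasks connected by streamed channels:

```go
package main

import "fmt"

// generate streams the integers 1..n into a channel.
func generate(n int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for i := 1; i <= n; i++ {
			out <- i
		}
	}()
	return out
}

// square reads from in, squares each value, and streams the results onward.
func square(in <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for v := range in {
			out <- v * v
		}
	}()
	return out
}

func main() {
	// The two stages run as parallel tasks connected by a streamed channel.
	for v := range square(generate(5)) {
		fmt.Println(v)
	}
}
```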

However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.

shelby3 commented 5 years ago

@keean wrote:

The general part of most programs is a event loop that queues sequential[ly ordered] processes, that necessarily follow one another, or are [analogously sequentially ordered by being] connected by streams in the [Unix shell paradigm] pipelined case.

ftfy.

Sequentially ordered != sequential process. There may be parallelism within the individual processes, each of which is sequentially ordered. I know you know that nature is fractal, so there can be parallelism within serialism and vice versa.

Also the events in the event loop which are in the performance critical path are typically not sequentially ordered and can be performed in any order or even in parallel in some cases. The GUI events which are ordered are typically not performance critical (we do not need to scale the performance of GUIs by orders-of-magnitude).

The point is, I do not disagree with any of your examples of parallelism, and we don't have to wait for new hardware, we can run these algorithms on the GPU already.

I have not looked at what the limitations of existing GPU designs are. Seems that the ones built into the CPU don’t have the fast GBs of dual-ported main memory that dedicated GPU cards have? And what are the issues with exchanging work between CPU threads and GPU threads? Are the cost barriers high enough that the two processors pretty much can only send messages to each other? I know you’re reasonably knowledgeable about the extant hardware designs, so I will defer to your knowledge on this.

But yeah, maybe our compiler should be doing more to leverage that GPU (and also the CPU’s multicore) hardware which often sits idle.

I agree that a new language has to handle parallelism.

I think my recent point can be recapitulated as: the more parallelism we can push for in code, the more incentive CPU designers will have to prioritize high-level parallelism instead of low-level pipelining, hopefully resulting in scaling trending back toward Moore’s-law progress.

My point is we should do all we can to reduce dependence on the low-level speculative pipelining in our design decisions. For example, you argued in the Error Handling thread that exceptions as return values would not incur an extra performance cost because of the low-level speculative pipelining. But that design decision, if taken, would have further cemented the need for low-level speculative pipelining and helped kill future scaling. And upon further analysis of your proposal, independent of the concerns with reliance on speculative pipelining, it became clear we didn’t need to accept that design decision.

There are two ways I would do this: one is vectorisation, so high-level operations like matrix multiplication can use SSE vector instructions for performance; the other is 'goroutine'-like parallel tasks communicating by streamed channels.

Not just parallelism: we also need to diminish our reliance on the model of main memory as having latency low enough to keep up with a serial processor, because that model requires expending the transistor budget on complex, deleterious caching mechanisms.

The ACM article also points out other things we can improve, for example reducing the reliance on cache coherency, which is necessary in order to enable hardware designers to prioritize parallel computation units instead of the very costly (inefficient!) model of low-latency main memory.

For example, that ACM article pointed out that our PL could allow the compiler+runtime to reformat arrays of structures into structures of arrays, without intervention from the programmer (see the layout sketch after the quote below). Also I wrote:


Let me quote what I wrote in my original post:

So the author of the article explained that instead of complex caches (and to accommodate the large working sets that the Transputer didn’t) we perhaps need an architecture where we have so many threads running that the latency to shared memory can be offset by computation these threads have to do in between accesses to main memory such that the CPU is always fully utilized. IOW, while some threads are waiting on main memory, other threads are running computation.

Note AFAIK that most GPU cards include a lot of shared memory with a very fast, wide bus. So perhaps they’re already close to what is optimal even for large working sets.
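Here is the array-of-structures versus structure-of-arrays distinction referenced above, as a minimal Go sketch (the Particle type and field names are purely illustrative); the SoA layout lets a compiler or runtime stream one field contiguously, which is friendlier to caches and vector units:

```go
package main

import "fmt"

// Array of structures (AoS): each particle's fields are interleaved in memory.
type Particle struct {
	X, Y, Z, Mass float64
}

// Structure of arrays (SoA): each field is stored contiguously, so a loop
// over just Mass touches one dense array.
type Particles struct {
	X, Y, Z, Mass []float64
}

// sumMassAoS strides over X, Y and Z to reach each Mass.
func sumMassAoS(ps []Particle) float64 {
	total := 0.0
	for _, p := range ps {
		total += p.Mass
	}
	return total
}

// sumMassSoA performs contiguous reads that are trivially vectorizable.
func sumMassSoA(ps Particles) float64 {
	total := 0.0
	for _, m := range ps.Mass {
		total += m
	}
	return total
}

func main() {
	aos := []Particle{{Mass: 1}, {Mass: 2}}
	soa := Particles{Mass: []float64{1, 2}}
	fmt.Println(sumMassAoS(aos), sumMassSoA(soa))
}
```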


However each go-routine is still a sequential algorithm, and fundamentally programs will be built from parallel sequential tasks.

Not always true. Again nature is fractal.

shelby3 commented 5 years ago

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

And the corollary is that expending it to speed up serial algorithms that are in the performance critical path of the application is futile non-scaling.

keean commented 5 years ago

@shelby3

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Having 4 cores does not make up for this, just look at the failure of the Power Architecture. PowerPC was IBM’s attempt to follow that approach, lots of simpler cores. It was tried in the PlayStation 2, and the added complexity of programming the thing drove developers away, leading to the dominance of the Xbox, as it provided a proper PC-like CPU with caches and super-scalar out-of-order processing.

The ideas you and that article are proposing are not new, they have been tried, and they failed the test of the free market.

shelby3 commented 5 years ago

@keean wrote:

Another point which was implied in my writings on this issue which I want to make more explicit, is that expending the transistor budget on low-level speculative pipelining to speed up serial algorithms that are not in the performance critical path of the application is pointless marketing hype.

I disagree with this. Speculative out of order execution is a key feature in exploiting super-scalar architectures. Without it expect software to run 2-3 times slower.

Do you not understand what I intend by “are not in the performance critical path of the application”?

It can’t be slower in any relevant way, if it is not in the performance critical path.

Whoop-de-doo if you make your highly responsive GUI more responsive and nobody can discern the speedup, because it is not code in the performance critical path. Law of diminishing returns aka marginal (non-uniform) utility.

Another corollary is that only 5% (or maybe up to 20%) of your application’s code is typically in the performance critical path, although there are exceptions. Thus the less stellar programmers will still have a lot of work they’re qualified to do that can be serial algorithms.

Btw, one (or maybe 2) core in the CPU could retain that speculative pipelining and the other 8, 16, or 24 (and increasing in the future) could then not waste their transistor budget on marketing hype. Has AMD already designed a CPU like this, such as with most of the transistor budget for the GPU on the same die?

shelby3 commented 5 years ago

Yet another example of parallelism I think the compiler can extract without intervention from the programmer.

keean commented 5 years ago

@shelby3

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency, fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

shelby3 commented 5 years ago

@keean wrote:

It can’t be slower in any relevant way, if it is not in the performance critical path.

It's latency that's important, so speed is important, but not throughput. Many of the highly parallel systems have high throughput but also high latency, fine when you are doing a batch job, but bad for interactive desktop computers where low latency is critical.

I did not propose to employ parallelism in code that is not in the throughput performance critical path. Thus latency would not be impacted in the typically up-to-95% of a program’s code that is not in the throughput performance critical path. Since that 95% of the code can only run serially, it can’t leverage the other cores. Thus one or two cores with speculative pipelining are sufficient on the desktop (and servers will eventually discard speculative pipelining entirely once we improve the software). The rest of the transistor budget can be allocated to parallelism, which is the only way to scale.

You could argue that running multiple copies of the program, such as for a webserver where each request runs a separate copy of the program, would need every core to have speculative pipelining in order to minimize latency. But I would argue that a 50% reduction in CPU-bounded latency is not worth killing the scaling (by Moore’s Law) of the number of cores and thus the number of requests that can be served per CPU. This is why Xeon (which sacrifices clock rate for more cores), instead of the desktop Haswell/Broadwell architecture, runs on most servers. So Xeon is sacrificing latency for parallelism.

Also Ryzen is kicking ass on Intel precisely because they made the right tradeoff for where computing is headed, “Ryzen CPUs offered stronger multi-threaded performance and weaker single-threaded performance relative to comparable Intel CPUs.”

Sparc was too geared towards parallelism too soon. The software had not caught up yet. AMD is probably hitting the sweet spot for the transition phase. They decided to emphasize their strengths in design and parallelism (GPUs) and outsource the foundry (eventually leveraging the coming strength of China in foundries!). AMD is following Android’s model and Intel is following Apple’s model, so I expect AMD to end up with more market share than Intel unless they fuck up again to snatch defeat from the jaws of victory. Also AMD has rising capital to expend because of the rising demand for GPUs for mining cryptocurrencies. But eventually something like Sparc or GPUs will come back much stronger in the future when the software adapts to more parallelism. And look who continues Sparc now. Japan! The Asians are going to kick ass in the future.

Also the latency is typically not bounded by the CPU but rather by the persistent storage media, e.g. SSDs. So there is sometimes (or maybe even typically?) ample headroom to allow for parallelism and to discard the speculative pipelining without increasing latency, although this can vary in each situation. You could attempt to make the argument that the latencies of the CPU and persistent storage are additive, but smart coders know to mask the CPU latency inside the persistent storage latency with interleaving. Ditto masking the CPU throughput and latency within memory latency, via interleaving computation with read/write accesses.
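As a rough illustration of that interleaving (a sketch only; the file name, block size, and checksum work are hypothetical), a reader goroutine streams blocks from storage while the main goroutine computes on previously read blocks, so CPU work hides inside storage latency:

```go
package main

import (
	"fmt"
	"os"
)

// readBlocks streams fixed-size blocks of a file into a channel so that
// computation on earlier blocks can overlap the I/O wait for later ones.
func readBlocks(path string, blockSize int) <-chan []byte {
	out := make(chan []byte, 4) // small buffer keeps the reader ahead of compute
	go func() {
		defer close(out)
		f, err := os.Open(path)
		if err != nil {
			return // sketch: real code would report the error
		}
		defer f.Close()
		for {
			buf := make([]byte, blockSize)
			n, err := f.Read(buf)
			if n > 0 {
				out <- buf[:n]
			}
			if err != nil { // io.EOF or a real error ends the stream
				return
			}
		}
	}()
	return out
}

func main() {
	total := 0
	// The checksum work below runs while the next read is in flight.
	for block := range readBlocks("data.bin", 1<<20) { // "data.bin" is hypothetical
		for _, b := range block {
			total += int(b)
		}
	}
	fmt.Println("checksum:", total)
}
```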

shelby3 commented 5 years ago

I farted:

If for example that is an image filter, then split up the image file into tiles and parallelize the entire filter chain over tiles. If necessary overlap the tiles slightly if the filter input domain exceeds its output domain.
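A minimal sketch of that tiling idea in Go (the filter chain and tile height are placeholders, and for simplicity these tiles don’t overlap): each goroutine runs the whole filter chain over its own horizontal band of rows, with a WaitGroup as the rendezvous.

```go
package main

import (
	"fmt"
	"sync"
)

// filterChain stands in for an arbitrary per-pixel filter pipeline.
func filterChain(pixels []float64) {
	for i := range pixels {
		pixels[i] = pixels[i]*0.5 + 0.1 // placeholder filter
	}
}

// parallelFilter splits the image into horizontal tiles and runs the whole
// filter chain over each tile in its own goroutine.
func parallelFilter(img []float64, width, height, tileRows int) {
	var wg sync.WaitGroup
	for y := 0; y < height; y += tileRows {
		end := y + tileRows
		if end > height {
			end = height
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			filterChain(img[start*width : end*width])
		}(y, end)
	}
	wg.Wait() // rendezvous: all tiles finished
}

func main() {
	w, h := 4, 8
	img := make([]float64, w*h)
	parallelFilter(img, w, h, 2)
	fmt.Println(img[0])
}
```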

shelby3 commented 5 years ago

Some parallelism discussion in the Concurrency issues thread #17.

sighoya commented 5 years ago

@shelby, do you know how the RAM capacity of a normal PC board will increase in the future? A massive multi-core increase in the future is one good thing, but to run multiple instances simultaneously we also need more RAM capacity.

keean commented 5 years ago

@shelby3 Threadripper!

https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh

It's still speculative out of order with up to 32 cores.

shelby3 commented 5 years ago

@keean wrote:

It's still speculative out of order with up to 32 cores.

64 hyperthreads on a desktop! Wow.

Yeah I had realized that removing the speculative out-of-order doesn’t gain much. But the point remains that further increases due to Moore’s law are going into cores now. There’s no more that can be squeezed out of OoO execution.

Remember I wrote recently in the Concurrency issues thread #17:

So again I ask you, “Do you know what percentage of the silicon of Intel CPUs is for the speculative execution and register renaming as contrasted with the caching?”

Apparently only about 10 – 15% of the transistor budget is expended on OoO execution (the percentage of the TDP power budget may be higher):

https://hothardware.com/news/amd-ryzen-die-shot-cache-structure-architecture-advantages-exposed

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Core_2

https://en.wikichip.org/wiki/intel/microarchitectures/sandy_bridge_(client)#Core_2

So apparently there would not be much sense in removing it for purposes of increasing the number of cores.

Although it would still perhaps be beneficial to turn it off when we don’t need it (i.e. in I/O-bound concurrency) in order to conserve power consumption (important on mobile and in data centers), and when we need our code to be surely free from timing attacks that might be lurking in the nondeterministic out-of-order speculation.

Nevertheless the point remains that the transistor budget is increasing faster than transistors can be consumed in ways that would increase performance at extant core counts. IOW, the OoO execution and superscalar features can’t be improved further. So Intel and AMD are forced to use those transistors for increased core counts.

So more extensive use of concurrency and parallelism is unavoidable.

shelby3 commented 5 years ago

@sighoya I don’t know the plans about DRAM. I presume they’ll scale everything up. It’s the latency which can’t keep pace. But with massive concurrency we can coalesce main memory accesses as GPUs do.

@keean wrote:

But we need the L3 cache.

Well actually, one day in the distant future when the concurrency is massive and we can do coalescing of main memory (and also persistent storage) accesses with concurrency, then we won’t even need the L3 cache, but that is I think too far into the future to design for now.

For that sort of use C + OpenMP would be a good target.

Found this:

AMD told us you should theoretically be able to build a Threadripper system with up to 1TB of memory when 128GB LR-DIMMs are used—hopefully enough to hold you for a few years. (AMD also says Threadripper should technically be able to support up to 2TB of RAM, although the company hasn’t validated this because there are no DIMMs that support the capacity yet.)

keean commented 5 years ago

@shelby3 Think of it like this: CPUs like Threadripper are best when you have threads that need to take different paths, or run different tasks. GPUs are best when all the threads run in lockstep, all taking the same codepath in parallel.

On a CPU you might launch parallel threads and then wait for them all to complete (a rendezvous); on a GPU you know all the tasks finish at the same time. So a GPU is going to be more efficient for loop parallelisation (as long as it's a big enough chunk of work to overcome the cost of launching the task and collecting the results). A CPU is going to be more efficient at having one thread per service connection, dealing with client connections etc.

keean commented 5 years ago

@shelby3 A good parallel language should have different syntactic structures to represent these different kinds of parallelism, so the programmer is clear how things will execute and how much they will cost (the principle of visibility, and not paying for what you don't use). You could automate this with heuristics, but then the programmer has to play chase-the-heuristic, where small, seemingly inconsequential code changes can tip the balance and result in unpredictable changes in performance.

shelby3 commented 5 years ago

@keean made me aware of this:

Microsoft is working on a new paradigm for increased parallelism:

Now Microsoft ports Windows 10, Linux to homegrown CPU

Explicit data graph execution - Wikipedia

Also remember I predicted this earlier in this discussion about the death of speculative superscalar architecture and shared caching between threads:

OpenBSD disables Intel’s hyper-threading over CPU data leak fears

https://www.itwire.com/security/83347-openbsd-chief-fix-for-new-intel-cpu-bug.html

And remember that there is no way to disable timing information (can merely increment a variable in a loop):

https://gruss.cc/files/fantastictimers.pdf

The only 100% safe way to stop software from timing is to make all code run at a continuously, randomly changing clock speed, which is neither desirable nor realistic.
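To illustrate why removing hardware timers isn’t enough (a sketch only; the precision achieved depends on scheduling), one goroutine busily increments a shared counter, and any other code can read that counter before and after an operation to obtain a usable timestamp:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

func main() {
	var ticks int64

	// A "software timer": just a thread incrementing a counter as fast as it can.
	go func() {
		for {
			atomic.AddInt64(&ticks, 1)
		}
	}()

	// Any code can now timestamp an operation without a hardware timer.
	start := atomic.LoadInt64(&ticks)
	work := 0
	for i := 0; i < 1_000_000; i++ { // the operation being "timed"
		work += i
	}
	elapsed := atomic.LoadInt64(&ticks) - start

	fmt.Println(work, "took ~", elapsed, "ticks")
}
```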

OS preemptive threading may to some extent foil the use of timers which rely on incrementing a variable and non-preempted execution, but the devil is in the details. I don’t know the minimum interval needed to measure a timing attack with software timers. And if the OS preempts too frequently then overhead increases, which eats into the performance that was supposed to be gained from those hardware performance features (speculative execution and shared caches) which make the timing attacks possible. Thus I’m not sure that removing hardware timers is a complete solution. Also some software needs high-precision timers. Thus I still propose that in the future we will need the option to have code request that it be run only on cores with those vulnerable performance enhancements turned off, and/or to dictate which other code can run along with it in the other hyperthreads which share the same vulnerable performance enhancements. Code that prioritizes security over maximum performance, or which employs suitable parallelism, may want to turn those vulnerable performance enhancements off.

EDIT: if you want some flavor of why correlation attacks can be so insidious and we should not think we can so easily squash them just by removing a hardware counter.

keean commented 5 years ago

@shelby3 I am not sure, I think for most people performance is more important than security. I mean you can still exploit "rowhammer" on DRAM, but people are still using DRAM :-)

shelby3 commented 5 years ago

As I wrote previously, I think eventually we will end up with a way for applications to turn it off on a per core basis so that applications can decide whether they need to prioritize security (and battery life) over serial performance.

shelby3 commented 5 years ago

I want to make my stance clear because I learned in private messages that some misunderstood my stance. In my post that responded to Eric Raymond’s blog post and his claim about the invariant of SICK algorithms requiring serial performance, and in the discussion here on GitHub that has ensued, I didn’t intend to imply that we will ameliorate mathematical complexity theory. Rather my stance is that we will replace the need for those algorithms by solving problems with different strategies which can be parallelized and which do not require those algorithms. My stance is that we have no choice, and humans necessarily rise to the challenge and are ingenious when the market demands them to be.

We may also find algorithmic strategies to apply to existing SICK algorithms which leverage parallelism, but that wasn’t my central thesis.

shelby3 commented 5 years ago

@keean can you concur or correct my thought that §4. Challenge #‍1: Traffic on pg.3 of the paper Why On-Chip Cache Coherence is Here to Stay (which also appeared in a journal) contains an error in logic:

We now tackle the concerns regarding the scalability of coherence’s traffic on the on-chip interconnection network. To perform a traffic analysis, we consider for each cache miss how many bytes need to be transferred to obtain and relinquish the given block […] In a coherent system, when a core reads a block that is shared, the coherence protocol may need to forward the request, but to at most one core (and thus the traffic for each read miss is independent of the number of cores). However, when a core incurs a write miss to a block that is cached by one or more other cores, the coherence protocol generates extra messages to invalidate the block from the other cores. These invalidation messages are often used to argue for the non-scalability of cache coherence, because when all cores are sharing a block, a coherent system must send an invalidation message to all other cores. However, our analysis shows that when sharers are tracked exactly, the overall traffic per miss of cache coherence is independent of the number of cores […] Consider the access pattern in which a block is read by all cores and then written by one core. The writer core will indeed be forced to send an invalidation message to all cores, and each core will respond with an acknowledgment message (i.e., a cost of 2N messages for N sharers). However, such an expensive write operation can occur only after a read miss by each of those cores. More generally, for every write that invalidates N caches, it must have been preceded by N read misses. The traffic cost of a read miss is independent of the number of cores

Although they’re correct that the bus traffic overhead of the invalidation messages is constant relative to the traffic of the read misses, the cost of coherency for per-core private caches is that every write invalidation is by definition going to cause an additional read miss for each core that still needs access to anything (that wasn’t even invalid but was included) in the block (aka cache line) that was invalidated. IOW, the overall traffic cost for per-core caches due to write invalidation grows with the number of cores, i.e., it doesn’t scale. Or more simply stated: sharing mutable data between cores that have private caches doesn’t scale.


EDIT: I received an email reply from one of the authors of that paper. He concurred but pointed out that my point is only true when there exists the “unusual situation” of “false sharing” of “unrelated data … collocated on the same cache block, leading to needless coherence traffic.” His opinion is that “good data placement, false sharing shouldn't be a big problem.” I responded that I disagree because there will be cases where the record of related fields being written to also needs to be read again, so it’s not necessarily due to false sharing and thus not necessarily unusual or uncommon. Thus I continue to believe that the discovery by mostly @keean with my follow-up insights is very important. However I will say I don’t have any statistics or functioning models to check the frequency of this issue and its quantitative impact.


I don’t necessarily disagree with the title of the paper though once writable shared data is disallowed in the system. There’s still some coherency required…

If we adopt @keean’s suggestion in the Pointers #38 issues thread that all data shared between threads must be immutable (aka read-only), then my realization is that the only time that per-core caches need to be invalidated is when an immutable data structure has been deallocated and the memory it occupied is allocated anew for some other data. That is because the invalid data remaining in the cache will not be accessed by any program which respects the allocation until the same area in shared memory has been allocated anew. The allocator could optimize this further (so that other data in the same block which is still valid as cached is not prematurely invalidated) by refusing to allocate anew the memory that the deallocated immutable data structure formerly occupied until all of the surrounding memory which comprises an entire block (regardless of whether that entire block is loaded into any per-core cache) to be invalidated is also deallocated. Thus the invalidation broadcasts (to the per-core caches) would only happen when the entire block in shared memory is no longer in use and has accepted a new allocation.

This does conflate software and hardware though. One way to implement it is to have a system call for invalidating blocks and rely on software to manage this correctly for its virtual memory pages.

The conclusion section of that paper states:

First, this paper does not do detailed architectural simulations with specific benchmarks. We instead show that coherence’s per-miss traffic is independent of the miss pattern and number of cores. Although less precise than detailed simulation, our results are actually more robust as they are not limited to the specific benchmarks studied.

They actually failed to consider the holistic issue, which architectural simulations under scaling would presumably have illuminated. They considered each read miss independently, without considering (and thus implicitly presuming the absence of) the way that write invalidation drives more read misses, thus amplifying traffic.

Second, we did not compare against multicore chips without caches or without a shared address space.

They also failed to consider the case of an Erlang-like model wherein all shared data is immutable.

P.S. I emailed the authors of that paper a link to this post.


Additionally, immutable shared data can reduce the complexity of cache-to-cache transfers, as explained in 3) Cache-to-Cache Transfers of subsection B. Coherence Protocol Independent Design Considerations in §III. IMPLEMENTATION CONSIDERATIONS on pg. 7 of the research paper An Evaluation of Snoop-Based Cache Coherence Protocols:

3) Cache-to-Cache Transfers: To improve the timeliness of requests for data that is cached somewhere, it may be useful to allow another cache to supply the data rather than always going to memory. To support cache-to-cache transfers, each BIU must have a way of indicating whether or not its cache has the data and is able to supply it. This can be accomplished by means of a single wired-OR bus control line, which is asserted by each BIU if its cache is able to supply the data. The memory controller will also know that it should not supply the data if this control line is asserted. If several caches are able to supply a copy of the data, there must either be a way of selecting one of those caches (by using a priority or ownership scheme, for example) or it must be guaranteed that all caches will put the same value on the bus. In [1], it is noted that this additional complexity and the potential for slowing down the processors that have to provide the data from their cache has resulted in a limited use of cache-to-cache transfers.

Another issue with cache-to-cache transfers is whether or not one of the caches has the cache block in question in a Modified state, which can be indicated by another wired-OR bus control signal asserted by the BIU. If one of the caches has a modified copy of the block, it should not only be the cache that is chosen to supply the data, but it should also inhibit memory from supplying the data since memory has a stale copy [1][3]. As the cache is supplying the modified data to the requestor on the system bus, it is possible to write the modified value back to memory at the same time (as opposed to writing it back in a separate bus transaction). While this saves a bus transfer and improves bus performance, it has the additional complexity of having three cooperating members on the bus [1].

And cache-to-cache transfers provide power-savings as explained in subsection F. Energy Efficiency of Cache Coherence Protocols of §II. SURVEY OF EXISTING CACHE COHERENCE PROTOCOLS on pg. 5 of the same research paper. And also explained in §The case for cache coherency of the article How Cache Coherency Impacts Power, Performance.

Read-only shared data is actually faster and uses less power on existing systems which employ the MESI protocol.

shelby3 commented 5 years ago

This post contains a shocking (at least to me) discovery: Java and Go multi-threading with a global-heap, tracing GC is entirely incompatible with scaling multi-core. Scaling multi-core is entirely incompatible with multi-threading that shares mutable data between caches. Additionally, tracing GC is incompatible with scaling multi-threading unless each thread (or small group of threads) has its own separately collected heap, because as I explained a collector which isn’t instantaneous can’t be shared between too many threads: it would pause too many threads and make threads dependent on the good allocation behavior of other threads. So scaling multi-core will require reference counting for immutable objects shared between caches (and the immutability will aid in probing for and collecting cyclic references, while not stalling any threads during such cycle detection and collection).

The GC of objects which aren’t shared between caches must be congruent with a viable general purpose programming model. Intel’s Xeon Phi (aka Larrabee) suffered from not sufficiently addressing these requirements2. One option is Rust’s zero overhead 100% stack allocation with its lifetime and exclusive borrowing model providing safe reentrancy. Code would be grouped to be run only by hardware threads sharing the same cache (to avoid write-back and cache-to-cache transfer coherency overhead). Another option is similarly to restrict each logical grouping of code to hardware threads sharing the same cache, but employ a separate GC’ed heap for each said grouping. For example, each green thread could have its own separate tracing GC’ed heap. Note this means an M plurality of work-stealing green threads per core that has N = # of hyperthreads, in M:N threading. However, this latter option lacks Rust’s zero overhead abstraction of compile-time provably safe thread reentrancy, thus suffering the race condition (e.g. livelock and deadlock) errors and runtime overhead of locking. (EDIT: but this deficiency is solved with Actors.)

Additionally, for cache-to-cache transfer efficiency, and so that lightweight (e.g. green) thread context switches could occur to mask the latency of such transfers, ideally the programming model would enable the thread scheduler to know via software that such a transfer is required.

Serial core performance can’t scale and many-core doesn’t scale with complex cache coherence. At least it doesn’t scale if writable shared data is employed by the software. And if all the software will be modified to not use writable shared data, then the complex cache coherence circuitry is wasted silicon. In that case perhaps a less complex or even software driven coherence could be employed instead. If all cache-to-cache transfers are only needed when passing an Erlang-style, Actor model message from a thread on one cache to a thread on another cache, then the software can tell the hardware the source and destination caches for the transfer. (Note this means a plurality of Actors per core running on top of M work stealing green threads per core that has N = # of hyperthreads in M:N threading). Remember per the prior post that cache-to-cache transfers are needed so as to avoid (and if not read-only or was newly created, then also avoid the write-back to and) the load from main memory which is expensive both in terms of latency and power consumption.

The Rigel design offers some rough estimate of the improvement in computation density (e.g. FLOPS/mm² of silicon) obtained by removing the aforementioned hardware:

It was modeled on a 45 nm process. It has 1024 RISC cores (with two-wide issue superscalar ILP), 15 MB total SRAM cache, operating at 1.2 GHz with a 100W TDP. Compare that to the Intel i7 Lynnfield on a 45 nm process, which achieves only 4 CISC cores with 8 MB total SRAM cache operating at 2.8 GHz with a 95W TDP.

Ignoring latency differences and considering only throughput density1, I doubt the Lynnfield in 2009 had the 180 instructions in flight simultaneously that current modern Intel processors have. For one thing, Lynnfield didn’t even have hyperthreading. Let’s presume maybe it had 40 in flight and the Rigel only 2. So unless 13 RISC instructions (1024 ÷ (40 ÷ 2) ÷ 4) are required to match the work achieved by every CISC instruction, the Rigel may have a higher compute density than the Lynnfield. And the in-flight instructions on the Intel include some wasted computation that will be discarded by the speculation.

The Rigel paper discusses some of the other variants2 of parallelism throughput density designs which have been attempted.

One of the key distinctions1 is between the GPU vectorized (aka SIMD) model, which is able to mask main memory latency with a massive number of cheap-context-switch threads (affording an order-of-magnitude higher memory bandwidth by trading off increased latency), and the non-vectorized (aka MIMD) general purpose programming model, which prioritizes serialized low latency and serialized performance at the expense of lower computation density and higher power consumption per work completed (i.e. lower overall efficiency). The latter has an order-of-magnitude lower main memory bandwidth with lower but still high latency (c.f. also), requiring a large L3 cache to bring average latency down sufficiently for the serialized performance. As explained in the Rigel paper, the former is not suitable as a fully generalized programming model.

The following is an idea for a variant of a Rigel-esque design (yet with software-driven cache-to-cache transfers and cache coherence) that replaces the L3 (and L4) cache with a solution that has comparable latency, by masking most of the latency of main memory reads. Lightweight green threads are a suitable paradigm for masking macro-scale latency1, but they aren’t efficiently preemptive enough to generally mask main memory latency. But if main memory latency will only significantly occur on accessing reference counted shared objects (i.e. shared between caches), excepting cache spill-over eviction for non-shared objects (which should be minimized by the optimizing programmer), and such accesses probably mainly comprise reading those reference counted shared objects via cache-to-cache transfers (perhaps with software-driven cache coherency driven by Erlang-style, Actor model message passing), then the compiler can insert a green thread check-point for preemption by the memory controller (which ameliorates the lack of efficiency). Contrary to the incorrect association of green thread context switches with kernel thread/scheduler context switches, green thread context switches are low latency. At a function boundary, they’re only slightly more than the ~30 cycle latency of a jmp instruction, because only the SP register has to be saved+restored (given that the other registers and flags are invalidated at the boundary). But in this case the preemption would not be at a function call boundary, so all the registers and flags have to be stored and restored, which may double the latency. The memory latency is ~100 ns on CPUs (and ~400 ns on the GPU), which on a 2.6 GHz CPU is 260 cycles. But this really needs hardware support that the software can query, so that in case the main memory being accessed is already cached (which the software wouldn’t know) the unnecessary context switch would be avoided. The extra latency for storing and restoring the other registers and flags could sometimes be eliminated by prefetching to the cache at the start of the function containing the main memory accesses. No registers nor flags are yet in use at the start of a function.

Hardware-based cache coherency (which is required without this software directed coherency we’re contemplating) obfuscates the true costs of the hardware form of coherency.

1 Sufficient threaded parallelism can mask latency so that hardware is not idled. The GPU achieves this with a huge register file (RF) and putting 32 threads in a contiguously context switched warp, so context switching has low overhead (and also reduces the control circuitry needed per execution unit). The modern i7 hyperthreading employs dedicated registers for the hyperthreads which share execution pipeline ports and run simultaneously. Thus hyperthreads although masking some latency are also intertwined with a focus on serial performance. Green thread work stealing which can run on modern CPUs without hardware support employs compiler generated checkpointed pre-emption (not the same as full pre-emption) which could probably be made more efficient with dedicated hardware support. Green thread context-switches are reasonably efficient compared to OS threads, and especially where context switched at function boundaries, but only suitable for masking software macro-scale latency blocked operations, which does not include low-latency main memory accesses. They may be able to also mask efficiently the latency of waiting on a mutex lock to be freed without hardware assistance.

The GPU trades off latency in every facet for increased parallelism, by eliminating synchronization assumptions for caches and global memory, as well as for example reducing the power consumption (also) by designing a much higher, 24-cycle write latency (archived) (c.f. also) on the register file (RF) than the per-core registers on a CPU have. The RF (and shared memory) is shared by all threads in the block, so a thread context switch is very low overhead because, unlike the CPU, no registers need to be saved.

The GPU’s shared memory also has a much higher latency than L1 cache on the CPU and doesn’t serve the same function as a CPU cache. Also the shared memory is banked (successive words are assigned to the 16 or 32 successive banks), so that if threads in the same block read different locations in the same bank on the same clock cycle then they will run serially instead of in parallel. All 32 threads of a warp must be unblocked in order for the warp to execute, but even if they are unblocked they aren’t guaranteed to execute uninterrupted. Note, given 32 threads attempting uniformly distributed random reads from 32-banked shared memory, probability math informs us that 63% of the threads would run in parallel every cycle. Thus such a divergent, non-vectorized access pattern would require four (4) times the 24-cycle latency for each step forward of the warp (thread block) that requires all 32 threads to access the shared memory, because (0.37 ^ 4) × 32 < 1, so that less than one (1) thread is expected to remain blocked after those four rounds (since all threads in the warp must be unblocked in order for the warp to execute). This is a reason the GPU is not suitable for general purpose programming.

So on the GPU one needs to mask latency by either running many more than 8 threads per SM (i.e. “More concurrent threads”) and/or serially queuing up in the RF many copies of each step of an algorithm before processing the next step for each of those copies (i.e. “More independent instructions (ILP) within a thread”).

There’s no cache coherency on the GPU. Even the GPU which has L1 and L2 cache of main memory (actually L1 may optionally be configured only for caching local memory which is what the RF spills over into), requires a threadfence() for global coherency of writes to main memory. Even communication between warps requires a shuffle paradigm.

2 @keean had mentioned to me the Xeon Phi as being an example of multi-core that failed to be as performant per watt as a GPU and failed to match the modern Intel CPUs on serial performance. But that was based on Larrabee which was originally intended to compete with a GPU yet also remain more generally programmable. The Xeon Phi has hardware coherent cache and out-of-order execution! So it’s not surprising that it hasn’t really solved any major need. It isn’t optimally designed for maximum general purpose programming parallelism (c.f. the Actor proposal in the next post) nor for vector, SIMD programming. And of course isn’t optimal for serial algorithms either. Thus it excels at nothing.

shelby3 commented 5 years ago

Actors as the paradigm for parallelism

(note having very bad whole body inflammatory attack today with a severe headache, forehead on the keyboard fatigue, mental confusion, so please excuse any lapses in my elocution and logic below)

@keean and I had discussed Actors in 2016, for example in the Concurrency issues thread #17.

@keean and I were recently discussing in private messaging the read-only sharing via messages for Actors. Before that, I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long term and at higher scaling. Erlang has proven to be excellent for creating web servers for example.

@keean pointed out why not just make every UTXO record an Actor. I initially thought it would increase the overhead, but as we discussed the details we both ended up realizing there would be no overhead. I contributed the idea that even a hashmap could be modeled with Actors by making each bucket an Actor which contains for example a linked-list of values. So after deriving the bucket address from the key, the message could be sent to the Actor at that address to check for the value in the bucket or add/remove a value to/from the bucket. For example, if the pointer to the linked-list for each bucket is stored in an array and every Actor for those buckets employs the same code for processing an incoming message, then there’s no extra per-bucket storage needed for the Actor. Simply call the Actor’s function with input arguments consisting of the message pointer and the aforementioned bucket address. The objects pointed to for the message and the linked-list are of course immutable. The Actor modifies the linked-list by creating a new copy with the required modifications and then changes the pointer at the bucket address of aforementioned array.
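A rough Go sketch of that bucket-as-Actor hashmap (illustrative only: the types, the FNV hashing, and a per-bucket sync.Mutex standing in for the 1-bit lock are my assumptions). Each bucket’s only state is the head pointer of an immutable list, and “sending a message” is just calling the bucket’s function:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// node is an immutable singly linked list cell; "mutation" builds new cells.
type node struct {
	key   string
	value int
	next  *node
}

// bucketActors models each hashmap bucket as an Actor: the bucket's state is
// one list-head pointer, and a message is a guarded function call.
type bucketActors struct {
	heads []*node
	locks []sync.Mutex // stands in for the 1-bit-per-Actor mutex
}

func newBucketActors(n int) *bucketActors {
	return &bucketActors{heads: make([]*node, n), locks: make([]sync.Mutex, n)}
}

func (b *bucketActors) bucket(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(len(b.heads)))
}

// Put is the impure message: it builds a new immutable cell (sharing the old
// tail) and swings the bucket's head pointer under the bucket's lock.
func (b *bucketActors) Put(key string, value int) {
	i := b.bucket(key)
	b.locks[i].Lock()
	defer b.locks[i].Unlock()
	b.heads[i] = &node{key: key, value: value, next: b.heads[i]}
}

// Get is the pure message: it snapshots the head and walks immutable cells.
func (b *bucketActors) Get(key string) (int, bool) {
	i := b.bucket(key)
	b.locks[i].Lock()
	n := b.heads[i]
	b.locks[i].Unlock()
	for ; n != nil; n = n.next {
		if n.key == key {
			return n.value, true
		}
	}
	return 0, false
}

func main() {
	m := newBucketActors(16)
	m.Put("tx1", 42)
	fmt.Println(m.Get("tx1"))
}
```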

Note that when messages will cause the Actor to have side effects, the message must be guarded by a mutex lock to prevent Actor function reentrancy. Thus each Actor really should have pure and impure variants of its function. The impure variant blocks on the mutex lock. These mutex locks can be 1 bit for each Actor when the Actors are in an array (because those adjacent Actors allow for a bit field for these mutex lock bits). But note this mutex lock would be required anyway for mutating any shared memory data with multiple threads, so the mutex bit isn’t additional overhead due to Actors.

@keean pointed out that there’s no need to queue the messages and that we can instead employ the mutex lock for blocking. He also pointed out that there’s no need to store a stack pointer (SP) for each Actor, because there’s no context-switching going on at the Actor abstraction and each Actor should return to the top level of its stack when returning from its message-processing function. The code sending the message (which may be inside another Actor’s message-processing function) calls the Actor’s function, and thus there’s no context switch required.

In order to simulate a message queue (so that the calling Actor can complete instead of blocking), have the called (impure) Actor queue the message and return. The called Actor must have registered itself to be called periodically by a thread pool (or registered as a callback, such as of another Actor) to process its queue.
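A minimal Go sketch of that queued variant (assumptions: a buffered channel stands in for the queue, and a worker goroutine stands in for the thread pool that drains it):

```go
package main

import (
	"fmt"
	"time"
)

// queuedActor accepts messages without blocking the sender and processes
// them later from a worker, simulating "queue it and return".
type queuedActor struct {
	mailbox chan string
}

func newQueuedActor() *queuedActor {
	a := &queuedActor{mailbox: make(chan string, 64)}
	// The worker plays the role of the thread pool that periodically
	// calls the Actor to drain its queue.
	go func() {
		for msg := range a.mailbox {
			fmt.Println("processing:", msg)
		}
	}()
	return a
}

// Send queues the message and returns immediately (unless the queue is full).
func (a *queuedActor) Send(msg string) {
	a.mailbox <- msg
}

func main() {
	a := newQueuedActor()
	a.Send("hello")
	a.Send("world")
	time.Sleep(100 * time.Millisecond) // give the worker time to drain (sketch only)
}
```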

The other key insight I had is that these Actors can run atop green threads. So for example when blocked on the message mutex lock, the green thread can context-switch. @keean initially had the thought that green threads don’t preempt, but I explained (c.f. footnote 1 in my prior post) that they can be made to software-“preempt” (aka cooperatively preempt) at key check-points where they can block, and Actor message-passing calls would be one of those check-points.

If the called Actor is only executable by threads running on another core (or group of cores, if there is more than one core per software-driven-coherency cache), then green threading can context-switch from the calling Actor to another Actor on the same core / cache group. The exception is if the return type of the called Actor is () (aka void in C): then the calling Actor can continue immediately (presuming delivery is guaranteed) while the sent message is queued by the green threading on the called Actor’s core / cache group. (This presumes that failed delivery of a sent message raises an exception, although note that message delivery is not assured in the Actor model; a failure or timeout might still raise an exception, yet the timeout will not be definitive because the message could nevertheless have been delivered.) Actors shouldn’t presume any semantics based only on the timing of the return of calling (i.e. sending a message to) another Actor, because Actors are supposed to have autonomous state machines so that they can tolerate complex cascading (dominoes) failures due to indeterminacy over the timing of delivery or failed delivery of messages.

Note although we’re describing the model of how Actors work locally on the same CPU, the message passing Actor model also scales across the bus and the network. For example, the UTXO records could be split up across servers in a cluster.

1 There’s no panacea paradigm which will prevent all cases of dead and live locks, i.e. the solution to the dining philosophers problem requires omniscience. Neither the Actor model, immutability, nor Rust’s exclusive mutable borrowing will eliminate all opportunities for dead or live locks. For example, the issuer of a message could be waiting on a response from the recipient of the message in order to complete some operation, yet the recipient of the message could (even if transitively) end up waiting on the issuer before it can complete its operation and reply to the issuer.


In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these always return (to the top level of their stack, thus total deallocation of all non-shared objects) and per our model shared data will always be reference counted.

So a simple model is: any object which the code takes the address of and passes as a pointer in a function call or as a return value is forced to be allocated on the bump-pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! An amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump-pointer heap.
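A toy Go sketch of that bump-pointer scheme (purely illustrative; in a real implementation this would be compiler-generated and type-aware, not a byte arena): allocations bump an offset, and the whole young heap is “collected” by resetting the offset when the Actor function returns.

```go
package main

import "fmt"

// arena is a toy bump-pointer young-object heap for one Actor invocation.
type arena struct {
	buf []byte
	off int
}

func newArena(size int) *arena { return &arena{buf: make([]byte, size)} }

// alloc bumps the offset and hands back a slice of the requested size.
func (a *arena) alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		panic("arena exhausted (sketch: a real runtime would grow or spill)")
	}
	p := a.buf[a.off : a.off+n]
	a.off += n
	return p
}

// reset "collects" every young object at once when the Actor function returns.
func (a *arena) reset() { a.off = 0 }

// handleMessage stands in for an Actor's message-processing function.
func handleMessage(a *arena, msg string) {
	defer a.reset() // total deallocation on return to the top of the stack
	scratch := a.alloc(len(msg))
	copy(scratch, msg)
	fmt.Println(string(scratch))
}

func main() {
	a := newArena(1 << 10)
	handleMessage(a, "hello actor")
	fmt.Println("offset after return:", a.off) // 0: the young heap was reset
}
```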

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus we mustn’t store, in any reference counted object, any pointer that points into the stack or the bump-pointer heap. And we mustn’t store, in any object on the stack or the bump-pointer heap, any pointer into a reference counted object that can have its reference count decremented by that Actor before that Actor function returns. For example, a reference counted object whose reference is created by a function would decrement the reference count when the function returns, unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.


Someone commented and I replied:

that all sounds very Erlang-ish, but I guess they are missing your type system constrains

Erlang has process-level context switches, which are very preemptive (although apparently also cooperative) but carry very heavyweight overhead. Thus it can’t eliminate the L3 cache for scaling multi-core as I proposed in the prior post. Also AFAIK Erlang doesn’t guarantee the queue-less message passing that @keean and I devised, which is really critical to lowering the overhead of Actors to zero as I explained, and also critical for implementing software cache coherency as I explained in the prior post. Also Erlang does not have the stack allocation I proposed and instead has automated generational GC, which I have pointed out is incompatible with multi-core scaling. Yes, Erlang and Actors are the inspiration, but we have proposed significant innovations, and yes the Erlang PL is also lacking our ideas for a type system and other features we’d like to have for a complete PL, such as modules. Also I think that with Zer0 I will likely first target Go as the compile target and VM, although in this case the reference counted objects (as declared in the code) would actually be garbage collected by Go’s GC on the Go compile target. The important point is that Zer0 can be compiled in the future to more performant code and will support a multi-core scalable future.

Note by targeting Go initially and cooperative preemption (instead of Erlang BEAM VM’s process-level preemption, which is apparently also partially cooperative), we could, as compared to Erlang (c.f. also), better integrate SIMD and other assembly-level optimizations without corrupting the runtime the way NIFs can in Erlang. We’d still need to compile cooperative preemption into the SIMD code (c.f. footnote 1 in my prior post).

The Actor-like design contemplated in this post would rectify the following complaints about Erlang:

One thing Erlang isn't really good at: processing big blocks of data.

He means things like decoding mpeg data. There is too much numerical computation which Erlang is not optimized for. If the processing just involves shuffling big blocks of data from one place to another, then Erlang is quite good at it. (Files to TCP Sockets, etc)

You can't update shared blocks of data (there are no pointers in Erlang) and hence data must be shuttled across processes which in turn translate to inefficiencies.

shelby3 commented 5 years ago

In response to the Quora question What programming languages do software engineers use in 2018?, this answer is more evidence to think that the proposed Actor model in this thread is the future:

When Salesforce bought Mulesoft for $6.5 billion, every programmer should have sat up and taken notice. This event adds to other recent API-related developments, especially from IBM, that clearly signal that the new programming paradigm is based on APIs and microservices.

Websites are transitioning from standalone collections of web pages to dynamic microservices hosted in the cloud. This simply means that what used to be a website is now a collection of independent programs running in the cloud. These independent programs “talk” with each other and with other services in the cloud via APIs.

For example, the veterinarian where I take my dog has a traditional static website with all the usual information about hours, personnel, specializations, etc. There are links to other veterinary websites where I can create an account and buy dog food, medications, toys, etc.

Now, with the proliferation of APIs, this architecture is transformed: the formerly static website interfaces directly with other websites using APIs, so the partitioning among websites becomes transparent. Soon I will go to my veterinarian’s “website” and all services and products will be available in one place using a single domain name and a single customer account. There will no longer be hyperlinking among individual, standalone websites (at least insofar as the visitor can tell). This integration of the WWW is happening really, really fast.

sighoya commented 5 years ago

This might interest you.

shelby3 commented 5 years ago

@sighoya ty. Very succinct blog. Good choice. Yeah I was thinking what @keean and I were contemplating is sort of a mix of CSP and Actors, because we are thinking to block on sending and guarantee message delivery. Yet we were also talking about addresses and not channels to make it more efficient.

That is an excellent point; that blog reminded me that one of the key aspects of Actors (which I had also mentioned recently and in the past to @keean) is that they guarantee neither message delivery nor order of delivery. So the Actor model forces one to design for network failures.

So we’ll need traditional Actors as well as our proposed optimized form for optimizing local resources.
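To make the CSP/Actor mix concrete, here is a small Go sketch (Go channels are only an analogy for the proposed queue-less message passing, not the actual design): an unbuffered channel blocks the sender until the receiver takes the message, which is the CSP-like “block on send, guaranteed local delivery” behaviour, while a buffered channel behaves more like a traditional Actor mailbox where the send returns immediately and delivery/ordering become things the receiver must be designed to live without.

```go
package main

import "fmt"

func main() {
	// CSP-like: an unbuffered channel makes the send a rendezvous; the
	// sender blocks until the receiver actually takes the value, so local
	// delivery is guaranteed and no queue is needed.
	sync := make(chan string)
	go func() {
		sync <- "ping" // blocks until main executes the receive below
	}()
	fmt.Println("received:", <-sync)

	// Mailbox-like (traditional Actor): a buffered channel lets the send
	// complete immediately; delivery and ordering become the receiver's
	// (or the network's) problem, so failure must be designed for.
	mailbox := make(chan string, 8)
	mailbox <- "fire-and-forget" // returns at once, message sits in the queue
	fmt.Println("later:", <-mailbox)
}
```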

shelby3 commented 5 years ago

I wrote:

In addition to this amazing insight above, I also realized when writing my prior post that compile-time, provably safe, zero overhead allocation will not require the onerous Rust lifetime analysis2 for Actor function code, because these always return (to the top level of their stack, thus total deallocation of all non-shared objects) and per our model shared data will always be reference counted.

So a simple model is: any object whose address the code takes and passes in a function call or as a return value is forced to be allocated on the bump pointer young object heap. All the other objects get allocated on the stack. The young object heap can be entirely collected by simply resetting the bump pointer when the Actor function returns (to the top level of its stack)! Voila! Amazing paradigm-shift! Note we should also try to minimize the number of duplicate objects allocated on the bump pointer heap.

AFAICT, although a very simple concept, this last insight of mine is a paradigm-shift (a technological tour de force) of epic proportions in terms of impact! The performance of Rust and C++ without the clusterfucked complexity.

Note thus one mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And one mustn’t store in any object on the stack or the bump pointer heap any pointer that points into any reference counted object whose reference count can be decremented by that Actor before the Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns, unless that reference is returned to the caller of the function.

2 Which is especially undesirable not just because of the tsuris but because Rust can’t model all (c.f. also) forms of lifetimes and thus is inflexible and incomplete.

@keean brought to my attention Ada’s new research on static memory management which he claims doesn’t require “annotating all the ‘lifetimes’” as Rust requires.

This new research is layered on top of their SPARK language. SPARK is a strict subset of Ada that was designed to have unambiguous semantics and that aimed at formal verification from the start.

However, consider the limitations of SPARK:

SPARK removes some features of the Ada language such as recursion, dynamic memory allocation, access types9 [i.e. pointers], dynamic dispatching and generics. It also imposes certain restrictions on the constructs of the language (e.g. array dimensions are always defined by previously declared type(s) or subtype(s)10).

On top of the language restrictions, it adds annotations that allow for the specification of data-flow, creation of abstract functions and abstract datatypes and to the use of structured assertions (loop invariants are available but package11 invariants are not). There is also a clear distinction between procedures (which may have side-effects) and functions (which are pure/free of side-effects and also have to return a value).

So SPARK has no generics. Rust’s lifetime parameters would be significantly less often annotated if Rust didn’t allow type parameters.

This new research adds pointers to SPARK by enforcing strict control of aliasing.

But Ada’s access pointers are (as I discussed in the past) already severely restricted such that they’re probably as or more cumbersome than Rust’s lifetimes. The research paper also mentioned this:

As part of its strong type checking, Ada also prevents dangling references to objects on the stack or the heap, by providing automatic compile-time checking of “accessibility” levels, which reflect the lifetimes of stack and heap objects.

Conversions between pointer types are restricted to ensure pointers never outlive the objects they designate. Values of a pointer type are by default initialized to null to prevent use of uninitialized pointers, and run-time checks verify that a null pointer is never dereferenced.

Since Ada doesn’t accurately track lifetimes, it can’t support destructors:

Standard Ada supports safe use of pointers (“access types” in Ada) via strong type checking, but safety is guaranteed only for programs where there is no explicit deallocation of pointed-to objects – explicit deallocation is considered “unchecked” programming in Ada, meaning that the programmer is responsible for ensuring that the deallocation is not performed prematurely. Ada can provide automatic reclamation of the entire memory pool associated with a particular pointer type when the pointer type goes out of scope, but it does not automatically reclaim storage prior to that point. It is possible for a user to implement abstract data types that do some amount of automatic deallocation at the object level, but this requires additional programming, and typically has certain limitations.

So although this Ada proposal reduces some lifetime annotations, it still has the same inability as Rust to allow all safe cases2, and it has additional limitations which Rust doesn’t have.

And worse the Ada proposal doesn’t even allow pointers to objects on the stack!

AFAICT, Ada is really an incredibly inflexible “What Color is Your Function” clusterfuck.

I much prefer the idea I cited at the start of this post, which attains static memory safety with 100% flexibility, 100% of the features, 0% tsuris, and doesn’t even require formal verification! How about that for a paradigm shift! And if we still need Rust’s lifetime analysis in some highly critical code paths, then let’s consider my idea to make the annotations less noisy. But frankly, if we’re going to add optional lifetimes for functions that we want to be extra performant (so we can prove non-aliasing to the compiler), I think we should devise stronger proving which can prove safe cases that Rust can’t2. However, in that case I really think we should just trust the programmer to write those critical performance paths correctly and label as non-aliased (e.g. the restrict keyword in C) those which the programmer asserts aren’t aliased (although proving non-aliasing at compile-time is generally a tall order, there are scenarios where it can be determined locally, e.g. locally allocated distinct objects), unless we need formal verification for some sort of absolute guarantee of correctness.

EDIT: I’m not claiming the Ada research has no utility. The SPARK subset provides for formal verification. So within that SPARK language, the additional feature of pointers to objects on the heap is an improvement. I’m just arguing that it is not the ideal paradigm for a general purpose PL wherein productivity and ease-of-programming are a higher priority than formal verification.

shelby3 commented 5 years ago

I wrote:

Note thus one mustn’t store in any reference counted object any pointer that points into the stack or the bump pointer heap. And one mustn’t store in any object on the stack or the bump pointer heap any pointer that points into any reference counted object whose reference count can be decremented by that Actor before the Actor function returns. For example, a reference counted object whose reference is created by a function would have its reference count decremented when the function returns, unless that reference is returned to the caller of the function.

The only reasons I can think of for why automatically reference counted (ARC) objects would be employed inside Actor code:

  1. For references that aren’t persisted after the Actor returns. These are objects which are either sent in a message call to another Actor and/or which have destructors (i.e. that execute when the reference count drops to zero).

  2. For references that persist after the Actor returns. These are either as persistent state owned by the Actor or objects returned to the caller of the Actor.

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns. The compiler will enforce the invariant by checking that objects with this annotation can only be assigned to references which are not re-assignable (aka const in JavaScript but not in C++), including those returned from functions that return this annotated type. And the compiler will check that every such annotated reference which was not an input argument must be returned by that function. If the function is at the top level of the stack then no such checks are required, because the reference can’t go out of scope and decrement the reference count until the Actor returns (thus such an annotated reference may also be stored in persistent Actor data).

Such annotated types can then be freely assigned to non-ARC references. This mostly ameliorates the bifurcation (analogous to “What Color is Your Function?”) problem from the two kinds of references due to segregated GC schemes for shared and non-shared objects (that has been proposed in this thread).


EDIT: Unlike Pony, our non-ARC objects can’t be shared. They must be copied to a (said annotated) ARC object in order to be shared with another Actor (i.e. sent in a message). We could also adapt some of Pony’s principles on sharing objects, in that Pony allows iso and trn objects which can be locally mutated before being shared between Actors/threads. That may break the software cache coherency I invented up-thread for massively scaling multicore. So we probably don’t want to copy that flexibility of Pony, which actually makes it more abstruse. See the chart and discussion I wrote in the other thread.

Data race protection (which should not be conflated with protection against all forms of semantic dead and live lock faults) in both Pony and my idea is due to the fact that the Actor is not reentrant (i.e. runs only single-threaded), each Actor runs only one thread at a time, and never can more than one Actor write to the same shared object (not just not simultaneously, rather not ever, except for the exclusive local writes of iso and trn to non-shared objects before sharing them). Pony also has exception safety but in a non-ideal purist form which Zer0 will improve.
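A hedged Go illustration of the “copy before sharing” rule (the Snapshot type and the channel are made up for the example, and Go’s GC stands in here for the proposed ARC of shared objects): the actor mutates purely local state and then sends a value copy as the message, so the receiver never holds an alias into the sender’s mutable data and no two actors ever write the same object.

```go
package main

import "fmt"

// Snapshot is the shared, effectively-immutable message type: it is sent
// by value, so the receiver gets its own copy and cannot alias the
// sender's local mutable state.
type Snapshot struct {
	Count int
	Label string
}

// producer mutates purely actor-local state, then copies it into a
// Snapshot to share it. The local variable itself is never shared.
func producer(out chan<- Snapshot) {
	local := Snapshot{Label: "work"} // actor-local, freely mutable
	for i := 0; i < 3; i++ {
		local.Count++
	}
	out <- local // sent by value: a copy crosses the actor boundary
	close(out)
}

func main() {
	ch := make(chan Snapshot)
	go producer(ch)
	for s := range ch {
		fmt.Println(s.Label, s.Count) // receiver only ever sees the copy
	}
}
```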

sighoya commented 5 years ago

My idea is to have a type annotation (for reference counted objects) which indicates that a reference to that object persists after the Actor returns.

Why create an extra type for it? A type says only something about how its values are stored.

Rather, I would use keywords just like const and mutable to restrict access through some pointers.

sighoya commented 5 years ago

@shelby3

I'm wondering why you need to lock shared memory at all with actors.

If different (mutating) actors want to mutate a shared state, why not wrap the shared state in a (main) actor itself? Every mutating actor has to send a request for mutation of some address, with a timestamp, over a specific channel which cannot be used by the other mutating actors. The main actor cyclically processes the input channels for all mutating actor requests. If more than one request is available for the same address of the shared state, e.g. the same index of an array, then the main actor decides by timestamp which applies first; if two timestamps are equal then some configurable policy can apply.
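A rough Go sketch of what is being described here (the names, the batching, and the timestamp policy are illustrative only): the shared state is owned by a single goroutine standing in for the “main actor”, each mutating actor gets its own request channel, and the owner drains the channels and applies the requests ordered by timestamp.

```go
package main

import (
	"fmt"
	"sort"
)

// MutReq is a mutation request sent to the owner of the shared state.
type MutReq struct {
	Index int   // which slot of the shared array to mutate
	Value int   // new value
	Stamp int64 // timestamp used to order competing requests
}

// owner is the "main actor": it alone holds the shared state and applies
// one batch of requests per cycle, ordered by timestamp.
func owner(state []int, inputs []chan MutReq, done chan<- []int) {
	var batch []MutReq
	for _, ch := range inputs { // one dedicated channel per mutating actor
		for req := range ch {
			batch = append(batch, req)
		}
	}
	// Apply in timestamp order, so the latest stamp wins for a given slot;
	// a configurable policy could break exact ties instead.
	sort.Slice(batch, func(i, j int) bool { return batch[i].Stamp < batch[j].Stamp })
	for _, r := range batch {
		state[r.Index] = r.Value
	}
	done <- state
}

func main() {
	a := make(chan MutReq, 4)
	b := make(chan MutReq, 4)
	a <- MutReq{Index: 0, Value: 10, Stamp: 2}
	b <- MutReq{Index: 0, Value: 20, Stamp: 1}
	close(a)
	close(b)

	done := make(chan []int)
	go owner(make([]int, 4), []chan MutReq{a, b}, done)
	fmt.Println(<-done) // [10 0 0 0]: the later stamp (2) was applied last
}
```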

shelby3 commented 5 years ago

@sighoya, they (e.g. re-assignability, mutability, and my proposal herein) are annotations (i.e. attributes) on a type. That makes them a distinct type, i.e. otherwise compatible types with incompatible attributes are thus incompatible types. IOW, these attributes have to be factored into unification of types.

I'm wondering why you need to lock shared memory at all with actors.

We don’t and haven’t proposed needing to do so. Apparently something I wrote up-thread confused you. Please refresh this page, as I had made some extensive clarifying edits to some of the up-thread posts. I suppose you’re referring to the following quoted text from the Actors as the paradigm for parallelism up-thread post?

@keean and I were recently discussing in private messaging, the read-only sharing via messages for Actors and before that I was initially negative about Actors because they aren’t compatible with a shared memory model exposed in PLs and they don’t inherently resolve all concurrency conflicts (but neither does any paradigm)1.

Analogous to comments made by others when comparing the JVM to Erlang’s VM, I was thinking it’s necessary to have shared memory to, for example, support a global UTXO hashmap where the keys are the SHA256 transaction ids and the values are (pointers to) UTXO (i.e. unspent transaction output) records. There’s a related Q & A How does shared memory vs message passing handle large data structures?. There’s also an explanation of the paradigmatic inferiority of shared memory versus message passing. Message passing also amortizes better (c.f. also) over the long-term and higher scaling. Erlang has proven to be excellent for creating web servers for example.

@keean pointed out why not just make every UTXO record an Actor […]

The above was pointing out that I thought I needed shared memory, but my private discussion with @keean had revealed that Actors subsume the need for shared memory. I explained the reasons why …

Note that the proposed Actor model I’m describing in this thread will work on top of either shared global memory or message passing. I have described how to also implement it on hardware that employs cooperation with the software for cache-to-cache transfers and to mask latency that we would otherwise need L3 (and L4) cache for. Note it can also run on top of Go, but less efficiently than if we compile directly to the hardware (or perhaps LLVM). But Go as an interim compile target would be expedient and also gain us some momentum, and as well the GopherJS JavaScript (and WASM) output comes for free with Go.

shelby3 commented 5 years ago

Some discussion of this Actor proposal in the context of the Mutable Objects issues thread #32.

Also discussed in the Iteration versus enumeration issues thread #16.

shelby3 commented 5 years ago

The Actor model is perhaps the antidote to dependencies on side-effects order? (c.f. the discussion of purity at the linked comment)

shelby3 commented 5 years ago

@keean tried to implicitly claim credit for the idea I thought of in this thread.

shelby3 commented 5 years ago

Rich Hickey, Clojure’s designer in his explanation of his reasons for not employing the Actor model for same-process state management in Clojure, seems to have not realized that the same code could be recompiled separately for the same-process and distributed cases without incurring performance penalties to the former when the Actor message send is a pure function call as I explained.

I do agree with him that when modeling for potentially high latency delays and timeouts (i.e. failure) resiliency needed in the distributed network case, the same-process shared memory would lose some flexibility and/or incur some (semantic and/or performance) overhead. Yet in some cases this could perhaps be parametrised.

shelby3 commented 5 years ago

@keean wrote in the other thread:

You clearly were the first person to explicitly mention the memory management implications of the actor approach here in these issue comments. I didn't really have time to do a detailed analysis at the time which is why I didn't directly reply.

Maybe I don't fully understand what you are proposing, but I think that the lifetime of mutables being tied to that of actors is kind of obvious,

Maybe obvious in hindsight but has anyone proposed replacing Rust’s lifetimes debacle with such an effortless paradigm as what I have proposed?

It’s not enough to just have a stack. My insight requires the compiler’s escape analysis, which moves the escaped objects to the bump pointer heap (analogous to a young generation GC but without the write-barrier and mark-and-sweep overhead). The bump pointer heap in my insight is cleaned up (i.e. all the free space is recovered) with one assembly instruction instead of some complex mark-and-sweep pause overhead as in typical AMM GC collectors.
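Go’s own compiler already performs the escape analysis being referred to, and its diagnostics make the split visible (the function names below are just for the example; in the proposal the escaped objects would land on the per-Actor bump-pointer heap rather than Go’s traced heap). Building with `go build -gcflags=-m` reports which locals stay on the stack and which are moved to the heap.

```go
package escdemo

// stackOnly's value never has its address escape the function, so the
// compiler's escape analysis keeps it on the goroutine stack.
func stackOnly() int {
	x := 42
	return x
}

// escapes returns a pointer to a local, so the value must outlive the
// frame; `go build -gcflags=-m` reports something like "moved to heap: y".
func escapes() *int {
	y := 42
	return &y
}
```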

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I think the reason it was more obvious to me is because of the extensive study I did recently on the types of and designs for garbage collection.

however immutable data, if it can still contain pointers, will mean that there will still be data on the local thread stack that can get shared (although we could force a deep copy, this would be slow, and may not terminate).

Indeed. I explained the solution for that in my up-thread reply to @sighoya.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

Also actors will replace objects, and therefore the management of actor lifetimes would become the same problem as the management of object lifetimes in other languages. In the actor model you can usually pass references to actors, which is kind of like passing mutable data.

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

1 Essentially in the abstract, it is another application of my concept of “chunking” in an entirely different context of what “chunking” means compared to my use of that abstract concept in the Subtyping issues thread #8. Herein the “chunking” is one bump pointer heap for each actor. We could say that “chunking” in the abstract means divide-and-conquer or divide-and-isolate.

keean commented 5 years ago

@shelby3

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I would not say that this simple thing is your idea, to me your idea is the combination of bump-pointer-heap, and RC, when you want to move objects to the heap etc. Your idea is the synthesised whole mechanism, not the individual components which may have been talked about elsewhere. If you deconstruct any idea enough it is usually composed of existing ideas.

RAII in C++ uses the fact that object/actor property lifetime is tied to that of the object/actor so that when an object is destroyed all its properties are destroyed. Of course being C++ there is no guarantee that this will happen, if any of the properties are objects they will of course have to have a properly working destructor.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

I would point out that by making values first-class you can prevent this from happening without solving the halting problem. We simply make "owning pointers" have to be a DAG. This limitation would hold for all values, and would be implicit if values had referential transparency.
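To see why the DAG restriction sidesteps the Halting problem for a Copy/deep-copy operation, consider this illustrative Go sketch (the Node type and clone function are invented for the example): a naive structural copy terminates on any acyclic ownership graph, but recurses forever the moment an owning cycle exists, which is exactly the case the DAG restriction rules out.

```go
package main

import "fmt"

// Node owns its children; if ownership is restricted to a DAG (no cycles
// through owning pointers), the naive clone below always terminates.
type Node struct {
	Val      int
	Children []*Node
}

// clone is a naive deep copy: it terminates on a DAG (possibly duplicating
// shared subtrees) but loops forever if an owning cycle exists.
func clone(n *Node) *Node {
	if n == nil {
		return nil
	}
	c := &Node{Val: n.Val}
	for _, child := range n.Children {
		c.Children = append(c.Children, clone(child))
	}
	return c
}

func main() {
	shared := &Node{Val: 2}
	root := &Node{Val: 1, Children: []*Node{shared, shared}} // a DAG, not a tree
	fmt.Println(clone(root).Children[0].Val) // 2: terminates, shared node duplicated

	// root.Children[0].Children = []*Node{root} // an owning cycle: clone(root)
	// would now recurse forever, which is why owning pointers must form a DAG.
}
```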

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

In the actor model there are no closures, and you cannot pass closures in messages, (because we cannot serialise functions, and there are no references to functions or closures). Instead higher order programming is achieved by passing actor references. If we cannot pass actor references then we cannot write higher order programs directly, although we can use the "typeclass trick" I posted about where you send a datatype, and that datatype selects the implementation via a typeclass.

The problem I currently have is that typeclasses represent a global abstract algebra, something we both have said was desirable in the past. Actors are the opposite of this, and I am not sure the two conceptually work well together. With typeclasses we want both objects and types to be in typeclasses, and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

Coming back to some of my other points, Actors are a kind of container, they are not 'Values'. To make actor interfaces work everywhere, everything would need to be an actor. That can make sense for mutable objects (things that have a location) like "{3}" could be an actor. But it does not make sense for values like "3". This effectively means having two parallel "interface" systems, something like type-classes for values to allow operator overloading and an algebra on values, and a separate interface system for actors which could also be type-classes (with effects?).

shelby3 commented 5 years ago

@keean wrote:

So my insight1 is actually ostensibly not entirely obvious; and I’ve seen no one make that observation. Have you?

I would not say that this simple thing is your idea, to me your idea is the combination of bump-pointer-heap, and RC, when you want to move objects to the heap etc.

My proposed bump-pointer-heap is quite different than the use of a bump-pointer-heap for young generation objects in a mark-and-sweep GC. In that there’s a distinct bump-pointer-heap for each Actor and it’s completely deallocated when the Actor function/procedure completes after processing each message. Thus, there’s no write-barrier nor mark-and-sweep compacting overhead whatsoever.

The ARC (automatic reference counting) is for the globally shared objects. So it’s not connected in any way to the bump-pointer-heap.

Your idea is the synthesised whole mechanism, not the individual components which may have been talked about elsewhere. If you deconstruct any idea enough it is usually composed of existing ideas.

True, except there’s no bump-pointer heap that I know of that didn’t have the integration overhead of being part of a mark-and-sweep GC scheme.

RAII in C++ uses the fact that object/actor property lifetime is tied to that of the object/actor so that when an object is destroyed all its properties are destroyed.

That is per object and per stack frame. Thus it does not provide the holistic freedom of cross-referencing across stack frames and between all types of objects and references. RAII bifurcates the program into non-RAII and RAII.

I agree that if you squint your eyes and tilt your head sideways, you might see some conceptual resemblance to my idea. But I could see some conceptual resemblance between a rabbit and an elephant. They both have large ears.

You’re astute to raise the point that the Copy typeclass must always terminate. The programmer has to make it so. You’re correct to imply there’s no general way for the compiler to analyse this invariant because of the Halting problem.

I would point out that by making values first-class you can prevent this from happening without solving the halting problem. We simply make "owning pointers" have to be a DAG. This limitation would hold for all values, and would be implicit if values had referential transparency.

In general we can’t ensure that DAG invariant.

In the proposal I described in this thread, shared data (between actors and thus between threads on different cores in some future software coherency massively multicore hardware model) will always be immutable and reference counted (ARC). That is an orthogonal issue to the lifetimes of the objects on stack and bump pointer heap for each actor’s function/procedure. The latter are what benefit from my insight of an instant GC cleanup (just reset the heap allocation bump pointer to the origin when the actor function/procedure returns).

In the actor model there are no closures, and you cannot pass closures in messages, (because we cannot serialise functions, and there are no references to functions or closures).

That is a restriction only on the message communication between actors. We can have closures instead of the actor’s function/procedure.

Instead higher order programming is achieved by passing actor references.

That is one way. But it is also still allowed to have any programming paradigms we want inside the actor’s function/procedure. Remember my model up-thread is not of millions of actor nodes with no cache and no shared memory. Rather my model for scaling multi-core described up-thread is groups of cores sharing a L1 cache. So we don’t need to entirely code everything in actors.

If we cannot pass actor references then we cannot write higher order programs directly,

Disagree for the reasons stated above. However, I do agree your point might apply when modeling data structures with actors.

I don’t understand your point though. Why can’t we pass around actor references in messages?

although we can use the "typeclass trick" I posted about where you send a datatype, and that datatype selects the implementation via a typeclass.

Where did you describe that trick? I am not familiar with what you are describing. I don’t understand.

The problem I currently have is that typeclasses represent a global abstract algebra, something we both have said was desirable in the past. Actors are the opposite of this, and I am not sure the two conceptually work well together. With typeclasses we want both objects and types to be in typeclasses,

I see no problem. Typeclasses will work fine within actor functions/procedures.

and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

I don’t want to control effects with your idea of typeclasses. I want to eliminate effects by making actors function independently of the ordering of messages. For other types of effects, we need some model that works at the actor level.

Coming back to some of my other points, Actors are a kind of container, they are not 'Values'. To make actor interfaces work everywhere, everything would need to be an actor.

I don’t want this level of pervasive integration of Actors. I think of Actors as an API, not a first-class language entity.

shelby3 commented 5 years ago

I wrote:

and we want to control effects with an effects system. With actors only the actors implement interfaces, and side effects are implicit in the actors.

I don’t want to control effects with your idea of typeclasses. I want to eliminate effects by making actors function independently of the ordering of messages. For other types of effects, we need some model that works at the actor level.

By effects we mean management of a state machine.

What I am saying is that the state machines of Actors should be independent from each other. They should function correctly regardless of the state machines outside.

So this means abandoning determinism and accepting statistical progress.

Determinism is the wrong assumption. We can’t move forward with computing until we abandon it.

In fact, my decentralization solution to the centralization problem of blockchains which nobody else has solved, is basically at the generative essence about abandoning determinism.

So I think the entire concept of algebraic effects, purity, and monads is far too low-level to deal with the problem they are actually purporting to deal with.

keean commented 5 years ago

@shelby3 state is just one kind of effect. There are others like whether code allocates memory, whether it is interruptible, whether it does IO. The point with effects is to control which are permitted. An example might be where we want to guarantee the machine cannot run out of memory in the middle of a process: we allocate all of the working space in advance, and then require no memory allocation during the critical process itself. Even within IO we might break down the effects into local IO and network IO, so that we might restrict code from accessing the network.

You can think of these a bit like the security permissions on iOS and Android, we can see from the top level program which permissions are needed, so that for example we might not want to install a password storage program that contacts the network, and conversely we can impose restrictions on the code we call.

The alternative is a capability system. With effects the authority is ambient - that is everything is permitted unless it is specifically disallowed. By default you can do IO etc. In a capability system you cannot do anything impure at all without a 'capability' to do it. Capabilities are passed downward in code, so when your program starts it might be passed the capabilities to allocate memory and read a particular file. The objects would need to be passed downward as arguments in your program to where they are needed.

So in a way capabilities are the analogue of modules and effects the analogue of type-classes.

sighoya commented 5 years ago

@keean

Can't we implement effects orthogonally to the type system? I mean, if you want that for security permissions only, why not append properties to functions stating their effects, which are checked in the requires section of other functions which take the said function.

I don't see any reason to pass effects as implicit arguments; this only makes sense if you override effects, but for what gain? The ability to override effects for a function destroys the meaning of the said function.

shelby3 commented 5 years ago

@keean wrote:

State is just one kind of effect. There are others like whether code allocates memory, whether it is interruptible, whether it does IO. The point with effects is to control which are permitted. An example might be where we want to guarantee the machine cannot run out of memory in the middle of a process: we allocate all of the working space in advance, and then require no memory allocation during the critical process itself.

I think those effects will become more and more useless anyway.

We are headed into a massively networked and parallelized, distributed execution model. So there will be very little that can be guaranteed. Programs instead need to be written to be resilient and non-deterministic with statistical correctness instead.

Besides, I am not creating a specialized programming language for a deterministic embedded system. I am creating a general purpose programming language for the wave of the future.

I have suddenly had the realization that this effects stuff is really much drama about the mostly irrelevant.

You can think of these a bit like the security permissions on iOS and Android, we can see from the top level program which permissions are needed, so that for example we might not want to install a password storage program that contacts the network, and conversely we can impose restrictions on the code we call.

AFAICT, security permissions are a different issue entirely.

keean commented 5 years ago

@shelby3

AFAICT, security permissions are a different issue entirely.

I don't think so. Access to a file is equivalent to permission to access that file. Most programming languages are full of 'ambient authority', that is, state hidden elsewhere determines whether you can do X or Y. These are precisely the kind of effects you can model with an effect system.

Like I said capabilities and effects are two ways to do the same thing:

With effects:

f() requires FileRead("/usr/Bilbo")
   x := open("/usr/Bilbo")
   print(x)

With capabilities

f(y : FileReadPermission("/usr/Bilbo"))
   x := open(y, "/usr/Bilbo")
   print(x)

Note with capabilities we always need to pass down the permissions, all the way from the top, main would be passed the total set of permissions available to the application. I don't think this needs any language features.
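Here is a hedged Go rendering of that capability style (the ReadCap type is invented for the illustration, and it is not a real security boundary in Go, merely a coding discipline): main constructs the capability and passes it downward, and only code that is handed the capability can perform the read, with no ambient authority assumed.

```go
package main

import (
	"fmt"
	"os"
)

// ReadCap is a capability to read one specific file. Code that is not
// handed a ReadCap has no way (by convention) to perform the read.
type ReadCap struct {
	path string
}

// Open uses the capability; there is no ambient "open anything" authority
// in this style, the permission travels as an ordinary argument.
func (c ReadCap) Open() (*os.File, error) {
	return os.Open(c.path)
}

// f mirrors the example above: it can only read the file because its
// caller explicitly granted it the capability.
func f(capability ReadCap) {
	file, err := capability.Open()
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer file.Close()
	buf := make([]byte, 32)
	n, _ := file.Read(buf)
	fmt.Printf("read %d bytes\n", n)
}

func main() {
	// main holds the total set of permissions and passes them downward.
	f(ReadCap{path: "/usr/Bilbo"})
}
```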

With effects we can infer the effects for a function from the function definition, the same way we can infer typeclass requirements.

I think with mobile devices, and security, these kind of permissions are going to become more important, and effects, like type-classes help keep function argument lists simple.

shelby3 commented 5 years ago

@keean wrote:

I don't think so. Access to a file is equivalent to permission to access that file.

You entirely missed my point. Which is that effects will end up being indeterminate and non-deterministic and thus they aren’t something that is expressible with a permission flag because they can’t be captured by any determinism as we move into this brave new world.

In contrast, security permissions are inherently deterministic by definition.

keean commented 5 years ago

@shelby3

In contrast, security permissions are inherently deterministic by definition.

And so they should be. There is no grey area if you want to protect privileged information. If I do not grant an app permission to access my address book, then it should not be able to under any circumstances. I certainly don't want it non-deterministic, where it might give access if the application is lucky.

shelby3 commented 5 years ago

A summary in another thread, of the historical development of the “actor epiphany” that was introduced in this thread.

shelby3 commented 5 years ago

I wrote up-thread:

EDIT: Unlike Pony, our non-ARC objects can’t be shared. They must be copied to a (said annotated) ARC object in order to be shared with another Actor (i.e. sent in a message). We could also adapt some of Pony’s principles on sharing objects, in that Pony allows iso and trn objects which can be locally mutated before being shared between Actors/threads. That may break the software cache coherency I invented up-thread for massively scaling multicore. So we probably don’t want to copy that flexibility of Pony, which actually makes it more abstruse. See the chart and discussion I wrote in the other thread.

Data race protection (which should not be conflated with protection against all forms of semantic dead and live lock faults) in both Pony and my idea is due to the fact that the Actor is not reentrant (i.e. runs only single-threaded), each Actor runs only one thread at a time, and never can more than one Actor write to the same shared object (not just not simultaneously, rather not ever, except for the exclusive local writes of iso and trn to non-shared objects before sharing them). Pony also has exception safety but in a non-ideal purist form which Zer0 will improve.

Zer0 shall adopt Pony’s paradigm for our ARC references. I summarized this in the WD40 issues thread #35:

Whereas, if I am incorrect and can add the iso and trn for Zer0’s ARC objects, then for those ARC objects Zer0 would be similarly abstruse as Pony and less performant (because ARC is less performant than tracing), but for the non-ARC objects Zer0 would be less abstruse and more performant. Presuming most of the computation occurs in Actor function lifetimes with non-ARC objects, then Zer0 will be less abstruse and more performant.

Note Pony adds iso, trn and tag pointer types to the ones we had contemplated so far.

| Pony | Zer0 | Access | Shared Aliasing | Non-Shared Aliasing |
| --- | --- | --- | --- | --- |
| iso (Isolated) | | Exclusive read-write (writable) | None | None |
| val (Value) | immutable | Immutable (non-writable) | Read-only | Read-only |
| ref (Reference) | read-write | Read-write (writable) | None | Read-write |
| box (Box) | read-only | Read-only (non-writable) | None / Read-only | Read-write / Read-only |
| trn (Transition) | | Exclusive write-only (writable) | None | Read-only |
| | write-only | Write-only (writable) | None | Read-write |
| tag (Tag) | | Address-only (opaque pointer) | Address-only | Address-only |

The iso and trn can be moved (aka alias burying) to any other above type (except not trn to iso) because they have exclusive ownership of writing. Also these types above can be employed within a recover block. Record fields also combine with the record pointer type to give a combined capability type. There’s also subtyping of capabilities, but remember in Zer0 all subsumption is via casting so as to make type inference decidable.

There’s no utility for Zer0 to have such exclusive writable types on non-ARC objects, because these can’t be shared without copying them to ARC (the non-ARC objects don’t live beyond the Actor function lifetime), and copying of course can be to any type. Zer0 could adopt the iso and trn on ARC objects. Yet the exclusive reading aspect of iso for immutable ARC objects could possibly prove useful to eliminate the need to trace those for cyclic references, because only one pointer could exist.

shelby3 commented 5 years ago

Extending Pony’s Reference Types

I propose to rename and extend Pony’s reference capabilities for Zer0 as follows.

| Pony | Zer0 | Access | Shared Aliasing | Non-Shared Aliasing |
| --- | --- | --- | --- | --- |
| iso (Isolated) | [r/w] | Exclusive (read/write) (writable) | None | None |
| val (Value) | val | Immutable (non-writable) | Read-only | Read-only |
| ref (Reference) | r/w | Read/write (writable) | None | Read-write |
| | [r]/w | (Exclusive read)/write (writable) | None | Write-only |
| | r/[w] | Read/(exclusive write) (writable) | None | Read-only |
| | [read] | Exclusive read-only (non-writable) | None | Write-only |
| box (Box) | read | Read-only (non-writable) | None / Read-only | Read/write1 / Read-only |
| trn (Transition) | [write] | Exclusive write-only (writable) | None | Read-only |
| | write | Write-only (writable) | None | Read/write |
| tag (Tag) | id | Address-only (opaque pointer) | Address-only | Address-only |

Note exclusive (read and/or write) capability is useful for being consumed (in addition to being consumable for other reasons such as sending between concurrent threads) for supersumption. For example supersuming the union element type (i.e. to add constituent union members, e.g. from Int to Int | String) of a collection. It’s safe to write to the resultant supertype (e.g. Int | String) because no reference can read the consumed source subtype (e.g. Int) that would otherwise be violated by writes (e.g. of String) to the supertype.

The dual subsumption case (i.e. removing constituent union member(s), e.g. from Int | String to Int) is handled (without consumption) by assigning a type with (even non-exclusive) write capability to the subtype, with the write capability intact but discarding for the subtype reference any read capability (which would otherwise enable reading from the source supertype, thus violating the subsumed invariant). There’s no need to consume the source of the assignment because no new types were added to the subtype (to be written to) which would otherwise violate reading from the source type.

Pony doesn’t allow sharing as box (i.e. read-only between actors) a mutable that can only be written by one “actor”. Although it wouldn’t be data race unsafe, it would vacate the optimization of employing CPU registers for values of read-only capabilities, because the single-threaded execution assumption would be vacated. Also it would be a debugging nightmare substitute for event handlers.

1 When a box alias was obtained from a trn, then the trn can be consumed as a val for sharing (c.f. also).

shelby3 commented 5 years ago

More discussion of the importance of the Actor paradigm-shift and what it really is.