keean / zenscript

A trait based language that compiles to JavaScript
MIT License

Pointers #38

Open keean opened 6 years ago

keean commented 6 years ago

Memory may best be modelled as a one-dimensional affine space over a module, see:

http://www.m-hikari.com/imf-password2009/29-32-2009/ostrowskiIMF29-32-2009.pdf

Points represent memory addresses, and vectors represent offsets. We only want the operations of affine-addition and affine-subtraction for addresses.

shelby3 commented 6 years ago

@keean explained this to me very well in a private message:

Consider an array in memory. If X points to the first element, and Y points to the last element, then Y - X is a distance, in this case the length of the array; it is not a pointer. Dereferencing (Y - X) will get you a segfault quickly. Pointers are an affine space over a module because I can find the mid-point of an array with X + (Y - X) / 2, and this is fundamental to all binary search algorithms. It makes no sense to add two pointers: (X + Y) is meaningless. We can only add a 'distance' to a pointer. But we can take the affine sum of two pointers, which is basically (n * X + m * Y) / (n + m); this is the generalisation of the binary chop that allows addressing any value in the array, but nothing outside. So we can subtract pointers and affine-sum them, which makes them an affine space. However an affine space is over the reals (that is, it includes rational and irrational numbers), so we need a 'module', which in mathematics is basically what a computer word is, that is a "ring" of integers modulo some number; for example (x : Int) mod 256 is how a byte is represented mathematically (I'm sure you know this as it comes up in cryptography a lot). So a computer pointer is an affine space over a ring. It's not a vector because adding pointers does not make sense.
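To make the allowed operations concrete, here is a minimal C sketch (mine, not part of the original message): subtracting two pointers yields a distance (ptrdiff_t), adding a distance to a pointer yields a pointer, and the "binary chop" midpoint X + (Y - X) / 2 is exactly what binary search relies on. Adding two pointers is not expressible at all.

#include <stddef.h>

/* Sketch only: lo points at the first element, hi one past the last. */
int *binary_search(int *lo, int *hi, int key) {
    while (lo < hi) {
        ptrdiff_t len = hi - lo;   /* pointer - pointer = distance, not a pointer */
        int *mid = lo + len / 2;   /* pointer + distance = pointer: X + (Y - X) / 2 */
        if (*mid < key)
            lo = mid + 1;
        else if (key < *mid)
            hi = mid;
        else
            return mid;            /* found */
    }
    return NULL;                   /* not found; note lo + hi would be meaningless */
}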

Then pointers are not vectors. Vector subtraction yields a vector, not a distance. So this I think indicates that enabling pointer subtraction breaks the vector abstraction.

So references are not vectors because we can not subtract them and pointers are not vectors because when we subtract them we don't get another vector.

Yes pointers do not form a vector space, they form an affine space, and are therefore "Torsors". See: https://ro-che.info/articles/2013-01-08-torsors

Yeah good point that many algorithms are enabled because an array is bounded. But in a Turing machine pointers are not limited to a bounded tape.

So I think you should clarify that you modeled a bounded array, not a theoretical Turing machine.

So in the unbounded case, it’s still an affine space but not over a module/modulus.

IOW, you can’t divide by n+m. You can only model that n+m must equal 1.

It's about "meaning" of pointers. Memory consists of a heap of extents. Pointers inside extents have meaning, pointers outside extents point to unused or freed heap memory, adding the address of one heap object to the address of another heap object is meaningless. For the heap manager (GC) the whole heap is one array.

But you acknowledge that in a Turing machine there is no extent to memory space. It is unbounded. I agree your model applies to arrays but not to a Turing machine.

Pointers have a different meaning in a Turing machine.

A Turing machine is not type-safe.

Nothing is type safe.

Type safety is always an approximation of reality, because reality is unbounded.

You can generate a random number, and then dereference the pointer... But it's likely it points outside the address range of the program and will generate a segfault.

In general a pointer points to an object, and arithmetic is only valid inside the object.

The program has no knowledge of how objects are placed in memory.

A pointer as originally conceived by C can point at anything, even freed memory.

Your model is for a managed language.

Even in a Turing machine the program itself has no knowledge of memory organisation.

The memory model is in the programmers head.

It's like in the old machine code days, you had variables in zero page, stack working down from the top of memory, program code in the middle somewhere.

I think your discovery explains that pointers have less coherent meaning if they’re not restricted to bounds. Free form pointers such as in C are incoherent.

So I subtract two stack addresses and I am likely to get an address in zero page, but the variable it points to has no relation to the things you subtract.

As you said, in C, the memory model is only in the programmer’s mind.

Agreed incoherence.

Yes.

It’s interesting how operations on an abelian group which are coherent inside a module are incoherent from the external perspective. It’s like the modulus is the wormhole through a black hole out to another order (object).

Quantum entanglement at a distance can perhaps be modeled by that concept of randomness being a perspective of the modulus.

Possibly this could provide the impetus for a theory for why what seems random at some scale may be ordered into a cycle at another scale (a la Martin Armstrong’s claims about how fitting cyclical correlation models to data over longer time frames can reveal hidden cyclical order).


@anonymous wrote:

I don't even understand affine space and I looked it up on wikipedia.

If you applied maybe 30 minutes to think about it, you would get it.

p + K(a - p) + (1 - K)(b - p)

Expand that:

p - Kp - (1 - K)p + Ka + (1 - K)b

Obviously 1 - K - (1 - K) = 0, so the p terms cancel and the result is Ka + (1 - K)b

So the point is that since the weights K and (1 - K) sum to 1 (equivalently, the coefficients of p sum to 0), p never matters, where p is the origin to which the vectors a - p and b - p are being added.

So @keean showed that pointers into the array (i.e. indices which could otherwise be negative or extend beyond the end of the array) form an affine space w.r.t. the array, regardless of the origin of the addresses in memory space. To adhere to that affine space he divides by n + m when taking the weighted sum of pointers, so that the addition is confined to the extent of the array. Meaning that if X and Y are pointers into the array, then the affine addition cannot produce a result outside either the origin or the end of the array, even though the origin and length of the array are not known in the equation for the affine addition. You can see this is true from the above derivation (noting that K = n / (n+m) and 1 - K = m / (n+m)) because the greatest and least values of the result are bounded by X and Y.
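As a small illustration (my own, not @keean's), the weighted sum can be rearranged so that only a distance is ever added to a pointer, which makes the bound obvious:

#include <stdio.h>

int main(void) {
    int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int *X = &arr[0], *Y = &arr[7];
    int n = 3, m = 1;

    /* (n*X + m*Y) / (n + m) rewritten as X + m*(Y - X) / (n + m): only a
       distance is ever added to X, and since 0 <= m/(n+m) <= 1 the result
       stays between X and Y no matter where the array sits in memory. */
    int *p = X + m * (Y - X) / (n + m);
    printf("index %td, value %d\n", p - X, *p);   /* prints "index 1, value 1" */
    return 0;
}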

shelby3 commented 5 years ago

Thinking about how to incorporate Go’s pointers into Zer0, and about the fact that in Go a struct is allowed fields which are either pointers or stored-in-place (which is not exactly the same concept as boxed versus unboxed).

So the up-thread discussion pointed out that pointer arithmetic and willy-nilly pointers (that can point anywhere) don’t obey any coherent model.

Thus I’m thinking we need Go’s pointers (which do not support pointer arithmetic) but that we shouldn’t allow an interior pointer to a field of a structure (as Go does). AFAIR, this restriction will also make any novel GC ideas more viable. The main reason to allow pointers to struct fields is for treating stored-in-place fields as objects separate from the struct as a whole, which seems to be an incoherent model.

I do like reusing the . operator for struct field access both on stored-in-place and pointers.

I’m thinking that any data type which has unbounded type variables (i.e. the polymorphism domain isn’t enumerated) must be boxed thus it must use a pointer. But note that typeclass bounded type parameters aren’t unbounded.

I’m thinking that field order in memory layout shouldn’t be guaranteed (which is the case for Go), so that a future compiler can optimize, e.g. for cache-optimized alignments and even interleaving of an array of structs implemented by the compiler as a struct of arrays of fields, as mentioned in that article about why C isn’t a low-level language anymore. Yet also for I/O we will need a built-in operator or function for serializing a struct to a compacted and ordered layout (also with big- or little-endian conversion).
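For example, a hypothetical serialization helper (the names and layout are mine, not a committed Zer0 design) could write fields in a fixed declared order with explicit endianness, independent of whatever in-memory layout the compiler chose:

#include <stdint.h>

struct point {
    uint32_t id;
    int16_t  x;
    int16_t  y;
};

static void put_u16_le(uint8_t *out, uint16_t v) {
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)(v >> 8);
}

static void put_u32_le(uint8_t *out, uint32_t v) {
    out[0] = (uint8_t)(v & 0xff);
    out[1] = (uint8_t)((v >> 8) & 0xff);
    out[2] = (uint8_t)((v >> 16) & 0xff);
    out[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Always 8 bytes on the wire, in declared field order and little-endian,
   regardless of padding or any field reordering the compiler performs. */
static void serialize_point(uint8_t out[8], const struct point *p) {
    put_u32_le(out + 0, p->id);
    put_u16_le(out + 4, (uint16_t)p->x);
    put_u16_le(out + 6, (uint16_t)p->y);
}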

Any other thoughts on issues?

keean commented 5 years ago

My thoughts are that arrays should be a language primitive, and modelled as an affine space.

We can write a garbage collector by getting an array of bytes from the operating system, and we can do byte pointer arithmetic (affine addition) within this space.

It should not be possible to do pointer arithmetic on struct fields, so a pointer to a struct field would have a length of one.

So we have two kinds of pointer, one has an associated length, and cannot have any arithmetic done on it (an owning pointer), the other has no length, has affine arithmetic, but must have an associated context (refer to an owned object).
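A rough C sketch of my reading of those two pointer kinds (the names and representation are mine, not a settled design):

#include <stddef.h>

/* An owning pointer carries the extent it owns and permits no arithmetic;
   a pointer to a single struct field would simply have length 1. */
typedef struct {
    unsigned char *base;
    size_t         length;
} owning_ptr;

/* A local/affine pointer carries no length and permits affine arithmetic
   (pointer +/- distance), but is only meaningful in the context of the
   owning pointer whose extent it must stay inside. */
typedef struct {
    unsigned char *addr;   /* must lie within [owner.base, owner.base + owner.length) */
} affine_ptr;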

shelby3 commented 5 years ago

So struct is layered on top of an affine array model? Thus I am correct we should not allow pointers into fields of a struct (nor elements of an array)?

keean commented 5 years ago

We can allow pointers into struct fields. We can convert a pointer with length, plus a pointer without length into the first object, into a new pointer with length.

shelby3 commented 5 years ago

Your edits are nearly incomprehensible to me. You are writing in some private language, thinking in terms of the abstraction of m+n = 1 for an affine space. Anyway, I think perhaps what you intend is my thought as follows?

I am thinking that pointers to fields are incoherent because they know nothing about the fact that they point into an affine space that is larger than the data type they think they are pointing to. The correct model for a field pointer is a tuple of the pointer to the struct plus a field key.

keean commented 5 years ago

@shelby3 OR a pointer into an object. So for example if we have an object with start address and length x := 0x100 + 0x10, then we can have a pointer like y := 0x104, provided we know it is associated with object x. For a struct the type signature tells us which are valid addresses and what the contents are. So for a 'struct' we would need the type information of the struct, plus the start address, to validate a pointer.

The point is that the following 'C' should be valid:

#include <stdio.h>

struct tx {
   int a;
   float b;
};

int main(void) {
   struct tx x = {2, 3.0f};
   float *y = &x.b;       /* y declared as a pointer to the field b of x */
   printf("%f\n", *y);
   return 0;
}

y is clearly a pointer to a float, so this should be allowed; however, we should not be allowed to do arithmetic on it, because the object it points to is not affine.

shelby3 commented 5 years ago

I was contemplating whether I was hesitant only because of vaguely remembering a post I made about some issue with field references and the myopia of the Rust lifetime checker. But that is an orthogonal issue.

But then I remember that the interior pointers issue definitely impacts the efficiency of all forms of GC, including reference counting…

y is clearly a pointer to a float, so this should be allowed; however, we should not be allowed to do arithmetic on it, because the object it points to is not affine.

My point is that the type of y is float and thus the compiler knows nothing about the fact that y points into an object of type struct tx. If instead y knows it is accessing field b via a type struct tx, then it doesn’t need an interior pointer to struct tx.

These interior pointers (which, as mentioned, Go allows) for example reduce the performance and/or storage efficiency of reference counting and other forms of GC (depending on which scheme is chosen1), which is part of my proposal for radically improving GC by combining a generational young GC with explicit annotation of reference counted objects that are intended to live longer.

I’m thinking it would be better, instead of a pointer to a field, to keep a tuple of the pointer to the struct and a field identifier, but this would require all pointers to incur this storage overhead, else we’d end up with two incompatible worlds of pointer types. Yet I’m undecided (see below).
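A hypothetical sketch of that tuple representation (the names field_ref, object and offset are mine, purely for illustration):

#include <stddef.h>

/* The reference always names the allocated base (the only thing the GC or
   reference counting needs to see); the field is reached via an offset. */
struct field_ref {
    void  *object;   /* base of the allocated struct */
    size_t offset;   /* which field, e.g. offsetof(struct tx, b) */
};

static inline void *field_ref_deref(struct field_ref r) {
    return (char *)r.object + r.offset;
}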

1 Interior pointers, regardless of whether employing reference counting, generational GC, or mark-and-sweep GC, either incur a search cost for finding the base of the object, or the storage (including cache locality) cost of a double pointer (or pointer and offset): one pointing to the interior field of the object and another pointing to the allocated base of the object (where the reference count is also stored in the reference counting case). In the case of the double pointer (or pointer and offset), even non-interior pointers would incur this extra overhead, else we’d end up with two incompatible worlds of pointer types. I wrote in my post about my cordoned nursery proposal:

The AMM of “pointer aliasing” to prevent “use after free” for our cordoned nursery proposal would also require double-sized pointers, to point to both the containing object’s head and the field of the object pointed to.

And another variant on that, in the reference counting case, is to also keep the reference counts with the (double or single) pointer and divvy up the reference counts when copying the pointer, then only access the dereferenced common reference count when deallocating pointers, thus adding back up to the total divvied reference count to signal deallocation of the allocated block. Also I’m hoping that it will be possible to prove in some cases that a reference counted pointer could not be deallocated during the lifetime of a copy of a pointer, thus the copy of the pointer doesn’t need to manipulate the reference count, e.g. in the case that a copy of a reference counted pointer is input to a function which does not store that copy.

shelby3 commented 5 years ago

If array and string slices are the main benefit of interior pointers, then instead of paying the penalty on all pointers to support them, why not just employ the pointer + offset for those types which offer slicing? Seems much more economical.
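A sketch of that (the slice layout here is my own illustration, not Go’s or Zer0’s): only the slicing types pay for carrying the base, and ordinary references stay single-wide.

#include <stddef.h>

typedef struct {
    void  *base;     /* the allocated object; what GC/reference counting tracks */
    size_t offset;   /* start of the slice within the object, in elements */
    size_t length;   /* number of elements visible through the slice */
} slice;

static slice subslice(slice s, size_t from, size_t to) {
    /* assumes bounds already checked: from <= to <= s.length */
    return (slice){ s.base, s.offset + from, to - from };
}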

Go’s GC apparently employs the hashmap method of finding the base of the object, which is a global cost for all pointers. Of course if we are compiling to Go, then we will incur that penalty on all pointers anyway, but it’s not my intention that Zer0 should forever be hamstrung by Go. Go is just an efficient stepping stone for getting Zer0 to production state asap.

keean commented 5 years ago

@shelby3 pointer + offset addressing is slower, and uses more registers. It is not just slower, but slower on every operation, so taking an in-place quicksort as an example, it can result in a significant reduction in performance. This kind of thing will result in people wanting to write critical sections (filter kernels etc) in 'C'. Looking at the requirements to support parallelization, which is predominant in numerical computation, we should aim to achieve something equivalent to 'C' or Fortran for kernels, so they can be parallelized on the GPU.

shelby3 commented 5 years ago

@keean thanks for your feedback:

pointer + offset addressing is slower, and uses more registers. It is not just slower, but slower on every operation, so taking an in-place quicksort as an example, it can result in a significant reduction in performance. This kind of thing will result in people wanting to write critical sections (filter kernels etc) in 'C'.

As I wrote previously, the pointer + offset is only necessary for not having horrible performance with pointers to fields when the base pointer of the allocated object needs to be accessed. So that doesn’t need to be pointer + offset, unless we need to compress the storage requirements (i.e. if offsets use much less space than pointers), because we can just make that a tuple of (pointer-to-allocated-base, pointer-to-field).

Additionally, your performance criticism would not apply (even if employing pointer + offset) in a tight loop that was leveraging the ability to combine pointer + offset into a single-wide (i.e. no double pointers) temporary pointer.

Before reading your comment and immediately upon awakening, I had the following realization1.

Where we can prove at compile-time that the lifetime of an interior pointer is contained within the lifetime of any other pointer to the base of the allocated object, then such interior pointers don’t need to be involved in the (reference counting, generational, or mark-and-sweep) GC. Thus we can employ single-wide pointers for them, making them very fast. So presuming we will not be using double-wide pointers (such as for reference counted allocations) which would otherwise negatively impact cache locality, I propose we only allow interior pointers at the language level in the cases where they can be single pointers. For other cases, employ pointer + offset and/or pointer + pointer at the library level, such as for slices in the library implementations of arrays and strings and what not.

Thus I’m thinking that high-performance (performance critical path) libraries will require some lifetime (or at least escape) analysis. Most (~80+%) of the code of the program doesn’t require this level of attention to detail and absolute maximum performance.

Separately there’s the possibility that eventually the typical way to implement a generational GC2 is with a separate young generation collector for each thread that can run simultaneously. Thus if escape analysis proves that an interior pointer (and, if we’ve implemented my generational + RC idea, the interior pointer must not point within a reference counted object) doesn’t leak to the caller, and that the function completes fast enough to not need to be interrupted by the generational GC (which perhaps is an annotation the programmer is allowed to make), then the interior pointer needs only a single pointer.

EDIT#2: I’m contemplating that with an Actor-like model all stack allocation will be safe and thus single-pointer pointers to fields for stack allocated objects will always be allowed.

I also wrote on Quora:

Another reason which I’m sure you aware of but didn’t mention is cache locality. Imagine an array of pairs of objects where each pair object has two pointers (references) for each of the pair instead of storing them in place. So walking the array is no longer contiguous in the cache lines so it will thrash the cache and add latency.

Pointers to fields forces any automated GC to be slower. The alternatives to pointers to fields are copying or pointer+offset:

Pointers · Issue #38 · keean/zenscript

1 But I am late to post it because of some distractionary drama with the n00bs.

2 Which Go apparently currently lacks; instead it employs escape analysis to keep young objects on the stack, which is inferior given it’s not the complete Rust-like lifetime analysis, so presumably many objects leak into the old generation mark-and-sweep collector.


EDIT: @keean was writing to me at LinkedIn about how, if the future is parallelism, with “a modern Intel processor has up to 180 instructions in flight at a time” to saturate it (and his thought that non-speculative CPUs will need 1440 non-superscalar3, non-speculative pipelining cores to perform on the same level as a 32 core AMD Threadripper), then we need memory to perform efficiently with many, many threads. The point being that mark-and-sweep GC definitely does not work well with many cores. AMD Threadripper will have 32 cores with 64 hyperthreads and the GPU has even more.

I’m glad to see that @keean was also thinking about the GC issue with threads. Indeed the young generation compacting collector must typically have a separate instance for each thread. I wrote at Quora:

Páll Haraldsson wrote:

Another reason to not invest in Java is that we are headed to massive parallelism and Java’s GC model is inherently incompatible.

What is it about the “GC model”? For the Julia language with GC it works very well. Your link is about a language that compiles to JS.

All threads must be paused when compacting the young generation with a bump-pointer, compacting collector, unless a separate young generation collector is employed for each thread, which thus requires these objects to be non-shared between threads. There is no such feature on the JVM to create such thread-cordoned objects. The young generation needs to be collected more frequently than the older generation(s). So imagine you have one thread that is slamming the young generational heap with allocations, thus forcing all the other 9999 of your threads to pause much more frequently.

Additionally automated GC (instead of Rust-like stack allocation) appears to be entirely incompatible with scaling of multi-core![1] Yikes!

[1] https://github.com/keean/zenscript/issues/41#issuecomment-407587313, c.f. footnote 2 on the linked post

This is, I think, why Go apparently doesn’t have a young generation collector, as I already stated above: the M:N goroutines apparently have no way of dictating which groups of goroutines must not run simultaneously so they could share the same young generation collector (and the use of the stack instead of young generation collection, although maybe less efficient, is thus able to accommodate M:N without any such groupings). The older generation (even incremental) mark-and-sweep collector, which has one instance for all threads, suffers from Amdahl’s law: the more simultaneously running threads, the greater the percentage of the performance budget that is lost to GC.

This is another reason I had offered my unique proposal to have a separation between programmer-annotated young generation collected objects and annotated reference counted collected objects. Reference counting is more scalable for threading than mark-and-sweep, because the collections can be more localized. But not always, as there can also be cascading dominoes of effects when the reference count declines to 0 on an object which contains pointers to other objects, especially so if young generation objects escape from escape analysis and are allocated to the older generation collector instead of allocated on the stack. And probing for cycles in reference counting also increases non-localized costs. These non-localized cascades can increase the synchronization overhead between threads given a global memory pool for the older generation objects.

The competitive alternative to having a separate young generation GC for each grouping of green threads which can’t run simultaneously, is the tsuris and inflexibility of Rust-like lifetime analysis which presumably enables more objects to be allocated to the stack than simplified escape analysis which Go does. This could potentially enable more threads to be eligible to run simultaneously since it is a more fine-grained proof of non-sharing of mutable references. Whereas, the separation of young generation collectors by grouping is a somewhat analogous but different proof of non-simultaneous sharing of mutable references. There are tradeoffs between the two paradigms.

3 Also, superscalar execution is an instruction-level parallelism optimization that is independent from out-of-order execution, pipelining, and speculative pipelining of branches. Superscalar relies on independent, parallelizable instructions issued to multiple execution units. Whereas instructions can be serially dependent in pipelining, because the instructions are partially overlapped such that each instruction is executed in multiple steps; thus the result of an earlier instruction can complete before it needs to be used by a later instruction which is already in the pipeline and has partially executed. Thus pipelining by definition does not accommodate branches without speculative execution. Superscalar is not pipelining. Superscalar would not leak timing info nor waste power, and can achieve up to 8 instructions in parallel but typically only 2. So we might only need 180 – 360 superscalar threads to match a fully speculative, pipelining 32 core AMD Threadripper, although this superscalar-like instruction-level parallelism feature seems to be inherent in the “the SM more closely resembles an eight-wide vector processor operating on 32-wide vectors” vectorization on the GPU, so we would need on the order of ≈1440 GPU threads to match the AMD Threadripper. Also there’s some limited pipelining (aka arithmetic pipeline) on the Streaming Multiprocessors (SM) of the GPU, but not speculative (presumably because branching is highly undesirable if it will cause divergence between threads in the same warp, because threads in warps are context-switched contiguously), and there’s no out-of-order execution on the SMs of the GPU, so there’s no register renaming. Instead the GPU exploits a much larger register file (RF) so that registers don’t have to be saved and reloaded on context switches (and only one pointer/offset/warp-id into the RF for the registers context-switched for the warp, not individually per thread, because warps are context-switched contiguously), so that a much higher level of algorithmic-level parallelism can be exploited to mask the relatively high latency of all forms of memory on the GPU, e.g. 24 cycle write latency (archived) (c.f. also) for the registers as compared to less than 1 cycle on cores of a modern CPU.

@keean often cites the Itanium as an example of explicit (i.e. compile-time determined) ILP that failed. But he is incorrect to conclude that the superscalar and non-speculative pipelining aspects of the Itanium totally failed. There were some parallelism gains, just not the 45 instructions per core in flight obtained on the modern Intel processors. The fact that the GPU exploits ILP in the SMs without any out-of-order speculation is evidence that superscalar is not necessarily tied to speculative and out-of-order execution.

This link won’t archive so I’m copying the content below:

There is context switching, any time a warp scheduler switches from one thread to another. This thread-switching behavior is desirable as it is part of the latency-hiding process.

A large part of a thread's state is contained in its associated registers (there is only a small amount of additional state, such as the instruction pointer, stack pointer, and flags). A thread's register state does not need to be moved during a context switch because there is enough register storage for all threads/warps that are currently resident and schedulable for execution. This is one of the reasons why GPUs have relatively large register files (e.g. 16K+ registers). Since there is a unique (physical/HW) register for each logical register for each (schedulable) thread, no movement of register data is required during a "context switch".

This is also why register usage can be a limiter to "occupancy". In general, we want as many warps/threadblocks as possible to be "resident", i.e. schedulable for execution, as this increases our ability to hide latency. But each additional schedulable warp carries with it a register "footprint" which must be provided for in the SM registers. When we run out of register storage, then no more warps/threadblocks can be scheduled on that SM, until some of the register usage is released by retiring warps/threadblocks. Therefore imposing limits on register usage per thread may help to drive up "occupancy", i.e. the number of schedulable warps/threadblocks that can be simultaneously resident on the SM. Since this helps with latency hiding, this may be a good thing. On the other hand, limiting register usage may lead to register "spilling" by the compiler, which will increase memory traffic. So there may be a tradeoff to this method of increasing "occupancy".

As a clarification to the above, warps cannot be individually made resident on a SM. The unit of residency on an SM is the threadblock, not the warp. No warps from a threadblock will become resident on an SM until there are sufficient resources for the entire threadblock. At that point, the entire threadblock (all of its warps) may become resident on that SM.

keean commented 5 years ago

@shelby3 I think it is generally true that these kinds of high performance kernels (should) do no allocation or deallocation.

This probably lends itself to stack rather than heap use. I think Ada had a reasonable compromise, which was to keep the owning pointer to a heap object on the stack, and then allow local "fast" pointers. Then you just need an escape analysis to make sure those local pointers do not escape. The owning pointer is like a C++ 'unique' pointer, and the local pointer is like a plain '*' pointer.

I also think reference counting works better with parallel systems, this is like a 'C++' shared pointer.

This kind of makes me think that Apple's approach with 'Swift' is the right one. If you aggressively use escape analysis to keep as many pointers unique/simple as possible, then you only reference count where really necessary.

You could also borrow some tricks from Erlang, and make all shared data read-only. But maybe this is too much like Rust's mutable-reference, maybe that is the best compromise after all?

shelby3 commented 5 years ago

@keean thanks again for helping to brainstorm on this issue.

If the reason we are choosing escape analysis is so that we can support parallelism (i.e. multiple code paths running simultaneously) or preemptive threading, for the code paths that access the same pool of young generation objects, then we also need Rust’s exclusive mutable borrowing for safety. I documented in the WD40 #35 issues thread that Rust’s exclusive mutable borrowing and lifetime analysis is unable to correctly analyse some cases. The flexibility of the programmer is reduced and the tsuris on the programmer is increased. Many programs will be forced to employ unsafe code.

The complexity added by these coding gymnastics is why I think Rust is only best fit for mission critical, extremely high performance programming such as embedded systems (and perhaps also subsystems of widely deployed software such as the rendering engine of browsers) and is not the right tradeoff for a general purpose PL (programming language). If we’re not supporting preemption and instead only allowing thread switching on blocking calls, and if we group the code that we never want to run simultaneously (i.e. the group can run on different hardware hyperthreads but is effectively single-threaded), then as I had pointed out (in the WD40 #35 and Concurrency #17 issues threads) the reasoning about mutability invariants is restricted (and thus simplified) to mutability that straddles a blocking call1.

And these single-threaded groupings would enable us to assign one fast, compacting young-generation GC instance to each single-threaded group (again presuming my idea of young generation GC coupled with reference counting for the old generation is actually the design we eventually want). This would not require the complexity of any escape, lifetime, nor mutability analysis. And it would be entirely flexible other than the need to cordon these groups. Note the groups could share objects which are reference counted and there is no restriction on the number of groups, so parallelism isn’t forsaken (although opportunities for the finest-grained parallelism might be greater with Rust). I think the young generation fast bump-pointer allocating and compacting GC would be nearly (perhaps 50%) as fast as stack activation frame allocation+deallocation. This seems a better fit for a general purpose PL designed to replace JavaScript on the client-side and Go/Java on the server-side. I am not trying to replace Rust or Ada. Later, for maximum performance, we could consider adding some lifetime and escape analysis for libraries that need to be maximally performant, so they can employ inner pointers as well.
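For intuition about the claimed speed, a minimal bump-pointer nursery allocator looks roughly like this (a sketch under the assumption that the nursery’s objects never escape their single-threaded group, so it can be collected or compacted without synchronizing with other threads):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *start, *cur, *end;
} nursery;

/* Allocation is just a pointer bump; when the nursery is full, a minor
   collection copies the survivors out and resets cur back to start. */
static void *nursery_alloc(nursery *n, size_t size) {
    size = (size + 7) & ~(size_t)7;             /* keep 8-byte alignment */
    if ((size_t)(n->end - n->cur) < size)
        return NULL;                             /* caller triggers a minor GC */
    void *p = n->cur;
    n->cur += size;
    return p;
}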

K.I.S.S. principle.

Also please remember that easy-to-use PLs are more popular! I have my marketing hat on in addition to my engineering hat. I want a PL that is as easy to use as JavaScript, and much better and easier to use than Java/Go/C++ while also better in nearly every way (although not equivalent in absolute maximum performance to C++/C/Rust, yet so much less tedious, and perhaps we eventually find a way to make the performance-critical-path coding as fast as those).

Also it seems we really do not need to make this decision for the first version of the Zer0 PL which compiles to Go. We simply need to disallow inner pointers to start with. My main goal for the first version of Zer0 is not to maximize performance.

Once we have perfected the syntax and type system of Zer0, and have some experience writing code, then we can move forward on maximizing performance, within a goal of making 80+% of the general purpose programming uncomplicated even if that means sacrificing 50% of the performance. Then yet later again move on to maximizing the performance of the remaining 20-% of the coding, to try to get as close as possible to C and Rust speed within the requirement that the 80+% of the code not be complicated.

Prioritization of effort is very important. Thus I think we need a somewhat hierarchical perspective on which PL paradigm should be used for the 80+% versus the 20-% of the code for each program.

1 My understanding is that currently Go’s goroutines are preemptive (but not as preemptive as Erlang) and there are no such guarantees other than the fact that code in a single goroutine runs single-threaded w.r.t. itself. I tried to suggest to them the model of grouping goroutines that would be restricted to emulating a single-threaded group, and they seemed to think that had nothing to do with the M:N server model of scaling that they are focused on. We might be able to simulate these groupings on Go with our Zer0 compiler by having the compiler insert channel-blocking scaffolding after each blocking call in a goroutine, with some master control for the grouping of goroutines that blocks all but one of them in the group in a round-robin queue. So I suppose we could decide to emulate the groupings for a future young generation GC model with the version of Zer0 that will compile to Go.

@keean wrote:

I also think reference counting works better with parallel systems, this is like a 'C++' shared pointer.

Agreed. I can’t think of a better way to share allocated objects in the multi-threaded context. This is why I propose reference counting for older generation objects (which thus can be shared between multiple threads).

This kind of makes me think that Apple's approach with 'Swift' is the right one. If you aggressively use escape analysis to keep as many pointers unique/simple as possible, then you only reference count where really necessary.

Apple’s approach is perhaps best for power-constrained mobile when perfection is prioritized over ease-of-programming. I’d prefer to try to be more clever if we can, and match that performance for the performance critical paths, but make the 80+% of the programming much less complicated and easier. There’s no way to do the most aggressive lifetime analysis without increasing the complexity for the programmer.

Rust and C++ are very complex because of what I wrote before:

Yet note the level of complexity in C++ in order to attain higher-level abstractions with very low-level control performance.

@keean wrote:

You could also borrow some tricks from Erlang, and make all shared data read-only. But maybe this is too much like Rust's mutable-reference, maybe that is the best compromise after all?

Yeah I think if we’re going to go for complexity with some onerous restriction of read-only for sharing or Rust’s exclusive mutable sharing, then might as well do the lifetime analysis and provide maximum flexibility within that higher complexity paradigm. But this is not the complexity I want in 80+% of my code! This added complexity enables multi-threaded shared access of young generation objects. I do not think 80+% of my code needs that capability.

Clojure also heavily leverages immutability. I also like the idea of supporting optional immutability restriction for objects allocated with reference counted pointers for sharing of older generation objects between multi-threads. Which is a simpler optional (let the programmer decide) paradigm that works with both my grouping for young generation objects idea or the Rust-like complexity for stack allocation.

I think Ada had a reasonable compromise, which was to keep the owning pointer to a heap object on the stack, and then allow local "fast" pointers. Then you just need an escape analysis to make sure those local pointers do not escape. The owning pointer is like a C++ 'unique' pointer, and the local pointer is like a plain '*' pointer.

My prior analysis did not favor Ada over Rust. Thoughts?

keean commented 5 years ago

@shelby3 The reason for read-only would be for better parallelism. Take a look at Erlang.

shelby3 commented 5 years ago

@keean I already wrote I would agree with optional read-only. But I do not agree with the onerous restriction that all shared data has to be read-only, unless I can be convinced of some overriding benefit of such an onerous restriction. Let the programmer decide. What benefits would we achieve by removing it from programmer discretion?

Immutability also increases complexity. And immutability is not solving the problem of fast GC in the coming highly parallelized hardware world (64+ hyperthreads). We still need either the young generation GC or escape/lifetime analysed stack allocation.

I do agree that shared data in a highly parallelized context will need to be immutable. But that doesn’t mean all shared data must be immutable. It also doesn’t provide a solution to the GC performance problem unless we make all objects (young and old generation) reference counted, which would be very non-performant compared to the other options. So even with immutability, we still need to solve the young generation GC performance issue, as this problem will grow worse as the number of cores on the hardware increases (as you duly noted). This is why Clojure is in trouble in the future if running on the JVM with its single instance of young generation GC for all threads. The coming shift to high core count hardware is probably the Achilles heel and death of the JVM.

EDIT: I think I was editing my post when you wrote your comment. Please refresh. I did not see your comment until I reloaded the page just now.

keean commented 5 years ago

@shelby3 https://www.erlang-solutions.com/blog/the-continuing-headaches-of-distributed-programming.html

shelby3 commented 5 years ago

@keean but that Erlang shill article is simply wrong. Message passing or even immutability don’t really solve the problems of shared state synchronization, rather they just push the race conditions to another level. @dmbarbour had already explained this to you:

http://lambda-the-ultimate.org/node/5209#comment-88463 http://lambda-the-ultimate.org/node/3108#comment-45406

There’s no panacea for shared state. It’s difficult.

EDIT: they also propose using a messaging paradigm for local CPU and DRAM, but the latency and memory coalescing are much different locally than over the wire. They are pretending that there can be this nirvana of one paradigm for both, but AFAICT the reality belies their fantasy.

I do not write my DRAM reads as asynchronous code, i.e. I expect memory accesses to block my thread. Coalescing of DRAM accesses is not an asynchronous model.

I think parallelism is more about algorithms than it is about one paradigm of programming. Attaining a very high degree of parallelism requires being astute with the algorithms employed and the overall holistic design of the program.


EDIT#2: ostensibly @keean is concerned about Amdahl’s law as it applies to contention overhead if all threads need fast local cache access to state (i.e. allocated objects) which is shared between threads. His point is that as the number of threads increases, the overhead of keeping all the thread-local caches updated eventually becomes greater than any speedup from adding more cores. This is why GPUs don’t guarantee consistency of global memory for all threads unless (and until after) the writing thread completes a threadfence() call (except for those in the same CTA block (c.f. also)3 with read-after-write synchronization obtained with __syncthreads()).

So @keean is apparently proposing that objects shared between threads should be read-only and sent as a message (e.g. where the sending thread blocks on the threadfence() call for sending/synchronizing to all threads in the case of local storage on a GPU instead of sent over the wire).

However his proposal appears to be suboptimal for some use cases. Which thread sends a data structure which is shared globally with all threads? The Erlang or Actor model doesn’t apply.

For example, a blockchain node thread which upon receiving a new transaction request does an update of a record in the UTXO data stored in global memory which is shared between all transaction handling threads on the node. The record should only be updated by that one thread which first received an applicable transaction request and the record is typically larger than the 512-bits for the “Six 64-bit DRAM channels” so no warp coalescing (c.f. also) is needed for optimal utilization of memory bandwidth. Apparently shared local memory is interleaved for optimal warp coalescing (c.f. also).

Thus immutability of said record object is undesirable. Instead the thread which wants to update the record must use a mutex employing atomicCAS() and atomicExch() to fence off access to the record while it is being updated1. Either the contention over this mutex will be nearly non-existent (i.e. it’s presumed to be very rare that another thread will attempt to access the same record simultaneously2), so the cost of accessing the mutex in global memory, as opposed to keeping it consistent in the local cache for all threads, is nil, because the mutex will most often need to be written to global memory anyway. Or an alternative strategy for the mutex can be employed that makes the contention irrelevant2. This alternative strategy potentially shares synchronized data only with some threads.

The UTXO is not a linked-list. I think it is probably an array, many GBs in size. And then a hashmap to map from txids (which are a 256-bit hash) to array indices.

We could replace UTXO records instead of editing them, so the records are immutable, but the array itself would not be immutable. But that does not remove the need for the mutex. Would still need to prevent two threads from operating on the same record simultaneously.

1 Note the unlock operation may also require a threadfence() to guarantee that the global memory for the record which was modified has propagated to all threads (outside the thread’s CTA block) before issuing the atomicExch().

2 Presumably there’s some anti-DoS defense mechanism in place to prevent a spammer from hammering the same record of the UTXO with simultaneous transaction requests. A Hashcash style proof-of-work is one possible anti-DoS method, but proof-of-work has an asymmetrical cost for typical legitimate (especially mobile) users compared to attacks with customized ASICs. Better if each transaction request requires a fee (and possibly additionally a forfeitable penalty fee if it’s a duplicate or follows too soon after a prior transaction request), but the problem is circular in that fee payments are also transaction requests. So indeed instead maybe we do need to propagate the mutex to every thread’s local shared cache, so then we’d like to employ a single bit for each lock (perhaps via atomicXOR()) to minimize the size of the cache required and the bandwidth of the data that has to be propagated to all threads. Or we could have one of, or a subset of, all threads whose only function is to check the mutex for other threads, so that the mutex data only needs to propagate to the local cache for fewer than all threads. If the number of records that is locked at any given time is small compared to the total number of records, a hashmap of locked records might be a small data structure if conservation of cache space is a higher priority than speed of lookup. Another strategy is that threads don’t have to block waiting for a mutex to be checked and become unlocked, i.e. they could queue up locks and issue new mutex requests asynchronously to add to the queue while processing already locked records. So that with sufficient server performance, the DDoS defense can consist of converting the attack into effectively a bandwidth attack which can be absorbed by perimeter nodes.

3 Note a Streaming Multiprocessor (SM) on the GPU can run more than one thread block simultaneously but each thread block can only access the portion of shared memory it has allocated. IOW, shared memory is not shared between thread blocks which run on the same shared memory cache, because that cache is allocated from separately for each thread block.

shelby3 commented 5 years ago

Please note that I have decided to continue this up-thread discussion, which veered off into the parallelism topic, in the Parallelism #41 issues thread.

Let's please try to make all further replies about parallelism in that Parallelism thread.

shelby3 commented 5 years ago

I wrote:

@keean thanks for your feedback:

pointer + offset addressing is slower, and uses more registers. It is not just slower, but slower on every operation, so taking an in-place quicksort as an example, it can result in a significant reduction in performance. This kind of thing will result in people wanting to write critical sections (filter kernels etc) in 'C'.

As I wrote previously, the pointer + offset is only necessary for not having horrible performance with pointers to fields when the base pointer of the allocated object needs to be accessed. So that doesn’t need to be pointer + offset, unless we need to compress the storage requirements (i.e. if offsets use much less space than pointers), because we can just make that a tuple of (pointer-to-allocated-base, pointer-to-field).

[…]

Where we can prove at compile-time that the lifetime of an interior pointer is contained within the lifetime of any other pointer to the base of the allocated object, then such interior pointers don’t need to be involved in the (reference counting, generational, or mark-and-sweep) GC. Thus we can employ single-wide pointers for them, making them very fast. So presuming we will not be using double-wide pointers (such as for reference counted allocations) which would otherwise negatively impact cache locality, I propose we only allow interior pointers at the language level in the cases where they can be single pointers. For other cases, employ pointer + offset and/or pointer + pointer at the library level, such as for slices in the library implementations of arrays and strings and what not.

The Actor model proposal developed in the Parallelism thread will enable the aforementioned compile-time proof of safety, so that we can take the address of fields in reference counted objects and non-reference counted objects when assigning to a reference for a non-reference counted object.

No double-wide (aka double pointer) references will be required, except possibly when assigning the address of a field in a reference counted object to a reference for a reference counted object. If we will be using double-pointers for these cases, then they should be handled by libraries (i.e. not in the PL) because single-wide and double-wide pointers would be non-fungible and create a “What Color is Your Function?” bifurcation. (Note assignments of non-reference counted objects to references for reference counted objects are obviously not allowed.)

An alternative, so as to avoid the aforementioned bifurcation, is to not employ double-pointers and instead incur the performance degradation (e.g. apparently Go employs a hashmap of addresses to base object addresses! yikes!) of finding the base of the allocated object for these intra-object pointers (and as previously stated this is only necessary for reference counted references), so that it can be provided fungibly by the PL. The base of the object is only required when the master (aka total) reference count needs to be changed (which would be on every assignment to another reference or destruction of a reference, unless weighted reference counting is employed and/or except that Go employs static analysis to minimize when the reference count needs to be changed). This would be a more attractive option if the performance degradation was only applied to reference counted references that point to intra-object fields and not to all reference counted references. This optimization would require tagged pointers so that the tag bit could indicate whether the pointer is to the base of the allocated object. (Note without a tag on the pointer, a magic value could be used at the base of the allocated object, but this would waste space and would be impractical because it would require somehow ensuring fields never have that magic value.)

Tagged pointers also make weighted reference counting more efficient. Even when the local tagged weight goes to 0, the master (aka total) reference count can be increased and the local tagged weight set to maximum again. Unfortunately tagged pointers aren’t allowed on AMD64 (aka x86-64). Whereas on mobile, iOS on ARM64 supports up to 34 bits for tags. And Linux on ARM64 (aka AArch64) supports 8 bits for tags (c.f. also and also). Not employing double pointers for intra-object reference counted references (or not disallowing such references) would cause performance to be much worse for all reference counted references on x86-64 desktops and servers because they lack support for tagged pointers. But screw Intel and AMD, because ARM64 is coming to the server (c.f. also and also) and Qualcomm needs a new niche (c.f. also) such as this programming paradigm we’re designing into Zer0! Funding for the development of Zer0 may pour in from these large companies (Microsoft, Qualcomm, Google, etc) that want a competitive alternative to x86-64 for the server.

So if employing tagged pointers for weighted reference counting then a 0 count (because it should never be 0 for a reference to the base of the allocated object) could indicate that the reference points to an intra-object field.
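A very rough sketch of how weighted reference counting with tag bits might look (entirely my own illustration, not an existing Zer0 or Go mechanism; it assumes hardware that ignores the top 8 address bits on dereference, e.g. ARM64 TBI):

#include <stdint.h>

enum { WEIGHT_SHIFT = 56, MAX_WEIGHT = 255 };   /* weight 0 could mark an intra-object field reference */

typedef struct { uint64_t total; } obj_header;  /* master count stored at the allocated base */
typedef struct { uint64_t bits;  } wref;        /* tagged reference: address | (local weight << 56) */

static obj_header *wref_object(wref r) { return (obj_header *)(r.bits & ((1ULL << WEIGHT_SHIFT) - 1)); }
static uint64_t wref_weight(wref r) { return r.bits >> WEIGHT_SHIFT; }
static wref make_wref(obj_header *o, uint64_t w) { return (wref){ (uint64_t)o | (w << WEIGHT_SHIFT) }; }

/* Copying a reference splits its local weight; the shared master count is
   only touched when the local weight runs out (refill) or a reference dies. */
static wref wref_copy(wref *r) {
    obj_header *o = wref_object(*r);
    uint64_t w = wref_weight(*r);
    if (w < 2) { o->total += MAX_WEIGHT - w; w = MAX_WEIGHT; }   /* refill local weight */
    uint64_t give = w / 2;
    *r = make_wref(o, w - give);
    return make_wref(o, give);
}

static void wref_drop(wref r) {
    obj_header *o = wref_object(r);
    o->total -= wref_weight(r);
    if (o->total == 0) { /* last weight returned: deallocate the block here */ }
}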

P.S. Tangentially note AFAICT these different reference types (reference counted or not reference counted, and another proposed annotation) will not impact typeclasses. The compiler will determine if there’s some incompatible mixing and report an error, so AFAICT this doesn’t need to be modeled by the typeclasses. But I still need to do some more thinking about the Mutable Objects #32 issues thread.

shelby3 commented 4 years ago

@keean I thought you might find the choice of Ada to fix “open sores” worthy of mention here in our discussions.