keean / zenscript

A trait based language that compiles to JavaScript
MIT License
42 stars 7 forks

Concurrency #17

Open shelby3 opened 7 years ago

shelby3 commented 7 years ago

I have a long comment at the Rust forum (some of the inspiration came from @keean linking me to Sean Parent's video presentation) and another one that I posted to another forum, which delve into the various options for concurrency and comparing them. Note the negative aspects of stackless coroutines enumerated in the second linked comment post.

Also note that my code that employs an ES6 generator to implement an emulation of async / await has never been tested, but I did update and correct it recently.
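
For illustration, here is a minimal sketch of that general technique (not the actual linked code; names are illustrative): a driver function that runs a generator, treating each yielded Promise as an "await".

```javascript
// Minimal sketch of async/await emulated with an ES6 generator:
// a driver that resumes the generator each time a yielded Promise settles.
function run(genFunc) {
  return new Promise((resolve, reject) => {
    const gen = genFunc();
    function step(method, value) {
      let result;
      try {
        result = gen[method](value); // resume the generator body
      } catch (err) {
        return reject(err);
      }
      if (result.done) return resolve(result.value);
      // Treat the yielded value as a Promise and resume on settlement.
      Promise.resolve(result.value).then(
        v => step("next", v),
        e => step("throw", e)
      );
    }
    step("next", undefined);
  });
}

// Usage: `yield` plays the role of `await`.
run(function* () {
  const a = yield Promise.resolve(1);
  const b = yield Promise.resolve(2);
  return a + b;
}).then(sum => console.log(sum)); // logs 3
```

This is essentially what early async/await transpilers (e.g. Babel's regenerator) emit under the hood.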

@spion wrote a blog on why ES6 generators offer more flexibility than async / await.

Continuation-passing style (CPS) can be employed for stackless coroutines, which lowers overhead (PureScript, for example, supports this with a continuation monad), but they are very restrictive and don't interoperate well with the imperative programming that we do with JavaScript.

Concurrency is hard and we are not aiming for maximum performance of context switches, thus I think stackless coroutines are a pita and not worth it. For someone who needs that control, let them use another language (e.g. PureScript or C++) and interoperate through our FFI or JavaScript.

I need generators to support custom loaders, my emulation of async / await, and in general to interoperate well with the JavaScript ecosystem, e.g. Node.js.

Thus I think we should support ES6 generators initially. And we can look into coroutines later.

Thoughts?

keean commented 7 years ago

I think generators and co-routines are the way to go. Everything should be eager unless specifically lazy. That one was easy :-)

shelby3 commented 7 years ago

So are you okay with not doing anything in the language for stackless co-routines? The stackless form can be implemented with a continuation monad library combined with declared purity?

Edit: I have since realized that monads are employed to enable data-race-safe sharing of mutable state emulated by copying immutable data, which has nothing to do with the choice of the underlying mechanism of the usermode (i.e. not kernel context-switching) cooperative multitasking. Monads enable controlling stateful effects within a pure, referentially transparent, immutable paradigm.

keean commented 7 years ago

Stackless co-routines are cool. It's basically the same as JavaScript's event model. If we are compiling to JavaScript we have to cope with this. If we write an infinite loop, JavaScript requires returning to the event loop to process IO.

Is it best to give programmers a model where they write a linear synchronous program which is converted to the event model underneath and runs on a stackless co-routine runtime (transpiling to 'Go' would be interesting), or is it better to give them an asynchronous event-based model?

shelby3 commented 7 years ago

@keean wrote:

Stackless co-routines are cool.

But they have compositional (and perhaps other) trade-offs. They seem to force either low-level tsuris, or monadic tsuris. Remember in general that monads don't compose (Applicatives do compose).

Stackless co-routines are cool. It's basically the same as JavaScript's event model.

I believe that may be incorrect. JavaScript afaik uses stackful event loops (as in multiple event loops running concurrently, one for each JavaScript instance). Afaik JavaScript is not using a jump in CPS style; rather the script must exit, or control is passed to the event loop with generators and returning a Promise.

Edit: I have since realized that JavaScript’s callback model unwinds the stack, storing the callback context on the heap, to return to the event queue scheduler. So there is only ever one stack in use.

If we are compiling to JavaScript we have to cope with this. If we write an infinite loop, JavaScript requires returning to the event loop to process IO.

This is accomplished with generators and Promises (or setTimeout <--- notice camel case), which afaik are stackful.

Is it best to give programmers a model where they write a linear synchronous program which is converted to the event model underneath and runs on a stackless co-routine runtime (transpiling to 'Go' would be interesting), or is it better to give them an asynchronous event-based model?

Afair, you posed that thought to me before on the Rust forum. Can you find my response? I find it arduous to locate details of our past discussions on Rust's forum.

Generally I have disliked the design choices of almost everything about Go's design and I think I had already explained why I didn't like Go's concurrency model, but I forget by now.

keean commented 7 years ago

What do you think about the actor model?

shelby3 commented 7 years ago

I remember you explained that some aspects of Go, and especially the concurrency model, are applicable to some use cases. And I recall I didn't (at that time) find those use cases compelling for the programming I need to do right now.

I am hoping (pie-in-the-sky right now trying to resolve my health ailment) to build a mobile app platform (headed towards abstracting away the operating system to a large degree); also I need to do a lot of coding for JavaScript for client and server reusing the same code base on both (i.e. Node.JS), so I need high-level abstractions, modularity, extensibility, and composition. If I were doing systems programming, I'd have different priorities.

When I need to drop down to systems level tsuris, I'll probably be hoping for something better than C++, but seems Rust may have blown that chance with the mandatory lifetimes (or maybe that is exactly the feature needed if people will use it). I might have a different perspective on Go for that use case. We’ll see.

As for Actor model, that is more than I want to tackle today. I'll come back to it.

SimonMeskens commented 7 years ago

I'm a big fan of the actor model. Just as an aside.

Anyway, I think CPS with tail optimization basically gives you the best possibilities, combined with syntax that compiles into a stack machine (ES6 generators work like this and async/await is just a particular generator).

I really really love JavaScript's concurrency model with setTimeout. Basically, JavaScript doesn't use coroutines; it uses a polling-based event loop. A game loop in JavaScript is written like this:

function loop() {
  console.log("ping!");
  setTimeout(loop, 0);
}

loop();

This means: run loop, then yield to the event loop and ask it to rerun loop as soon as possible. During the execution of the loop, no other code can run (not even UI events; there are no interrupt events in JavaScript). So when you write an infinite loop, you have to write a recursive loop that yields to the event loop with setTimeout instead. Notice that giving it a timing of 0 basically means an infinite loop (0 means loop as fast as possible), but after each iteration, it allows the UI logic to update.

I personally think this is THE best system I've ever seen in a language. It makes sure that you get concurrency, without state issues. For lots of concurrency use cases, you don't actually need threading and JavaScript's model allows you to do it very elegantly.

So yeah, sticking to that same model seems logical to me.

keean commented 7 years ago

We could have a first class actor model:

actor Queue :
    list : Nil
    add : (activity) =>
        list.push(activity)
    remove : (activity) =>
    // etc

The idea is the actor can have state but all the 'methods' must be pure, and we model calls as asynchronous message passing (which can be optimised to a normal function call).
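
One possible desugaring of that `actor Queue` sketch into today's JavaScript (hypothetical; names are illustrative) is an object whose private state is only ever touched by one message handler at a time, serialized through a mailbox:

```javascript
// Hypothetical desugaring of the `actor Queue` sketch: each method call
// becomes an asynchronous message chained onto a private mailbox, so the
// actor's state is never touched by two handlers concurrently.
function makeQueueActor() {
  const state = { list: [] };      // private state, never shared directly
  let mailbox = Promise.resolve(); // serializes message processing

  function send(handler) {
    // Chain the handler onto the mailbox so messages run one at a time.
    const result = mailbox.then(() => handler(state));
    mailbox = result.catch(() => {}); // keep the chain alive after errors
    return result;
  }

  return {
    add:    activity => send(s => { s.list.push(activity); }),
    remove: activity => send(s => {
      const i = s.list.indexOf(activity);
      if (i !== -1) s.list.splice(i, 1);
    }),
    size:   () => send(s => s.list.length),
  };
}

// Usage: calls look like ordinary method calls but are really messages.
const q = makeQueueActor();
q.add("task1");
q.add("task2");
q.size().then(n => console.log(n)); // logs 2
```

In JavaScript's single-threaded runtime the mailbox is mostly about ordering, but the same shape generalizes to true message passing between threads or processes.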

SimonMeskens commented 7 years ago

I'm not sure if I agree with syntax, but yeah, first class actors would be extremely interesting to have. I think actors also combine very elegantly with generators and iterators for a VERY powerful concurrency model.

Combined with a yield to event loop syntax, this would make ZenScript close to the most powerful concurrent language on the market.

I just wonder if we can somehow marry records and actors under the same system, so all actors are records or all records are actors. I'd also like to get delegation into that system. I'm not very familiar yet with the current ideas on records, so I'll have to read up on that and come back to you.

keean commented 7 years ago

Records are data, so would be the messages sent between actors, or in the internal state of actors etc. So actors are not records, actors receive and respond to messages (they act), records are just plain data (although in theory a message could contain an actor address, or even a serialised actor). So we end up with something like actors can contain records, and records can contain actors, but actors are not records.

SimonMeskens commented 7 years ago

I'd argue you can view a record as an actor that responds to queries (messages) for data in a specific format, but I get your point. We can always look into this issue later, it's not very critical. It'd also devolve into a discussion about non-deterministic arbiters and such, which is also something I don't want to get into.

In summary: first-class actors that are encouraged and influence library design = good. I envision libraries that offer lots of functionality through monadic abstractions over actors in a way that makes it look like any other language and I smile 😄

shelby3 commented 7 years ago

@SimonMeskens wrote:

During the execution of the loop, no other code can run (not even UI events, there's no interrupt events in JavaScript). So when you write an infinite loop, you have to write a recursive loop that yields to event loop with setTimeout instead […]

I personally think this is THE best system I've ever seen in a language. It makes sure that you get concurrency, without state issues.

There’s still the potential for data race unsafety for the mutable data shared between your loop and any other event handlers that run every time your loop does a setTimeout and returns to the runtime’s event dispatching scheduler.

And the callback you provide to setTimeout has to save the state and restart location of your loop’s progress. It’s a lot of cruft compared to Go’s green threads (aka goroutines) which enable what appears to be sequential synchronous code.
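
A small illustration of that cruft (a hypothetical example; the function names are mine): a loop that processes an array in chunks must manually package its loop index and accumulator into the callback, whereas goroutine-style code would just be a plain loop.

```javascript
// The state and restart location of the loop must be captured manually
// in heap-allocated closure variables so the callback can resume it.
function sumInChunks(arr, chunkSize, done) {
  let i = 0;      // restart location, saved on the heap
  let total = 0;  // loop state, saved on the heap
  function step() {
    const end = Math.min(i + chunkSize, arr.length);
    for (; i < end; i++) total += arr[i];
    if (i < arr.length) {
      setTimeout(step, 0); // yield to the event loop, then resume
    } else {
      done(total);
    }
  }
  step();
}

sumInChunks([1, 2, 3, 4, 5], 2, t => console.log(t)); // logs 15
```

In Go the equivalent is an ordinary sequential for-loop inside a goroutine, with the runtime preserving the stack across preemption.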


Regarding my comment post about Actors (← click for link to my definition of unbounded nondeterminism), I found the following interesting:

https://youtu.be/nXYGPYnCokE?t=541

He makes a very interesting point that aspects of models which do not preserve bijective ("bidirectional") structural equivalence, are thus introducing the aspect of thermodynamics that processes are irreversible. What he really should say is such a model doesn't require a total order and allows partial orders, which is entirely necessary for modeling asynchrony. A total order is bounded, whereas the set of partial orders is unbounded.

https://youtu.be/nXYGPYnCokE?t=1460

The statement that "forgetting is the essence of thermodynamics consequences" is from the perspective of any relative observer inside the system, but that doesn't mean the information no longer exists from the perspective of a total observer outside the system. The existence of a total observer can't be required in the system without bounded nondeterminism. So unbounded nondeterminism requires forsaking omniscience. As I had mentioned in the past, universal omniscience requires a non-quantifiable speed-of-light (so that any observer can be total in real-time), which would collapse the light cones of special relativity, rendering the past and future indistinguishable, and nothing distinguishable could exist.

shelby3 commented 7 years ago

More on comparing Actor model to π and rho calculi:

https://youtu.be/oU4piSiYkE8?t=488

The Actor model doesn't have an explicit algebra of channels (channels apparently require additional overhead because they expose a race condition among the concurrent processes which can access them), although channels can be created in the Actor model by creating a new child Actor for each channel, where the "child" relationship is maintained in the state internal to the involved actors, in that the child actor forwards all received messages to the parent actor. Thus apparently the π and rho calculi algebraically model this particular configuration of the Actor model. So the distinction is what can be analyzed algebraically in the model, not an enhancement of the computational power of the model. An analogy, I guess, is the NAND or NOR gates from which all digital logic can be constructed, yet we don't program in such a low-level model. So the Actor model is fundamental but more low-level than the π and rho calculi.

shelby3 commented 7 years ago

Please check my logic, facts, and analysis in this comment post.

Go's green threads (much more lightweight than OS threads) are both more and less performant than JavaScript's callbacks (i.e. Promises) depending on the usage scenario.

For example, when a stack of nested functions is waiting on the result of the innermost function, Go's very low cost context switch which retains the stack is more performant than unwinding the stack and essentially copying it into a nesting of Promises for accomplishing returning from each nested function to the top-level JavaScript event loop.

Vice versa, when parallelizing independent code paths¹ (irrespective to scheduling algorithm) to make multiple blocking operations concurrent, Go needs a goroutine and thus an allocated stack for each concurrent operation; whereas, JavaScript only needs to allocate a Promise for each operation.

An example of this latter case is invoking map over a range (e.g. of an array) with a blocking function as the input, such that we want the invocation of the blocking function for each element of the range to run concurrently (but not necessarily executing simultaneously in parallel, depending on whether hardware-thread scheduling is employed). So for JavaScript, the code would invoke the blocking function on each element, saving the returned Promises in an array and assigning continuation code to the then method to save the result for the output of the parallelized map operation; whereas, the Go code would spawn a goroutine with the continuation code for each invocation.

The continuation code of both would decrement a counter to track when the parallelized operation was complete. The JavaScript completion code would fire the callback in the Promise returned from the parallelized operation; whereas, for Go the goroutine owning the parallelized operation would sleep awaiting the 0 value of the counter. When more than one hardware thread is employed by the parallelized instances, then the counter decrement must be protected by a mutex (which, as an alternative to shared write access, could effectively be achieved by accumulating all decrements as messages to a single thread which performs the decrement, which also requires critical-section synchronization for updating the message/channel queue).
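
The JavaScript side of that comparison can be sketched concretely (a hand-rolled version with the explicit countdown counter described above; `Promise.all` packages this same pattern):

```javascript
// Run a blocking (async) function over every element concurrently,
// using an explicit countdown counter to detect completion.
function parallelMap(items, asyncFn) {
  return new Promise((resolve, reject) => {
    const results = new Array(items.length);
    let remaining = items.length; // completion counter
    if (remaining === 0) return resolve(results);
    items.forEach((item, i) => {
      asyncFn(item).then(value => {
        results[i] = value;
        remaining -= 1;           // single-threaded, so no mutex needed
        if (remaining === 0) resolve(results);
      }, reject);
    });
  });
}

// Usage: three "blocking" operations run concurrently, one Promise each,
// with no per-operation stack allocated.
const delayDouble = x =>
  new Promise(res => setTimeout(() => res(x * 2), 10));
parallelMap([1, 2, 3], delayDouble)
  .then(out => console.log(out)); // logs [ 2, 4, 6 ]
```

The Go version would instead spawn one goroutine (and thus one stack) per element, decrementing a shared counter under synchronization.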

Hardware thread scheduling with JavaScript is afaik currently non-portable (e.g. no Web Workers on Node.js) and less flexible (e.g. shared objects are forced to be thread safe by copying or borrowing [Edit: there’s SharedArrayBuffer, but as such it’s of limited functionality because SharedArrayBuffer doesn’t integrate with GC, Object, exceptions, nor yield generators]). Web Worker implementations are not forced to implement M:N green threading. So optimization of hardware utilization is less flexible on JavaScript. I conjecture that it is hypothetically plausible to combine both Promises and green threading in some optimized way, but that is not currently possible in JavaScript.


¹ Rob Pike states an incorrect definition of concurrency in the general case, because in general there is no way to prove that a nondeterministic composition of code paths is independent, since that would require a total order, which is the antithesis of unbounded nondeterminism, a.k.a. the indeterminism case (where the unbounded entropy of the universe is intertwined in the state machine of the program); i.e. it can't be proven that the state interactions in concurrency are faultless or fault tolerant.

Sean Parent provides a different definition. John Harrop’s definitions of concurrency and parallelism seem to jibe with mine.

And Pike's error is also a weakness of Go's CSP (communicating sequential processes, i.e. channels) in that without bounded nondeterminism it is impossible to guarantee the channel communication won't deadlock, as I wrote before:

as contrasted with Go's CSP that attempts to more tightly synchronize and bound determinism

Even though concurrency in general can potentially deadlock on any dependent state:

An Actor employs futures (aka Promises) on result values to create a decentralized stack, which is claimed to prevent deadlocks on mutual recursion, but clearly it won't prevent deadlocks on locking a resource, so resources must not be locked while waiting for the response to a message. Not to be confused with asynchronous sending of messages, which means not blocking waiting for the receiver to acknowledge receipt, which is one of the differences from Go's CSP.

shelby3 commented 7 years ago

Go is correct that the first use case example from my prior comment can always be handled by the language and appear to the programmer at the source code level to be a synchronous function call. That assumes the programmer intends not to execute any of his source code concurrently while waiting for the result of the said blocking function. The programmer could convey this intent by assigning or casting the result value to the value type of the Promise for languages employing Promises, and by not passing[waiting on?] a channel to the blocking function in a language which employs channels such as Go. In that case, the compiler could infer that it should arrange the necessary concurrency (implemented either via the event loop model or green threads).

However, the only reason the programmer needs access to Promises or channels for this usage scenario is to manually encode parallelism inherent in the algorithm (i.e. independent code paths), but which the compiler could in theory infer if it is informed as to which functions are pure and which objects are read-only, borrowed, etc. Given this latter information, the compiler could even infer the encoding of the inherent parallelism in the second use case example from my prior comment. Even the (presumably recursive) algorithm of the loccount tree traversal parallelism in Eric Raymond's Go blog could hypothetically be inferred by the compiler.

So I come back to my interpretation of the generative essence of Sean Parent's thesis, which is that the more completely and accurately we can express our algorithms, the fewer fragile optimization kludges will creep into our code. In other words, the programmer's job is to convey the algorithm and the compiler's (and runtime's) job is to optimize it. In other words, if the compiler can infer the parallelism, then it is free to optimize the number of threads dynamically (and even perhaps to choose between an event loop or green threads on a case-by-case basis). Compilers and runtimes thus become smarter and programming becomes simpler with more algorithmic clarity (fewer implementation details and less premature-optimization clutter).

Thus I am asserting that both JavaScript and Go have some catching up to do with Haskell, which I've read is already quite good at optimizing and inferring parallelism and other optimizations expressed in the algorithm. But Haskell is too obtuse, and as I explain below, 100% purity is too restrictive (as we discussed in the thread on purity, it forces purity everywhere as a total order, requiring monads to model effects as types, yet monads aren't generally composable). There is hopefully a better balance that could be achieved: a marriage of purity, declared side-effects (effects as types?), and read-only declarations with an imperative language, without all the category theory when it isn't necessary.

Sometimes it is necessary to borrow the lifetime, because just declaring an object read-only is insufficient to convey the algorithm. For example, for inferring the parallelism: if we know the input tree object to the loccount example is borrowed, then we don't care if the per-file counting function is pure, because we know it can't modify the input tree object if we aren't providing the said function a borrowed reference to the tree. But we only need to borrow it as read-only, so it is not an exclusive borrow, except exclusive of writing. However, as @keean and I remember from our analysis of Rust, borrowing is a global total-order exclusion paradigm. Whereas, if we know that the per-file counting function is exclusive of the input tree object (i.e. in a partial order of exclusion), then we don't need the paralysis and tsuris of a total order on borrowing. The only way I can think of to achieve that partial order is if the said per-file counting function is pure, or we make a read-only copy of the input tree object.

Note a pure per-file counting function could read from the files without making itself impure. A function which writes to a file can also be "pure" (aka referentially transparent) w.r.t. any context which doesn't read from the file, because it doesn't call any function which modifies any state of the program; i.e. that some external context can read from the modified file is irrelevant if the compiler is only analyzing inherent parallelism for a limited context which doesn't read from the file. So it seems perhaps we need not only pure function declarations but also declarations of specifically which side-effects an impure function creates.

Another advantage (in addition to the aforementioned simplicity, clarity, and antifragile flexibility of the source-code-level expression of the algorithm) of having the compiler infer the parallelism versus the programmer manually encoding it, is that it won't infer it incorrectly due to human error, thus creating a bug. One of the main goals of compilers is to do the tedious and complex checking for correctness that humans don't naturally do perfectly. We need humans for the creativity of algorithms, not the pedantic, repetitive, mundane tasks and premature optimization.

shelby3 commented 7 years ago

If I understand correctly the stance of those (e.g. @keean in private communications) who would argue against the viability of my prior comment, they argue that such optimization is equivalent to a proof search which is an exponentially hard complexity problem, e.g. optimizing a bubble sort to a parallel merge sort. But afaics, I did not propose allowing the compiler to change the algorithm. I merely proposed for the programmer to express the invariants necessary to infer the parallelism inherent in the algorithm expressed. I've read that one of the Haskell compilers is already apparently very good at this, which is enabled by its strictness w.r.t. such invariants such as referential transparency and side-effects lifted to types (monads).

I agree that compilers which are expected to modify algorithms (e.g. from a bubble sort to a parallel merge sort) would be equivalent to an exponential search which is probably in one of the very intractable computational complexity classes. I am not yet clear if my proposal is also in an intractable computational class.

The argument that claims all (or nearly all) of the interesting parallelizable code can be put in libraries carefully hand-tuned by experts seems to potentially leave a lot of userland code stuck with noise hiding the underlying algorithm, i.e. all the premature optimization that comes from not adopting my proposal. Also it means that expertly written library code becomes that much less readable in open source. I don't like the Cathedral model of s/w development. I prefer open source, readability, simplicity, clarity, and not premature optimization by a small group of experts who can disappear or totally muck it up and no one knows how to fix it.

Also the point of having the compiler detect the parallelism (i.e. the independent code paths) is that it needs to be able to do that anyway in order to check that the concurrency the programmer would otherwise manually encode does not violate the (lack of) parallelism (i.e. required data race safety) in the algorithm (c.f. Rob Pike's misunderstanding of concurrency versus parallelism). Thus in order not to involve human error and bugs, we need to express the invariants to the compiler so the compiler knows which code paths are independent. Without invariants, the compiler can only safely assume that none of the code paths are independent and thus there is no parallelizability in the code. The reason concurrency programming is so difficult (i.e. buggy) is because humans can't reason perfectly about all the invariants.

Rust's strategy is to force us to have a total order of (provably safe from deadlocks and livelocks) compile-time “locks” (borrows) on all resources in the program including all allocated objects. But it seems to be uneconomic to force such pedantic tsuris of synchronization on all code, when the highly parallelizable code may only be a small portion of the program. Instead I have proposed that we only declare the invariants we need to, for the cases where we need parallelization.

keean commented 7 years ago

Parallelising an algorithm is equivalent to changing the algorithm. Modern super-scalar out-of-order CPUs already exploit the maximum inherent parallelism in code, and they do so at runtime, where there is more information about scheduling and completion times than at compile time. A compiler cannot beat the CPU at this kind of optimisation, hence the failure of VLIW CPUs in general computing. Again, writing the Itanium compiler to achieve the same results at compile time as an x86-64 achieves at runtime turned out to be a hard problem.

shelby3 commented 7 years ago

optimizing a bubble sort to a parallel merge sort

That is parallelizing the task of sorting, not inferring the inherent parallelization in the original bubble sort algorithm (if any).

Parallelising an algorithm is equivalent to changing the algorithm.

In the example code below i and j can both be inherently read in parallel, as long as the compiler knows that the two reads are independent of each other. The algorithm is not changed by running those two reads concurrently. The invariants (which, if declared, ensure the two reads are independent, enabling the inherent parallelization) are part of the expression of the algorithm.

var i = ... // read string from a file
var j = ... // read string from the Internet

return i.concat(j)

Modern super-scalar out-of-order CPUs already exploit the maximum inherent parallelism in code

They can't optimize the inherent parallelizability in the example above, because they don't know that the two reads are independent. Those invariants aren't expressed at the low level of machine code.

You are conflating low and high-level parallelism.

keean commented 7 years ago

I see what you are talking about. There is no guarantee that the file and the network are not connected. Reading the file may change the result returned from the network. Such optimisations are not safe in general.

The programmer needs to express whether the IO operations need to be sequential or parallel. In the above example they are inherently sequential. Parallelising may break the programmer's assumptions.

There needs to be a way for the programmer to say it is safe to parallelise. Something like:

(i, j) = parallel(read file..., read network)

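
In today's JavaScript that `parallel(...)` intent maps naturally onto `Promise.all` (a hypothetical sketch; `readFile` and `readNetwork` are illustrative stand-ins, not real APIs):

```javascript
// Hypothetical stand-ins for the two IO operations in the sketch above.
const readFile = () =>
  new Promise(res => setTimeout(() => res("file data"), 10));
const readNetwork = () =>
  new Promise(res => setTimeout(() => res("network data"), 10));

// "must be parallel": both operations start before either completes.
Promise.all([readFile(), readNetwork()])
  .then(([i, j]) => console.log(i, j));

// "must be sequential": the network read starts only after the file
// read finishes, preserving any ordering dependency between them.
readFile().then(i => readNetwork().then(j => console.log(i, j)));
```

The point of a dedicated `parallel` construct would be to make the programmer's "these are order-independent" declaration explicit rather than implicit in which combinator happened to be used.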
shelby3 commented 7 years ago

There is no guarantee that the file and the network are not connected. Reading the file may change the result returned from the network.

If the reading is declared pure, then it can't have side-effects. This should obviously be enforced.

Such optimisations are not safe in general.

I am not proposing it in general. I am proposing it only when the invariants ensure the independence of the code paths.

The programmer needs to express whether the IO operations need to be sequential or parallel.

Disagree. It seems you are missing the entire justification for (thrust of) my proposal. Now you are trusting the programmer to know all the invariants of every possible ramification in the universe, i.e. back to JavaScript's and Go's model, and manually encoding the concurrency details instead of the compiler taking care of that detailed tsuris and noise. This causes bugs and makes code less comprehensible (obscures the algorithm). If instead we build up invariants via hierarchies, the same as we do for coding in general (i.e. we call functions which call other functions; e.g. a purity declaration is transitive to functions we as the programmer aren't even aware of), then the compiler does that reasoning for us. We generally don't want programmers to have to do non-local reasoning, as that is not modularity.

Again, I am not yet sure if my proposal is computationally tractable. But it is also not necessary that the compiler infer every parallelization opportunity. The programmer can still fall back to manual encoding when the compiler fails to infer. It can be a compiler technology that improves over time.

I want to make some major innovations in programming language, if possible.

keean commented 7 years ago

The system does not know the invariants. The file that you read could trigger a network service to change the data downloaded from the network, such that changing the order from sequential to parallel will change the results output by the program. In general the programmer is the only thing with knowledge of the invariants, unless they are somehow expressed in the code.

shelby3 commented 7 years ago

The system does not know the invariants.

We don't refute a potentially correct proposal by refuting one example where the invariants could not be ensured (thus could not be declared and wouldn't be inferred). It is disingenuous (noisy) discussion not to admit there are examples that can be ensured.

It does in some other examples. I had alluded to an example in one of my prior comments, where the compiler knows from the declared invariants that reading a file doesn't modify any objects (stored in RAM) in the code paths.

The file that you read could trigger a network service to change the data downloaded from the network

Not if the invariants ensure that the network resource does not depend on any such network services. My example presumed the invariants were ensured.

Tangentially, you are making the case for why in general concurrency is indeterminate and there are in general no independent code paths, and thus in general concurrency is always buggy (and can't be fixed), because the entropy of the universe is unbounded.

But my proposal is about limiting inference of inherent parallelism to ensured invariants.

In general the programmer is the only thing with knowledge of the invariants

If the invariants can't be ensured, the programmer manually writing the concurrent structures isn't going to make it any less buggy. You've gained nothing except all the negative aspects (I've already enumerated) of not inferring.

If the compiler can’t check the data race safety of the parallelism the programmer explicitly requests, then the same bugs can happen. I guess your point is that it’s better to at least supplement what the compiler can check with the programmer’s explicit intent to have a better chance of avoiding data race safety bugs for that which the compiler can’t check.

Edit: you can make the argument that the programmer can declare that his algorithm doesn't care if the result of j is nondeterministic w.r.t. i and the unbounded entropy of the read operations. Thus your manual encoding becomes useful in that case, where the invariants required for deterministic parallelism can't be ensured. I wasn't arguing against still allowing that manual encoding when necessary (to express unbounded nondeterminism or to handle deterministic cases the compiler isn't smart enough to infer).

keean commented 7 years ago

I think there are three scenarios, these must be sequential, these must be parallel, and "don't care". The problem with the above example is it is ambiguous between "must be sequential" and "don't care". Programmers used to writing sequential code will assume sequentiality. We need a way to distinguish don't care, and sequential.

We both know this is not a problem for pure code, as beta reduction order is non-deterministic, but we cannot do IO from pure code (exactly because it makes evaluation order visible to external IO).

keean commented 7 years ago

In general IO assumes sequentiality; consider:

print("Shall we play a game?")
let game = getString()
shelby3 commented 7 years ago

In general IO assumes sequentiality; consider:

print("Shall we play a game?")
let game = getString()

You are writing to stdout and reading from stdin, thus of course those invariants tell the compiler there is no inherent parallelism. I don't see any inconsistency between my proposal and your example.

If instead you were copying that UI string to a dialog box control and then reading from an input field of that dialog, then you'd have an inherently sequential code path, expressed by the dependencies on the access to the dialog object resource.

shelby3 commented 7 years ago

The problem with the above example is it is ambiguous between "must be sequential" and "don't care".

There is no ambiguity because if there are no dependencies, then it doesn't matter which order i and j are instantiated.

I understand what you are thinking is the programmer may have some semantics in mind which are not expressed unambiguously in the code itself. Can you think of an example?

keean commented 7 years ago

It only doesn't matter because you assume the file and network operation are independent. Let's assume they are dependent, so reading the file changes the result from the network service. How do I express that the order matters, even though the operations are on different 'channels'?

keean commented 7 years ago

Here's a more practical example, first write a file to a USB stick, and then display a message that it is safe to eject the USB device. If you arbitrarily parallelise IO then the message will appear whilst the file is still writing, and you will remove the device resulting in a corrupted data file.

shelby3 commented 7 years ago

Here's a more practical example, first write a file to a USB stick, and then display a message that it is safe to eject the USB device.

Correct code would not display that message until testing that the return value from the file write operation did not indicate an error.

You could argue that correct code would issue an exception on error (instead of a return value) which would bypass the code that follows the file write call. Exceptions thus make it ambiguous whether the code following the file write call is sequentially dependent on the successful execution of the write or not. This is a problem of expression due to void functions. And this would need to be dealt with somehow.

Perhaps the most general construct would be to allow void functions to be assigned to an object of Void type. Dependent code paths could condition on if( object exists ). Or probably better, just assign the function result to an object of type Promise (or infer the desired type by calling .then() on the function's return value), and attach the dependent code using the then() method.

Thus semantically dependent sequential code paths become explicitly more verbose (where they would not have been) so that independent code paths can be inferred.
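This idea can be sketched in today's JavaScript. The names here (writeFile, showMessage) are hypothetical stand-ins, not proposed language features; the point is that chaining with .then() makes the sequential dependency explicit while unchained code is visibly independent:

```javascript
const log = [];

function writeFile(name) {
  // Pretend this is a slow void IO operation running concurrently.
  return new Promise(resolve => setTimeout(() => {
    log.push(`wrote ${name}`);
    resolve();
  }, 10));
}

function showMessage(msg) {
  log.push(msg);
}

// Dependent code is made explicitly sequential with .then()...
const done = writeFile("xyz").then(() => showMessage("safe to eject"));

// ...while code not chained on `done` is visibly independent and may run
// before the write completes.
showMessage("independent work");

done.then(() => console.log(log));
// log ends up as ["independent work", "wrote xyz", "safe to eject"]
```

Note that the "safe to eject" message cannot appear before the write finishes, which addresses the USB-stick example above.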

shelby3 commented 7 years ago

It only doesn't matter because you assume the file and network operation are independent.

I think it is correct that in general we can't ever ensure independence between IO operations external to the memory controlled by the program.

Just because it was not a carefully chosen example of my proposal doesn't mean my proposal is broken.

But for example if the program has opened two files with exclusive read/write access, then it can ensure that no writes will occur that the program doesn't know about. If the program has opened the network resource with exclusive access, it can be sure the network resource will not be written to while it holds that exclusive access. So in that way my example was plausible.

How do I express that the order matters, even though the operations are on different 'channels'?

We probably want to assume order always matters between IO operations external to the memory controlled by the program, except where we can ensure exclusive access to the IO resources. So your question should, I think, instead be: how do we express that order doesn't matter for distinct channels? In other words, how do we declare that a pair of channels is known not to interact?

Perhaps we shouldn't even support such a declaration, since it seems so fragile a guarantee. I think in such cases the programmer can just manually encode the concurrency for the presumed parallelism. However, as I stated, the exclusive access case is a valid one, but those invariants are not on the function called but on the holistic context. I need to think about how such invariants could be expressed.

keean commented 7 years ago

Just because it was a poor example of my proposal, doesn't mean my proposal is broken.

I agree. I like the JavaScript approach, combined with Promises, but the stack cost is high. By introducing co-routines/co-functions, the stack cost can be reduced to be similar to Go's green threads, it can be based on an N:M concurrency model, and promises are not needed. The JavaScript implementation would by necessity be single threaded, but it should not be limited to only JavaScript as a backend, and of course we can use 'WebWorkers'.

shelby3 commented 7 years ago

I am thinking my idea for parallelism can for the inferred cases increase clarity of source code and provide compiler verification of correctness. Those seem like major wins. But I don't know yet how tractable the inference might be.

And hypothetically, the compiler could optimize what it is employing for scheduling tasks, e.g. either green threads or Promises depending on each use case, for I had explained that Promises are more efficient in some use cases, i.e. I think we need both in the same program.

keean commented 7 years ago

A cofunction models a promise: it behaves like a blocking API, and does not unwind the stack back to the main event loop every time. Unwinding is an implementation detail; it should be possible to implement cofunctions on top of promises and green threads.

shelby3 commented 7 years ago

I agree. I like the JavaScript approach, combined with Promises, but the stack cost is high. By introducing co-routines/co-functions, the stack cost can be reduced to be similar to Go's green threads, it can be based on an N:M concurrency model, and promises are not needed.

My proposal is about making the expression of the parallelism agnostic to task scheduling concepts employed to execute them concurrently. Parallelizable code then looks very similar to any other code, because it isn't littered with premature optimization of embedding a particular scheduling concept in the source code.

The key insight is the separation-of-concerns, i.e. that parallelism is due to independence invariants. The invariants are a separate concern from the non-parallel expression of the algorithm, because in many cases the algorithm doesn't care if it is executed in parallel or not (and the invariants inform us of any inherent parallelism that can be optionally exploited).

The compiler and any runtime will be free to optimize and experiment with different scheduling concepts without the programmer needing to change the source code.

I am proposing to think about the issue from a more high-level conceptualization.

Higher-level programming languages were much more productive. I think we need higher-level parallelism now too; it is the next step of the evolution.

keean commented 7 years ago

I don't think this approach can work in general, and even if it can it will lead to a complex compiler and very inconsistent runtimes (hidden magic).

My preference is to make the default for imperative code sequential execution, for pure functional code "don't care", and to have some special syntax for explicit parallelism.

If co-functions result in not needing promises, because a promise chain looks like ordinary sequential imperative code, then what is needed is the equivalent of 'Promise.all': effectively a way to run an array of commands in parallel and wait for them all to finish. If normal tuples imply left-to-right evaluation (to preserve side effect order in function calls), then all we need is a 'special' tuple for parallel evaluation. Maybe [ ] could be used, or perhaps a digraph like (| |) to indicate parallel evaluation. This could be applied stand-alone or in any function application, so f(|x, y|) would evaluate x and y in parallel instead of left-to-right.
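For illustration, here is one plausible way the (| |) form could desugar in a JavaScript backend. This is only a sketch of the idea, not a specified compilation strategy; g, h, and f are illustrative:

```javascript
// Hypothetical desugaring: f(|g(), h()|) could compile to a Promise.all
// over the arguments, while f(g(), h()) keeps left-to-right order.
async function g() { return 1; }
async function h() { return 2; }
function f(x, y) { return x + y; }

async function main() {
  // Sequential form, f(g(), h()): left-to-right evaluation.
  const seq = f(await g(), await h());

  // Parallel form, f(|g(), h()|): both arguments evaluated concurrently.
  const [x, y] = await Promise.all([g(), h()]);
  const par = f(x, y);

  return { seq, par }; // both are 3; only the evaluation order differs
}

main().then(r => console.log(r));
```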

shelby3 commented 7 years ago

I don't think this approach can work in general

Because?

I specifically said the compiler didn't need to infer every case. The programmer can manually encode when the compiler is too dumb to (while retaining the original version for if the compiler ever gets smarter).

and even if it can it will lead to a complex compiler and very inconsistent runtimes (hidden magic).

Like Java's Hotspot. A bad improvement to have, isn't it?

... what is needed is the equivalent of 'Promise.all' ...

Parallelism doesn't in general fit into one nice consistent concurrent structure.

keean commented 7 years ago

Hotspot is still slower than 'C', and has poor performance with parallel code (due to large memory requirements). Hotspot has not gotten "ever smarter"; that just has not happened in reality, despite the wishes of language designers. I want a language that works in the real world today, not some fantasy future that may never happen.

There are many complete models of parallelism that are simple like Communicating Sequential Processes. So parallelism can be fitted into one convenient structure. Many languages only provide 'fork' and 'join'. Above I provided a use case that breaks the optimisation assumption that you made - can you provide a use case where the simple model of parallelism I presented is not enough? We should always try to use the simplest model that provides what we need.

shelby3 commented 7 years ago

Above I provided a use case that breaks the optimisation assumption that you made

Sorry, but that is a mischaracterization of the discussion. Please, that doesn't help us advance our work.

Hotspot is ...

Hotspot doesn't have all the invariants it needs to optimize high-level parallelism. I proposed declaring more invariants. I provided Hotspot as an example of how programmers don't have to do every optimization prematurely.

There are many complete models of parallelism that are simple like Communicating Sequential Processes.

You have a strange conceptualization of "simple". Hard-coding premature optimization of structure is not simple, because the entropy (i.e. surprise) of real systems (even programming projects) is not bounded. The model may be built from simple components, but its expression in code is far from simple.

keean commented 7 years ago

How can I put it differently: The invariants you want will surprise and confuse programmers used to the sequentiality of imperative code, and lead to a lot of bugs. Such assumptions are only safe for pure functions.

Hotspot is an example of how if you try and put too much complexity into the compiler it becomes a 20 year project for large organisations like Sun and Oracle. It is a white whale, and I don't want to be Ahab.

I think the correct method for dealing with complexity and optimisation is to leave it to the library authors. If you make the low level tools available in the language then experts can use the language to write efficient tools for programmers to use as libraries. This keeps the compiler simple, and the performance predictable.

I am sure you know that every optimisation for one case is a pessimisation for another. There are some sets of numbers for which a bubble sort (less than 10 numbers for example) is faster than quicksort. Optimisations tend to be domain-specific, that is, the optimisations to write a fast matrix library look very different from the optimisations for a monte-carlo simulator, or a graphics library. This is why no high level language has succeeded in beating simple 'C'. The 'C' author is always able to exploit their knowledge of the problem domain to write code superior to that produced by applying optimisations in the compilation of high level code.

I propose this as a general hypothesis - "No optimising compiler for a high level language can ever have optimal program performance in all problem domains" - so therefore this approach is doomed to fail for a general purpose language. In summary, invariants can help compilers produce faster code, but they will never beat a human given precise semantics. What no language has done (that I can think of) is allow the experts to write precise code and the general users to write high level code in the same language. C/C++ is great for the low level detail, Haskell is great for high-level; what we need is a language that bridges the gap.

Regarding CSP, I mean that the model is simple (very few primitives) yet you can build any complex system using it. This is the very definition of expressive power.

The model may be built from simple components, but its expression in code is far from simple.

Yes, this is exactly what we want. We do not want a complex sprawling mess of primitives; building a complex system on a complex model is much more prone to error than building it on a simple model.
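As a rough illustration of how few primitives CSP needs, here is a minimal, simplified channel sketch on top of Promises. Channel, producer, and consumer are illustrative names only, not a proposed design:

```javascript
// A CSP-style rendezvous channel: send() completes only once a receiver
// takes the value, and receive() completes only once a sender provides one.
class Channel {
  constructor() { this.senders = []; this.receivers = []; }
  send(value) {
    return new Promise(resolve => {
      if (this.receivers.length) { this.receivers.shift()(value); resolve(); }
      else this.senders.push({ value, done: resolve });
    });
  }
  receive() {
    return new Promise(resolve => {
      if (this.senders.length) {
        const s = this.senders.shift();
        s.done();          // unblock the waiting sender
        resolve(s.value);
      } else this.receivers.push(resolve);
    });
  }
}

// Two "processes" communicating only through the channel.
async function producer(ch) { for (const v of [1, 2, 3]) await ch.send(v); }
async function consumer(ch, out) { for (let i = 0; i < 3; i++) out.push(await ch.receive()); }

const ch = new Channel(), out = [];
Promise.all([producer(ch), consumer(ch, out)]).then(() => console.log(out)); // [1, 2, 3]
```

Two primitives (send and receive) are enough to express the synchronization; everything else is ordinary sequential code.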

shelby3 commented 7 years ago

@shelby3 wrote:

Perhaps the most general construct would be to allow void functions to be assigned to an object of Void type. Dependent code paths could condition on if( object exists ). Or probably better is just assign the function to an object of type Promise (or infer the type desired by call .then() on the function return value), and attach the dependent code using the then() method.

I think I like better the ability to turn exceptions into Option types by assignment from try:

let result = try function_that_throws()
if (result != None) // result is the thrown error
else                // non-parallelizable (i.e. synchronous) code
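A plain JavaScript approximation of this try-to-Option idea might look like the following sketch, where None and tryCall are hypothetical names chosen for illustration:

```javascript
// None is modelled as a unique sentinel; tryCall returns either the value
// or the caught error, so dependent code can branch on it explicitly.
const None = Symbol("None");

function tryCall(fn, ...args) {
  try { return { error: None, value: fn(...args) }; }
  catch (e) { return { error: e, value: None }; }
}

const ok = tryCall(() => 42);
const bad = tryCall(() => { throw new Error("boom"); });

console.log(ok.value);          // 42
console.log(bad.error.message); // "boom"
```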
shelby3 commented 7 years ago

Hotspot is an example of how if you try and put too much complexity into the compiler it becomes a 20 year project for large organisations like Sun and Oracle. It is a white whale, and I don't want to be Ahab. I think the correct method for dealing with complexity and optimisation is to leave it to the library authors.

While I agree decentralized solutions are more antifragile, I think we are doing too much handwaving and aren't being specific enough about the differences between what Hotspot is optimizing and what I am proposing.

This is why no high level language has succeeded in beating simple 'C'.

To beat C at which goal? So JavaScript and Python have no advantages over C for any use case?

Low-level coding has its tradeoffs and that is why it is not best to do premature optimization and instead profile the code, then rewrite only the critical performance portions in a low-level language.

Parallelization [concurrency] is not just the difference between faster and slower execution; it is often also the difference between responsive and unresponsive, because it allows independent tasks to respond independently. The compiler + runtime could in theory optimize the tradeoffs (even dynamically at runtime). Why should every programmer have to reinvent such optimization if we can build it in as standard?

We do not want a complex sprawling mess of primitives

Aren't you building a strawman alternative? Did I propose such as the alternative?

A proliferation of library variants for each premature optimization case is also a form of a complex sprawling "mess".

shelby3 commented 7 years ago

@keean you want to build systems inductively from small substructures up in source code to more complex superstructures. I am referring to the dual of building systems co-inductively from small superstructures in source code down to more complex substructures in compiled implementation.

I am not saying either is inherently better than the other. There are tradeoffs.

keean commented 7 years ago

I agree with the principle of what you are saying, and I am glad you mentioned 'antifragile'. I think a co-inductive approach would work too. For example I want to create a matrix library, and I do so using garbage collection and a simple sequential approach. I can come back and optimise memory usage and parallelise the library later. As such it is the abstraction of the interface for the library that is important, not necessarily any invariants inside the library.

If you have a great algorithm for parallelising code then maybe there is something worth looking at? Hand waving about magical compilers that parallelise code using unicorns is not going to get us anywhere :-)

shelby3 commented 7 years ago

Agreed that abstraction is the co-inductive approach.

If we can express an algorithm in its simplest abstraction and then prove that other variants of the algorithm are equivalent, then we have the analogy of compile-time type checking. I am only proposing to abstract the parallelism invariants of the algorithm and prove those are equivalent for variants of the algorithm. Even that is useful without an automated way to produce the variants, because we can express our abstraction succinctly and then verify that all implementations of the abstraction are correct.

Then as you say, it is more useful if we can automate optimum variants of the abstracted invariants. I was thinking about the example of walking a tree graph and doing some work at each node and then collecting the results into a duplicate tree graph. I haven't written down this in code, but I suppose a recursion (hopefully tail recursive) approach is the most elegant expression of the algorithm. If the parallelism invariants are such that each node in the tree can be computed independently from the other, seems there should be a deterministic translation to a parallelized variant of that same algorithm. But I haven't actually tried to code this yet. Any thoughts?

shelby3 commented 7 years ago
f(node, output, callback) =>
   output.data = callback(node.data)
   if (node.left)
      output.left = new Node
      f(node.left, output.left, callback)
   if (node.right)
      output.right = new Node
      f(node.right, output.right, callback)

So the compiler can deduce that when only one of node.left or node.right is not null, the function is tail recursive, and can make that optimization instead of requiring a while loop to be manually encoded into the function.

Also the compiler can deduce that the two branches of the tree can be run in parallel if callback is pure and the reference to node is read-only (which the compiler can see is not written in the function). Strictly speaking we'd like to know we have an exclusive reference to output so it won't be written by external code while our function is running (such as if callback is a blocking function), but that is really an invariant at the discretion of the caller as it would be the case even if we didn't parallelize f.
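Assuming callback is pure (or at least safe to run concurrently), the parallelized variant the compiler might derive from f could look roughly like this JavaScript sketch; the shape of Node is assumed for illustration:

```javascript
// Node is assumed to be a simple record type for this sketch.
function Node() { this.data = null; this.left = null; this.right = null; }

async function f(node, output, callback) {
  output.data = await callback(node.data);
  const branches = [];
  if (node.left) {
    output.left = new Node();
    branches.push(f(node.left, output.left, callback));
  }
  if (node.right) {
    output.right = new Node();
    branches.push(f(node.right, output.right, callback));
  }
  // Both subtree copies proceed concurrently, as the invariants allow.
  await Promise.all(branches);
}
```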

keean commented 7 years ago

So traversing over data structures is made generic by iterators and coordinate structures. A binary tree has a bifurcating iterator which looks like this: "left_successor, right_successor". Using this iterator we can build three traversal algorithms (in-order, preorder, postorder), and these can be generalised into a single higher-order traversal algorithm where the order is determined by a function parameter.

If we limit ourselves to linearisable data structures (so order of node access is irrelevant to the algorithm) we end up with a generalised 'fold' algorithm.

This all leads to two simple algorithms that we can use and parallelise: "map" and "reduce". (You may have heard Google talking about their map/reduce architecture.) "Reduce" is a fold, and "map" is the standard map of functional programming.

So "map" takes a container and applies a function to every element independently resulting in a new container of the same shape. Reduce calculates some property uniformly over the whole of a container, although it can be sensitive to the traversal order, many reductions are not, however the classic "fold" is ordered, we would have to impose invariants on reduction to make it not depend on traversal order (reduction functions must be binary, commutative, associate for example), given that reducing '+' over a collection would sum the elements and could be parallelise into a binary tree of add operations.

As you can see, 'map'able functions are about the only things that are trivial to parallelise, and even 'reductions' become difficult to enforce the invariants for in a programming language, because we must prove a binary operator to be both associative and commutative. If we rely on the programmer to use it correctly, then reducing '-' over a collection is obviously going to produce random results (because the compiler will change the order depending on available resources).
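For example, a reduction whose operator is known to be associative (and, for arbitrary scheduling, commutative) can be rewritten as a balanced tree of combines. A rough JavaScript sketch, where parReduce is an illustrative name:

```javascript
// Split the collection, reduce the halves concurrently, then combine.
// Only valid when `op` is associative/commutative; reducing with '-'
// here would indeed give order-dependent results, as noted above.
async function parReduce(xs, op, unit) {
  if (xs.length === 0) return unit;
  if (xs.length === 1) return xs[0];
  const mid = xs.length >> 1;
  const [a, b] = await Promise.all([
    parReduce(xs.slice(0, mid), op, unit),
    parReduce(xs.slice(mid), op, unit),
  ]);
  return op(a, b);
}

parReduce([1, 2, 3, 4], (a, b) => a + b, 0).then(sum => console.log(sum)); // 10
```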

shelby3 commented 7 years ago

As you can see 'map'able functions are about the only things that are trivial to parallelise

Not true. There are others that are trivial just from the compiler observing dependencies.

Perhaps what you mean is that most parallelism opportunities will not be trivial? If so, do you have any basis to think that?

I am only aiming to have the compiler handle trivial cases to start with. I am mostly aiming to eliminate (and have the compiler infer) the explicit repetitive yield and Promises + generators boilerplate, which isn't even a case of parallelism (rather just concurrency, and not blocking JavaScript's event loop). So minimally I just want to give each function a type which indicates whether it can and should be executed asynchronously. Goroutines do this implicitly, because blocking operations automatically cooperatively multitask across other goroutines, but an analog to Promise.all doesn't exist.

even 'reductions' become difficult to enforce the invariants in a programming language, because we must prove a binary operator to be both associative and commutative

fold is implicitly sequential unless we prove those additional invariants. I don't see a problem.

keean commented 7 years ago

The compiler optimisation passes happen after type inference and before output generation. These passes usually consist of matching part of the AST and rewriting it into a new AST (tree rewriting). Effectively an optimisation pass gets the program AST as its input, and will have to do the necessary checks to ensure the semantic meaning of the program is not changed by the optimisation before doing the tree rewrite.

So it is easy to add these kind of passes to the compiler, the harder task is to understand under what conditions any given optimisation is valid, and write the code to check those conditions before rewriting the AST.

It would be great to see a sample tree-rewrite and associated checks for parallelising something like a "for(;;)" loop.

shelby3 commented 7 years ago

As I said, my initial goals are simply to abstract away trivial boilerplate which the compiler can detect because the function is returning a Promise and the value type of the Promise is being assigned. So no need to write yield and use generators or async/await. The compiler can insert that noisy boilerplate. Also I want the compiler to infer some simple cases of running two non-blocking operations in parallel. The more complex cases I can leave to better algorithmic thinkers than myself as a future improvement. I can always hand-code any other parallelism I need which the compiler can't yet automatically infer.

I am trying to abstract away dependence on Promises, so that for example my code is generalized to work with optimization of using green threads for more efficiency in those cases. And also so my code is less noisy. And trying to abstract away JavaScript's warts and annoying idiosyncrasies.
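Concretely, the boilerplate the transpiler could insert is essentially the standard async/await emulation over ES6 generators. A sketch, where run and readConfig are hypothetical names:

```javascript
// run() drives a generator that yields Promises, resuming it with each
// resolved value: the usual async/await emulation over ES6 generators.
function run(genFn, ...args) {
  const gen = genFn(...args);
  return new Promise((resolve, reject) => {
    const step = r => r.done
      ? resolve(r.value)
      : Promise.resolve(r.value).then(
          v => step(gen.next(v)),
          e => step(gen.throw(e)));
    try { step(gen.next()); } catch (e) { reject(e); }
  });
}

// Hypothetical async operation standing in for real IO.
function readConfig() {
  return Promise.resolve({ port: 8080 });
}

// What the programmer would write as plain `let cfg = readConfig()` could
// be compiled into a generator with the yields inserted automatically:
run(function* () {
  const cfg = yield readConfig(); // compiler-inserted yield
  return cfg.port;
}).then(port => console.log(port)); // 8080
```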

Also please be aware that my immediate goal is that anything I settle on for the time being must be something I can implement in a transpiler in about a month or so. I want to K.I.S.S. so I can use this in production code immediately if possible. That may not be realistic, but I am taking a shot at it. And my patience is short: I will abandon it if it takes too long.

keean commented 7 years ago

I don't see how you can abstract away the yields; they are doing something semantically different. Yield is effectively sending a response to the sender of the initial message and then waiting for the next message in the protocol. If you abstract it away you totally change the semantics. Consider:

f(x) =>
   f(yield(x + 1))

If you call this as f(1) you get back (g, 2), where g is the continuation and 2 is the result. Yes, you could inline the yield as you would inline a function, with the usual time/space tradeoffs (and remembering that cache locality has a big performance impact, so too much inlining and unrolling can be bad).

As you can see, at this point yield is about flow control, not parallelism, there is only one thread so far, and we always wait for operations to complete.
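Rendered as an ES6 generator, the example behaves like this, with the generator object playing the role of the continuation g:

```javascript
// keean's f as an ES6 generator. Each next(v) resumes the paused
// computation with v, so the generator object is the continuation.
function* f(x) {
  yield* f(yield x + 1); // yield the result, then recurse on the resume value
}

const g = f(1);          // g is the continuation; nothing has run yet
console.log(g.next());   // { value: 2, done: false }, the "2" in (g, 2)
console.log(g.next(10)); // { value: 11, done: false }, effectively f(10)
```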

What is missing is a way to hand off a task to another thread or external hardware, and continue to do other things until the result is returned. We want this to look like a normal blocking call, for example:

let file = readFile("xyz")

Inside readFile we want to save a continuation and send a message to another thread. However, sending a message can look like a normal function call which yields the result, except that threads cannot ever be referentially transparent (because their state can be changed by other threads we cannot see).

If we think about low level primitives then continuations are one option (but probably too powerful, like goto). You also need a way to spawn threads, and send messages between threads. You also need to be aware of thread overhead, and that messaging threads takes 100s of clock cycles, so in many cases doing things sequentially is faster and more efficient. When running things in parallel threads you need a way to manage state update. Parallelising code by dependency analysis (which is like a dataflow language) is different from asynchronicity, where a background thread does some processing, like reading data from a disk, where one thread can pause whist other things run. This code needs to be re-entrant and capable of being run in different thread contexts without overwriting it's own state. There are really three different problems here that need to be handled in different ways. Pure code without state can be used in several contexts at once without problems. Dealing with state and multiple contexts is probably the hardest problem for a programmer.