Universal VM - virtualised operating systems

void4 commented 7 years ago

This is a disorganized collection of notes. I hope I'll soon be able to formulate what I'm looking for.

The objective: Find or create a distributed general purpose computing system

Subobjective: Let it work in untrusted environments

Main question: Is it sensible to build on available primitives or is a new language based system required?

Why are current systems not good enough?

Hardware or operating system level virtualization and containerization have several limitations:

CPU or OS architecture dependent (limited cross-platform capabilities)
high setup and runtime overhead even for simple computations, recursive hosting
isolation of subsystems still needs another sandboxing layer
limited resource control within containers
programmatic interaction through system boundaries difficult

Advantages of a language based system:

Assuming full program state reconstruction and standardized program state representation (VM):

tasklet suspension, serialization and transfer, deserialization and resumption on another machine
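
A minimal sketch of the suspend/transfer/resume idea, assuming a toy stack machine whose entire state (code, program counter, stack) is plain data. The `Tasklet` format, the opcodes and the JSON wire encoding are all hypothetical and chosen only for illustration:

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Tasklet:
    """Complete program state of a toy stack machine (hypothetical format)."""
    code: list                      # list of [opcode, argument] pairs
    pc: int = 0                     # index of the next instruction
    stack: list = field(default_factory=list)
    halted: bool = False


def step(t):
    """Execute exactly one instruction."""
    op, arg = t.code[t.pc]
    t.pc += 1
    if op == "push":
        t.stack.append(arg)
    elif op == "add":
        b, a = t.stack.pop(), t.stack.pop()
        t.stack.append(a + b)
    elif op == "mul":
        b, a = t.stack.pop(), t.stack.pop()
        t.stack.append(a * b)
    elif op == "halt":
        t.halted = True
    else:
        raise ValueError(f"unknown opcode: {op}")


def run(t, max_steps):
    """Run until the tasklet halts or max_steps instructions were executed."""
    executed = 0
    while not t.halted and executed < max_steps:
        step(t)
        executed += 1
    return executed


def suspend(t):
    """Serialize the full state to a JSON string that can be sent anywhere."""
    return json.dumps(asdict(t), sort_keys=True)


def resume(blob):
    """Rebuild a tasklet from its serialized state."""
    return Tasklet(**json.loads(blob))


# (2 + 3) * 4, started on "machine A" and finished on "machine B"
program = [["push", 2], ["push", 3], ["add", None],
           ["push", 4], ["mul", None], ["halt", None]]
on_a = Tasklet(code=program)
run(on_a, max_steps=3)            # executes push, push, add
wire = suspend(on_a)              # this string is what would cross the network
on_b = resume(wire)
run(on_b, max_steps=100)
print(on_b.stack)                 # [20]
```

Because nothing lives outside the `Tasklet` record, resumption needs no cooperation from the host language runtime. That is the property a standardized state representation would have to provide.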

Assuming well encapsulated objects and consistent I/O interface, version compatibility:

transmission of very small tasks, instead of only large applications or entire VM images

Assuming determinism:

resource usage metering (instructions, memory, storage, network)
not only result, but intermediate program state verification and consensus possible
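
As a rough illustration of both points, the sketch below charges one unit of "gas" per step of a deterministic loop and records a hash of the whole state at regular intervals. The state layout, the gas model and the checkpoint interval are assumptions made for the example, not taken from any existing system:

```python
import hashlib
import json


class OutOfGas(Exception):
    """Raised when a computation exceeds its step budget."""


def state_digest(state):
    """Hash a canonical encoding of the whole program state."""
    encoded = json.dumps(state, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(encoded).hexdigest()


def run_metered(n, gas, checkpoint_every=2):
    """Sum 0..n-1, charging 1 gas per loop iteration.

    Returns the final state and the digests of intermediate states.
    """
    state = {"i": 0, "acc": 0, "gas_used": 0}
    checkpoints = []
    while state["i"] < n:
        if state["gas_used"] >= gas:
            raise OutOfGas(f"budget of {gas} steps exhausted")
        state["acc"] += state["i"]
        state["i"] += 1
        state["gas_used"] += 1
        if state["gas_used"] % checkpoint_every == 0:
            checkpoints.append(state_digest(state))
    return state, checkpoints


# Two independent executions agree on every intermediate digest,
# so a verifier can compare checkpoints instead of only the final result.
state_a, digests_a = run_metered(10, gas=100)
state_b, digests_b = run_metered(10, gas=100)
print(state_a["acc"], digests_a == digests_b)   # 45 True
```

The comparison only works because every step is deterministic. Any hidden input (clock, randomness, scheduling) would make the digests diverge even between honest parties.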

Assuming reflection:

fine grained resource control (memory, storage, processor time, network etc.)

Close languages

http://doc.cat-v.org/inferno/4th_edition/limbo_language/limbo http://erights.org/ https://hyperledger-fabric.readthedocs.io/en/latest/overview.html

Projects

http://www.distributed.net/Main_Page http://wry.me/~darius/software/idel/ https://github.com/primea/design https://boinc.berkeley.edu/ https://golem.network/ https://computes.io/ https://truebit.io/ http://iex.ec/

Reasoning

http://e-drexler.com/d/09/00/AgoricsPapers/agoricpapers.html

There exist several active projects wanting to extend the Web's capabilities to allow for better resource sharing and machine collaboration, tackling problems like data distribution (IPFS and many others), distributed computing (Golem, BOINC, ...) and consensus (Ethereum and other blockchains). In the end, they all end up defining their own data structures and languages to remain platform independent. If they were to succeed, current operating systems, web browsers and applications would have to implement their definitions and languages, which, without a common target language and metaprotocols, does not seem possible.

The common language of today's web seems to be very weak when compared to the possibilities of a distributed virtual machine. Imagine, for example, a sandboxable VM with the ability to serialize, distribute and execute running processes somewhere else in the network. Something similar might be possible to implement today, but only upon layers and layers of abstraction. The same issue will occur with most decentralization projects once they want to drive adoption and target other platforms.

But what language should be used? I've been looking for it for some time now but can't seem to find a very good candidate. I hope, but am not sure, that WebAssembly can provide a good solution to this.

If the Inferno operating system or similar concepts had succeeded, we might not have these limitations today. Current operating systems cannot provide developers with a cross-platform "thin waist" for application development, so the inner-platform effect of the web continues. Therefore, it also seems reasonable to think about building new operating systems from the bottom up, however ambitious that may be. It may not be possible to replace the incumbents, but bootstrapping a new architecture and platform independent language based system like Inferno could be the next step forward. There's also urbit.org, who try to build an OS on top of a lispy combinator interpreter: https://github.com/cgyarvin/urbit/blob/master/doc/book/1-nock.markdown. Unfortunately, many of their design choices are very esoteric and make the high level language extremely time consuming to understand.

Open problems

In general, I'm interested in both public and private deployments. I see these open problems in this space:

Choice of a secure general purpose computing platform

This might be JavaScript, WASM, or any other intermediate representation format or custom VM.

The Golem project uses another approach, similar to the BOINC-platform: Instead of relying on security on the language level, it uses OS sandboxing capabilities. Additionally, clients only download software that has been approved by client selected verifiers.

Task distribution, coordination and incentivization

This might be anything from a centralized server to a decentralized market built on smart contracts. In the decentralized case, scalability becomes difficult. One-to-one payment channels are functional as of right now, but for networks with thousands of users, we might have to wait for projects like the Raiden network.

I have no idea how (or even if) that will work out, but it is an extremely interesting topic. See http://swarm-gateways.net/bzz:/theswarm.eth/ for example.

Result correctness guarantees in untrusted environments

These might be social and/or economic incentives.

The Truebit system uses economically incentivized verifiers to ensure that tasks processed by untrusted parties are correct. It utilizes the fact that the Ethereum VM is fully deterministic and lets the verifiers compare the merkle roots of merkle trees over the entire program state at specific points in time:

https://people.cs.uchicago.edu/~teutsch/papers/truebit.pdf
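
The idea of comparing state roots can be sketched as follows. This is a simplified illustration, not the Truebit protocol itself: the memory layout, the `honest` and `faulty` step functions and the binary search are assumptions, and the search additionally assumes that two traces never agree again once they have diverged:

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root of a binary Merkle tree; an unpaired node is paired with itself."""
    if not leaves:
        return h(b"")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def state_root(memory):
    """Commit to the whole memory (a list of small non-negative ints)."""
    return merkle_root([cell.to_bytes(8, "big") for cell in memory])


def trace(step_fn, memory, steps):
    """State roots after 0, 1, ..., steps applications of step_fn."""
    mem = list(memory)
    roots = [state_root(mem)]
    for _ in range(steps):
        mem = step_fn(mem)
        roots.append(state_root(mem))
    return roots


def first_divergence(a, b):
    """Binary search for the first index where two traces disagree.

    Assumes a[0] == b[0], a[-1] != b[-1], and no re-convergence.
    """
    lo, hi = 0, len(a) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if a[mid] == b[mid]:
            lo = mid
        else:
            hi = mid
    return hi


def honest(mem):
    m = list(mem)
    m[0] += 1          # step counter
    m[1] += m[0]       # running sum
    return m


def faulty(mem):
    m = honest(mem)
    if m[0] == 5:      # one wrong result, at step 5 only
        m[1] += 1
    return m


solver = trace(faulty, [0, 0], steps=8)
verifier = trace(honest, [0, 0], steps=8)
print(first_divergence(solver, verifier))   # 5
```

Only the single disputed step then has to be re-executed by a referee, which is what makes verification cheap compared to re-running the whole task.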

BOINC compares the results of multiple participants.
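
Redundant computation of this kind can be sketched as a simple quorum check. The function below is a hypothetical illustration and not BOINC's actual validator interface; it assumes results are hashable and compared for exact equality:

```python
from collections import Counter


def validate_by_quorum(results, min_quorum):
    """Return the agreed result, or None if no value reaches the quorum.

    results maps a participant id to the result that participant returned.
    """
    if not results:
        return None
    value, count = Counter(results.values()).most_common(1)[0]
    return value if count >= min_quorum else None


reports = {"alice": 42, "bob": 42, "mallory": 41}
print(validate_by_quorum(reports, min_quorum=2))   # 42
print(validate_by_quorum(reports, min_quorum=3))   # None
```

Exact comparison only makes sense for deterministic tasks. Floating point results computed on different hardware would need a tolerance-based comparison instead.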

Further notes:

I'm not sure whether a purely language based system is desirable or even sufficient to solve these problems. Many of these tasks are typically in the realm of operating systems and few languages (most probably in the Smalltalk realm) are that dynamic and reflective.

The ability to suspend small tasks and resume them elsewhere might have further advantages, though few languages support this functionality. Especially in low level languages, this would have efficiency trade-offs, as reconstructing state is expensive. I love the idea of jumping tasklets, though.

I'm looking for a VM or language specification with the following properties:

runtime serializable first class continuations or process image
simple fully deterministic VM with the option of disabling side effects (sandboxing)

Optional: homoiconic
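
One way to read "disabling side effects" is that the VM itself is pure and every effect has to go through a table of host functions handed in by the embedder. A minimal sketch under that assumption (the opcodes, the `host` table and the `emit` function are hypothetical):

```python
class SideEffectDenied(Exception):
    """Raised when a program calls a host function it was not granted."""


def run(program, host=None):
    """Run a toy stack program.

    host maps names to callables; with no host table the VM is fully sandboxed.
    """
    host = host or {}
    stack = []
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "call":
            if arg not in host:
                raise SideEffectDenied(arg)
            stack.append(host[arg](stack.pop()))
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


pure = [("push", 1), ("push", 2), ("add", None)]
effectful = pure + [("call", "emit")]

log = []


def emit(value):
    log.append(value)   # the only side effect, granted explicitly
    return value


print(run(pure))                        # [3]
print(run(effectful, {"emit": emit}))   # [3]
print(log)                              # [3]
```

Running `effectful` without a host table raises `SideEffectDenied`, so the default is the sandboxed case and capabilities are opt-in.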

Another interesting property would be the ability to explicitly run the VM for a certain number of steps, as in Stackless Python (http://stackless.readthedocs.io/en/2.7-slp/library/stackless/pickling.html). But this seems to depend on the VM architecture and process model.
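
The stepping part can be approximated in standard Python with generators, where each `yield` marks one step. This sketch deliberately does not use the Stackless API; the `counter` tasklet and the `run_steps` scheduler are hypothetical names:

```python
from collections import deque


def counter(name, limit, log):
    """A tasklet that performs one unit of work per step."""
    for i in range(limit):
        log.append((name, i))
        yield                      # end of one step


def run_steps(tasklets, max_steps):
    """Round-robin scheduler that executes at most max_steps steps in total.

    Returns the number of steps executed and the tasklets still runnable.
    """
    queue = deque(tasklets)
    executed = 0
    while queue and executed < max_steps:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue               # finished tasklets are dropped
        executed += 1
        queue.append(task)
    return executed, list(queue)


log = []
tasks = [counter("a", 3, log), counter("b", 3, log)]
executed, remaining = run_steps(tasks, max_steps=4)
print(executed, log)   # 4 [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
executed, remaining = run_steps(remaining, max_steps=100)
print(executed, len(log), remaining)   # 2 6 []
```

Unlike Stackless tasklets, CPython generator objects cannot be pickled, so this covers bounded stepping but not serialization of the suspended state.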

http://canonical.org/~kragen/memory-models/ http://wiki.c2.com/?ThereAreExactlyThreeParadigms

https://github.com/ipfs/notes/issues/100

Everything I think about comes back to this, be it distributed applications, computing or consensus. The inner/second platform effect of browsers has already led to the development of web desktop environments but ECMAScript doesn't seem to be a great universal target for other languages.

asm.js
WebAssembly
DOM APIs
nodejs

http://xlr.sourceforge.net/concept/metrics.html

Lisp, Scheme http://iolanguage.org/ http://wry.me/~darius/software/idel/why.html https://www.destroyallsoftware.com/talks/the-birth-and-death-of-javascript

There is an inherent tradeoff between

simplicity/implementation complexity vs speed/expressibility/hardware "impedance match"
state reconstruction: native vs high level computing

Nock tries to circumvent this by defining "jets", subtrees of the AST which are executed in native code instead.
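
The jet idea can be illustrated with a toy expression interpreter: before interpreting a subtree, look it up in a table of known subtrees and, on a match, call a native function instead. This is only a sketch of the concept and not how Nock or Urbit implement it; the AST encoding and the `CUBE` jet are assumptions:

```python
def evaluate(expr, x=None, jets=None, stats=None):
    """Evaluate a toy AST.

    Nodes: int literal, the variable "x", ("add", a, b), ("mul", a, b),
    and ("apply", body, arg), which evaluates body with x bound to arg.
    """
    jets = jets or {}
    if stats is not None:
        stats["nodes"] = stats.get("nodes", 0) + 1   # count interpreted nodes
    if isinstance(expr, int):
        return expr
    if expr == "x":
        return x
    op = expr[0]
    if op == "apply":
        _, body, arg = expr
        value = evaluate(arg, x, jets, stats)
        if body in jets:                 # known subtree: run native code
            return jets[body](value)
        return evaluate(body, value, jets, stats)
    _, left, right = expr
    a = evaluate(left, x, jets, stats)
    b = evaluate(right, x, jets, stats)
    if op == "add":
        return a + b
    if op == "mul":
        return a * b
    raise ValueError(f"unknown node: {op}")


CUBE = ("mul", "x", ("mul", "x", "x"))       # x * (x * x), written in the toy language
program = ("apply", CUBE, ("add", 1, 2))

slow, fast = {}, {}
print(evaluate(program, stats=slow))                              # 27
print(evaluate(program, jets={CUBE: lambda v: v ** 3}, stats=fast))  # 27
print(slow["nodes"], fast["nodes"])                               # 9 4
```

The interpreted definition stays the specification. A jet is only valid if it returns exactly what interpreting the subtree would return, which is the part that has to be trusted or verified.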

explicit marking
internal (AST transformation) vs external optimization http://www.vitanuova.com/inferno/papers/dis.html
boundary between interpreter execution environment and program context!

http://norvig.com/lispy.html http://iolanguage.org/guide/guide.html#Objects http://xlr.sourceforge.net/concept/operators.html https://mitpress.mit.edu/sicp/full-text/book/book-Z-H-26.html#%_sec_4.1 https://en.wikipedia.org/wiki/Reification_(computer_science) http://elixir-lang.org/

http://wiki.erights.org/wiki/Main_Page

The Scheme programming language reifies continuations (approximately, the call stack).

Homoiconic languages reify the syntax of the language itself in the form of an abstract syntax tree, typically together with eval.

Many languages, such as Lisp, JavaScript, and Curl, provide an eval or evaluate procedure that effectively reifies the language interpreter.
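
To make the code-as-data point concrete, here is a tiny Lisp-like evaluator whose programs are nested lists. Python itself is not homoiconic; the embedded toy language is, and the choice of JSON as its serialization and the set of forms (`quote`, `eval`, `if`, arithmetic) are assumptions for the example:

```python
import json
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, env):
    """Evaluate a program represented as nested lists, strings and numbers."""
    if isinstance(expr, str):
        return env[expr]                       # variable lookup
    if isinstance(expr, (int, float)):
        return expr                            # literal
    head, *args = expr
    if head == "quote":
        return args[0]                         # code returned as data
    if head == "eval":
        return evaluate(evaluate(args[0], env), env)   # data run as code
    if head == "if":
        cond, then, alt = args
        return evaluate(then if evaluate(cond, env) else alt, env)
    return OPS[head](*(evaluate(a, env) for a in args))


program = ["*", "x", ["+", 1, 2]]
wire = json.dumps(program)            # the program is already a data structure
received = json.loads(wire)
print(evaluate(received, {"x": 7}))   # 21

# A program can build or carry another program and evaluate it.
print(evaluate(["eval", ["quote", ["+", 1, 2]]], {}))   # 3
```

No separate parser or code serialization format is needed, because the data format and the program format are the same thing.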

It is important to note that true first-class continuations do not save program data – unlike a process image – only the execution context.

https://en.wikipedia.org/wiki/Continuation-passing_style
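
The difference between a continuation and a process image can be shown with continuation-passing style. Python has no first-class continuations, so a closure stands in for one here; `scale_cps`, `store` and `saved` are hypothetical names for the example:

```python
store = {"rate": 2}      # ordinary program data, not part of any continuation
saved = []               # captured continuations


def scale_cps(x, k):
    """Multiply x by the current rate, then hand the result to continuation k."""
    def rest(v):
        # "the rest of the computation" from this point on
        return k(v * store["rate"])
    saved.append(rest)   # reify the continuation so it can be re-entered later
    return rest(x)


def identity(v):
    return v


first = scale_cps(10, identity)   # runs with rate 2
store["rate"] = 3                 # mutate program data after the capture
second = saved[0](10)             # re-enter the saved continuation
print(first, second)              # 20 30
```

Re-entering the continuation resumes the control flow but sees the current data, whereas restoring a process image would also bring back the old value of `store`. For migrating a task to another machine, the continuation alone is therefore not enough.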

http://wiki.erights.org/wiki/Future_research_topics

control flow vs parallel reduction
evaluation order, special forms
obsoleting containers

https://en.wikipedia.org/wiki/System_image#Process_images http://stackoverflow.com/questions/480083/what-is-a-lisp-image http://franz.com/support/documentation/current/doc/building-images.htm https://en.wikipedia.org/wiki/Lisp_machine

I'm also looking for operating systems programmed in languages based on a virtual machine.

"If you are building a world computer, you will also need a world operating system."

https://en.wikipedia.org/wiki/Language-based_system

This is because of two major trends: the increasing popularity and necessity of abstract virtual machines and the convergence of web (desktop) and standard operating system environments.

So far I've found, by architecture:

Register:

https://en.wikipedia.org/wiki/Inferno_(operating_system) https://bitbucket.org/inferno-os/inferno-os/

Stack:

Combinator interpreter:

Interplanetary OS: https://www.youtube.com/watch?v=Pjyo2uILcOs

https://en.wikipedia.org/wiki/Julia_(programming_language)

https://sandstorm.io

Regarding platform independence: http://doc.cat-v.org/plan_9/4th_edition/papers/9

The intended style of use is to run interactive applications such as the window system and text editor on the terminal and to run computation- or file-intensive applications on remote servers. Different windows may be running programs on different machines over different networks, but by making the name space equivalent in all windows, this is transparent: the same commands and resources are available, with the same names, wherever the computation is performed.

http://erlangonxen.org

https://news.ycombinator.com/item?id=8801372 http://wiki.c2.com/?InfernoOs http://ngnghm.github.io/blog/2016/06/11/chapter-10-houyhnhnms-vs-martians/

recursion

https://github.com/tc39/proposal-frozen-realms

wanderer commented 7 years ago

Imagine, for example, a homoiconic language

what properties do homoiconic langs have that help us here?

void4 commented 7 years ago

Homoiconic languages have code (and sometimes state) serialization built in, otherwise it's not really necessary.

When considering AST-based languages like Lisp, a disadvantage of this property is that memory allocation and modification metering become weird and it is necessary to define tree serialization formats, unlike with linear memory variants (WASM etc.).

Btw, I have no idea what I'm doing, so thanks for the input :)

Do you think I should use a subset of WASM, even if it's not optimized for "self management" or recursive self hosting?

wanderer commented 7 years ago

what does "self management" mean here? and what kinda recursive properties are you looking for? wasm should be thought of as a language that humans write. Maybe a better question is to do a case study on a particular language and see if it can be compiled to or run on wasm efficiently.

void4 commented 7 years ago

"Self management" should refer to the ability of a process to manage its own resources.

"Recursive self hosting" should refer to the ability of a process to manage a computation (code+state) without using a second operating system process (green threading), either by communicating with the parent interpreter, or executing an interpreter itself (which should then be optimized by the parent interpreter).

Do you know any languages which have a unified representation of code, data and program state?
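
A rough sketch of what these two terms could look like in practice: a process owns a step budget, carves a part of it out for a child computation, drives that child inside the same OS process, and reclaims whatever is left. The `Budget` class and the generator-based `run` loop are assumptions made for the example, not an existing API:

```python
class BudgetExhausted(Exception):
    """Raised when a budget cannot cover a requested charge."""


class Budget:
    def __init__(self, steps):
        self.steps = steps

    def charge(self, n=1):
        if n > self.steps:
            raise BudgetExhausted()
        self.steps -= n

    def split(self, n):
        """Carve a child budget out of this one."""
        self.charge(n)
        return Budget(n)

    def reclaim(self, child):
        """Take back whatever the child did not use."""
        self.steps += child.steps
        child.steps = 0


def run(task, budget):
    """Drive a generator, charging one step per resumption.

    Returns (finished, result); an unfinished task stays suspended.
    """
    while True:
        try:
            budget.charge()
        except BudgetExhausted:
            return False, None
        try:
            next(task)
        except StopIteration as done:
            return True, done.value


def count_to(n):
    total = 0
    for i in range(1, n + 1):
        total += i
        yield
    return total


parent = Budget(10)
child = parent.split(5)                     # parent decides what the child may use
print(run(count_to(3), child))              # (True, 6)
parent.reclaim(child)
print(parent.steps)                         # 6

big = count_to(10)
print(run(big, Budget(5)))                  # (False, None): suspended, not killed
print(run(big, Budget(100)))                # (True, 55): resumed with a new budget
```

The child never sees more resources than its parent granted, and the same `run` function could be called from inside a child to host a grandchild, which is the recursive part.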

void4 commented 7 years ago

@wanderer Have you seen http://www-systems.cs.st-andrews.ac.uk/gh/pub/gh-03.pdf and http://genode.org/ ?

wanderer commented 7 years ago

yep! pretty interesting. But thought about toying with this in primea-hypervisor with webassembly. So the way to do it in wasm is to make the memory of the wasm instance persistent. The current problem with this is that it would also require a new programming lang. But it shouldn't be hard to do

On Fri, Jun 16, 2017 at 7:15 AM, void4 notifications@github.com wrote:

@wanderer https://github.com/wanderer Have you seen http://www-systems.cs.st-andrews.ac.uk/gh/pub/gh-03.pdf ?


void4 / notes