urbit / pm

Core Development project management

Core Architecture, ~2023.11.20 #14

Open belisarius222 opened 11 months ago

belisarius222 commented 11 months ago

We discussed the Seer proposal, involving removing the Nock 12 opcode and the corresponding .^ ("dotket") rune. After a lot of back and forth, we came to a consensus that we will first add a scry move type to Gall agents and add a new metacircular interpreter gate like +mule or +mole that runs Nock that cannot contain Nock 12 opcodes.

Once we have those two features, we will experiment with userspace code that doesn't use dotket. If that code works well, we will deprecate the .^ rune and the Nock 12 opcode in favor of move-based scrying. This is a conservative, incremental approach that will not break any existing code and lets us try out the new style in anger before committing to removing Nock 12.

For some background: The Nock 12 opcode and the dotket rune are used by userspace code to ask the kernel for data in the scry namespace. When Urbit runs userspace code, that code is not raw Nock. It is instead called Mock, which is Nock with two additions: deterministic errors are caught and returned as values to the kernel, and there is an extra Nock opcode, opcode 12. A simple Nock 12 formula looks like this: [12 [1 /foo/bar/baz]], where /foo/bar/baz is a path literal [%foo %bar %baz ~], representing a path in the scry namespace. The result of running this opcode will be the result of performing the scry request, i.e. dereferencing the path.
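As a rough illustration, the behavior of opcode 12 can be modeled in Python as a namespace lookup injected into an otherwise pure evaluator. This is a toy sketch, not real interpreter code; the function and variable names are invented for this example:

```python
# Toy model of the Nock 12 opcode: evaluate the sub-formula to obtain a
# scry path, then dereference that path via the scry handler.
# Illustrative names only; this is not the real Mock interpreter.

def eval_mock(subject, formula, scry):
    """Evaluate a (drastically simplified) Mock formula against a scry handler."""
    op = formula[0]
    if op == 1:                    # Nock 1: constant
        return formula[1]
    if op == 12:                   # Mock-only opcode: scry
        path = eval_mock(subject, formula[1], scry)
        return scry(path)          # the result of dereferencing the path
    raise ValueError("opcode not modeled in this sketch")

# /foo/bar/baz as the path literal [%foo %bar %baz ~]
path_literal = ("foo", "bar", "baz")
namespace = {path_literal: 42}

# [12 [1 /foo/bar/baz]]: scry on a constant path
result = eval_mock(0, (12, (1, path_literal)), namespace.__getitem__)
```

Here the namespace is just a dictionary standing in for the scry handler gate; the real handler is described below.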

Catching deterministic errors is uncontroversial, but the Nock 12 opcode poses some difficulties and has been under scrutiny for years now. The most tangible problem with Nock 12 is that it prevents userspace code from persistently memoizing Nock computations; at the moment, only kernelspace code can hold onto Nock memoization caches across multiple Arvo events. The lifetime of a userspace memoization cache is limited to a single metacircular evaluation context, i.e. a single call to +mink or an equivalent gate that runs Mock (virtual Nock with the extra opcode).

The reason userspace code can't memoize Nock persistently is that it wouldn't be feasible for the runtime to prevent a jet mismatch in the metacircular evaluator gate. A gate like +mink takes in a Mock expression, i.e. a cell of [subject formula] and a "scry handler gate" (which I'll describe shortly), and returns a value that's either [0 computation-result] in the happy path, [1 block] in the case of a failed scry request, or [2 error] in the case of a deterministic error.

Error handling isn't that relevant to this discussion, but I'll include a brief description for thoroughness. The error in the [2 error] result of +mink is a deterministic error, meaning every attempt to evaluate that [subject formula] pair will always result in the same error value -- deterministic errors correspond to the lines in the Nock spec where *a evaluates to *a, i.e. a trivial infinite loop. Nondeterministic errors, such as running out of usable memory or being interrupted by the user, cause the computation to fail to complete; in that case the Arvo event is aborted and rolled back to the state it had before the event was run.

As I mentioned, the [subject formula] pair is not the only input to +mink. There is also a scry=$-(^ (unit (unit))) argument: the scry handler gate. When +mink encounters a Nock 12 opcode while evaluating the [subject formula] expression, it evaluates the sub-formula to obtain a path to look up in the scry namespace. Once it has this path, it calls the scry handler gate on the path, which returns either a [~ ~ value] positive scry response, a [~ ~] permanent scry failure, or a ~ "block". The positive response means the scry was performed successfully and yielded a value, in which case +mink injects the value as the result of the Nock 12 opcode and continues evaluating the Nock. The permanent failure means the scry namespace knows there will never be a value at that path, so +mink immediately returns a deterministic error.
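The three possible scry-handler answers and +mink's reaction to each can be modeled in Python as follows. This is an illustrative sketch of the dispatch just described, with invented names, not real +mink internals:

```python
# Toy model of how +mink reacts to the three scry-handler answers:
#   ~            -> block            -> +mink returns [1 block]
#   [~ ~]        -> permanent fail   -> +mink returns [2 error]
#   [~ ~ value]  -> positive answer  -> inject the value, keep going

BLOCK = None           # models ~          : the request was blocked
FAIL = (None,)         # models [~ ~]      : path will never have a value

def ok(value):         # models [~ ~ value]: positive scry response
    return (None, None, value)

def mink_scry_step(path, scry_gate):
    """What +mink does once a Nock 12 opcode has produced `path`."""
    answer = scry_gate(path)
    if answer is BLOCK:
        return ("block", path)                    # [1 block]
    if answer == FAIL:
        return ("err", "permanent scry failure")  # [2 error]
    return ("ok", answer[2])    # inject the value as the opcode's result
```

In the "ok" case the real +mink continues evaluating the surrounding Nock with the injected value; this sketch just returns it.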

A scry block means the system refused to answer the question, i.e. the request was "blocked" for some reason -- this could be due to lack of permission, or because the request asked about the future, or the request asked for data from another ship that isn't synchronously available within this Arvo event. If the scry blocks, +mink returns a [1 block], where block is the scry request path whose value could not be resolved.

The difficulty with the Nock 12 opcode is that it makes the result of the computation dependent on external data that is not in the [subject formula] being run. This means the result is no longer a pure function of the subject and formula: depending on what the scry handler gate is, you could get different answers. Consider a [subject formula] of [0 [12 1 /foo/bar]]. This returns the result of looking up the /foo/bar scry request path in the scry gate. Userspace code (which is untrusted) could call (mink [0 [12 1 /foo/bar]] |=(^ [~ ~ 33])), get the result 33, then call (mink [0 [12 1 /foo/bar]] |=(^ [~ ~ 44])) and get 44. The subjects and formulas were identical, but because they were evaluated with different scry gates, they produced different results.
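The same point in a self-contained Python sketch (illustrative names, not real +mink): identical subject and formula, two different scry gates, two different results.

```python
# Same [subject formula], two different scry handler gates: the outcome
# differs, so it is not a pure function of subject and formula alone.
# Illustrative model.

def run_mock(subject, formula, scry_gate):
    # Model just enough: a formula [12 1 path] performs a scry on `path`.
    op, const_op, path = formula
    assert op == 12 and const_op == 1
    return scry_gate(path)           # the gate decides the answer

expr = (12, 1, ("foo", "bar"))       # models [12 1 /foo/bar]
a = run_mock(0, expr, lambda p: 33)  # a gate that always answers 33
b = run_mock(0, expr, lambda p: 44)  # a gate that always answers 44
# same subject (0) and formula, different gates: a != b
```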

How is this considered kosher, given Urbit is supposed to be a pure-functional system? The answer is that Mock, virtual Nock, is purely functional (mathematically, a "partial function" like Nock itself) as long as it's only ever run using the real scry namespace, which is referentially transparent. That is to say, all data in the scry namespace can be thought of as something like axiomatic data that everyone agrees on and never changes: all virtual Nock computations will get the same answers to all Nock 12 opcodes that complete, because everyone agrees on which value is bound to each path in the namespace.

You could alternatively think of the scry namespace as an extra implicit argument passed to every Mock computation along with the subject and formula. This extra argument is discovered dynamically over time as more ships bind more values, but since any path can only be bound once, it's immutable and all Mock expressions that dereference some scry path will either fail to complete or complete with identical results, hence pure functional.
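The "bound at most once, then immutable" property can be modeled as a write-once mapping. A Python sketch under invented names:

```python
class ScryNamespace:
    """The scry namespace modeled as a write-once mapping: any path can
    be bound at most once, so lookups are referentially transparent.
    Illustrative Python model, not the real Arvo namespace."""

    def __init__(self):
        self._bindings = {}

    def bind(self, path, value):
        # Bindings are discovered over time, but immutable once made.
        if path in self._bindings and self._bindings[path] != value:
            raise ValueError("scry path already bound to a different value")
        self._bindings[path] = value

    def scry(self, path):
        # None models ~ (a block): no binding is known yet.
        return self._bindings.get(path)

ns = ScryNamespace()
ns.bind(("foo", "bar"), 42)
# ns.bind(("foo", "bar"), 43) would raise: bindings never change
```

Every lookup of a bound path returns the same value forever; an unbound path blocks rather than answering, which is why a Mock computation either fails to complete or completes identically everywhere.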

Note that there's no requirement that all the scry gates be noun-equivalent (intensionally equal), or even contain the same set of namespace paths and values (extensionally equal). The requirement is that no two scry gates ever return different values for any given scry request path. A scry gate that always blocks is trivially legitimate, since it never gives any answers, much less answers that differ from those of another scry gate.

The Arvo kernel's scry handler gate, as described in the original Moron Lab blog post about Urbit, is a "static functional namespace", so even though that gate changes frequently, it maintains the invariant that every scry path's binding to a value is immutable -- well, up to some known flaws that are being addressed in UIP-0116: Arvo Ticks.

As long as a correct Arvo kernel is supplying the scry handler, Mock retains its purely functional nature. But consider again the case where userspace code runs the same [subject formula] twice, but with two different scry gates that give different answers to some scry path. From the perspective of a simple Nock interpreter, it's still deterministic: +mink is just some Nock function that runs some other computation (Mock) on some inputs. Because it's just some Nock function, it can't break determinism of the Arvo event; replaying the Arvo event will still yield the same result. The problem only arises when trying to get clever about how the runtime evaluates Nock, in particular by caching the results of Nock computations.

Development versions of Vere support persistent Nock memoization. This means a programmer can emit a Nock hint to the interpreter asking the interpreter to cache the result of the computation, so that if it gets run again, the result will be retrieved from the runtime's cache instead of rerun step by step. The cache key is a [subject formula] pair. When the interpreter sees a %memo hint, it looks at the [subject formula] that the hint surrounds. If the memoization cache has a value for that key, the runtime uses that value as the result of the computation. Otherwise the runtime runs the computation, inserts the result into the cache, and returns the result.
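The cache discipline just described can be sketched in a few lines of Python. This models the %memo hint handling only conceptually; all names are invented and this is not the Vere implementation:

```python
# %memo hint modeled: the cache key is the [subject formula] pair; on a
# hit the interpreter skips the computation entirely. Illustrative only.

memo_cache = {}
evaluations = []   # track how many times we actually compute

def eval_with_memo(subject, formula, evaluate):
    key = (subject, formula)          # cache key: [subject formula]
    if key in memo_cache:
        return memo_cache[key]        # hit: reuse the cached result
    result = evaluate(subject, formula)
    memo_cache[key] = result          # miss: compute, cache, return
    return result

def expensive(subject, formula):
    evaluations.append(1)             # count real evaluations
    return ("result-of", subject, formula)

r1 = eval_with_memo(0, (1, 7), expensive)
r2 = eval_with_memo(0, (1, 7), expensive)   # served from the cache
```

The second call never reaches `expensive`; only the key comparison runs. This is exactly what makes a scry-dependent computation unsafe to cache this way, as the options below explore.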

For the moment, only the kernel can place items into the cache. Userspace code runs Mock, so if it were allowed to place items into the cache, there would be three options:

1) Let userspace code place arbitrary values into the cache. This poses a severe security problem: event replay would become nondeterministic, influenced by the contents of the memoization cache. Malicious userspace agents could mount a kind of denial-of-service attack by preventing event replay, or, possibly worse, inject cache lines that other applications would use, causing those applications to malfunction in a way the malicious code controls. The worst case would be causing a system app such as %hood to fail to enforce a security check, letting the malicious code escalate its own privileges and then ask the system app to do something like reinstall the kernel.

2) Place the scry gate into the cache key for the memoization cache, so a cache key would be [[subject formula] scry-gate] instead of just [subject formula]. This would close the security hole, but since Arvo's scry gate undergoes small changes between every agent activation, it would make these cache lines useless outside of the immediate execution context within a single +mink call -- exactly the situation userspace code is already in, so this wouldn't help anything.

3) Have the interpreter record the set of scry [path value] pairs looked up by the computation. This would allow userspace code to perform scry requests safely, but the cost is a nontrivial amount of machinery in the runtime, along with a computational cost of looking up a cache line that grows linearly with the number of scry paths that were requested. Of the three options listed here, this is the most reasonable, but it is still highly nontrivial and the requirement to compare all requested paths could be somewhat onerous.
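Option 3 can be sketched as follows (a Python model under invented names, not a proposed implementation). Note how validating a cache hit requires replaying every recorded scry, which is the linear cost mentioned above:

```python
def cached_run(cache, subject, formula, scry_gate, evaluate):
    """Memoize a scry-dependent computation by also recording its scry
    dependencies. A cache line is reused only if every recorded
    [path value] pair still gets the same answer from the current gate."""
    key = (subject, formula)
    if key in cache:
        result, scried = cache[key]
        # Validation cost grows linearly with the number of scried paths.
        if all(scry_gate(p) == v for p, v in scried):
            return result
    scried = []
    def recording_gate(path):          # wrap the gate to record lookups
        value = scry_gate(path)
        scried.append((path, value))
        return value
    result = evaluate(subject, formula, recording_gate)
    cache[key] = (result, tuple(scried))
    return result

cache = {}
evaluate = lambda s, f, gate: gate(("p",)) + 10
r1 = cached_run(cache, 0, "f", {("p",): 1}.__getitem__, evaluate)  # computes
r2 = cached_run(cache, 0, "f", {("p",): 1}.__getitem__, evaluate)  # cache hit
r3 = cached_run(cache, 0, "f", {("p",): 2}.__getitem__, evaluate)  # recomputes
```

With a correct (write-once) namespace the validation can never fail, so the recording is pure overhead in the honest case; it only matters because userspace gates are untrusted.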

Instead of these three options, we have turned off the ability for userspace code to modify the Nock memoization cache. This is enforced by the jet for +mink. The jet is simply a call back into the normal Nock interpreter, which is really a Mock interpreter, whose state includes a list of nouns representing the stack of scry handler gates for the current execution context. If this list is null, then we must be in the kernel, so the code is allowed to modify the memoization cache; otherwise, we are in some kind of Mock context, and the code is not allowed to modify the cache.
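The enforcement rule reduces to a single check, sketched here in Python (illustrative names; the real check lives in the jet):

```python
# The jet's rule, modeled: the interpreter's state includes a stack of
# scry handler gates for the current execution context; a null stack
# means kernelspace, which alone may touch the memoization cache.

def may_modify_memo_cache(scry_gate_stack):
    # Empty stack: not inside any Mock context, so this is the kernel.
    return len(scry_gate_stack) == 0

in_kernel = may_modify_memo_cache([])              # no scry gates: kernel
in_mock = may_modify_memo_cache(["outer-gate"])    # inside +mink: Mock
```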

The concerns with dotket go deeper than memoization, even though memoization is the only known concrete issue with it, aside from some added complexity in the runtime when jetting the metacircular interpreter. Conceptually, the fact that a Mock computation has implicit external dependencies makes the pure-functional nature of the system more fragile -- the memoization issue is a symptom of this fragility.

Morally and aesthetically, there's also an issue: the fact that Nock makes no assumptions about any of its contents is part of the magic of Urbit. Nock is a true functional programming language with no implicit external dependencies at all, and no constraints on the nouns it operates on. Anything that pollutes this vision is potentially harmful to the environment Urbit creates for programs and their authors. Mock's implicit dependency on the namespace makes Nock context-dependent instead of context-independent, as it should be.

There's also an issue that userspace code does not have access to raw Nock, only Mock. The Hoon compiler willingly outputs Nock 12 instructions when handed a dotket rune, and there is no means for userspace code to signal that it will refrain from using Nock 12. One of the two action items from our discussion yesterday was to add a new metacircular evaluator gate that does not run Nock 12s, so that userspace code run inside that gate can reestablish the guarantee that it is purely functional, allowing it to modify the memoization cache. Note that it's still ok for that gate to be called from Mock code, or even nested Mock code, that had previously performed scry requests that were not referentially transparent: raw Nock is context-free, so no matter how the [subject formula] cell was created, the result of its evaluation is still pure.

The other action we will take is to add a move-based interface for scrying to the Gall interface. A Gall agent will be able to emit a new kind of move, in addition to all the moves it can emit now, that asks Gall to perform a scry request on the agent's behalf. When Gall processes this move, it will perform the scry request and convert the result of the scry into a response move that will be delivered back into the agent in a later activation of the Gall vane, in accordance with the normal move processing order that Gall and Arvo use. This means an agent could emit a list of moves like [poke-1 scry-A poke-2 scry-B poke-3 ~], and Gall will handle all of those moves in the same order as if they were all pokes, i.e. the kernel will not introduce any exceptions to its normal move processing order for scry request moves.
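The ordering guarantee can be sketched in Python (illustrative names; this is a model of the described behavior, not Gall's implementation):

```python
# Gall's in-order move processing, modeled: scry-request moves are
# handled in exactly the same order as pokes, with no special-casing.

def process_moves(moves, do_poke, do_scry):
    handled = []
    for kind, payload in moves:
        if kind == "poke":
            do_poke(payload)
        elif kind == "scry":
            # The scry response is delivered back to the agent in a
            # *later* activation; nothing is returned inline here.
            do_scry(payload)
        handled.append(payload)
    return handled

# The example list [poke-1 scry-A poke-2 scry-B poke-3 ~]:
order = process_moves(
    [("poke", "poke-1"), ("scry", "scry-A"), ("poke", "poke-2"),
     ("scry", "scry-B"), ("poke", "poke-3")],
    do_poke=lambda p: None,
    do_scry=lambda s: None)
```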

There is a general consensus that this new move-based scry interface is better suited as a lowest layer of the system than the Seer monadic interface. The fact that Seer or dotket (or at least an opt-in form of Mock, rather than it being default as it is now) could be implemented on top of moves, but not necessarily the other way around, militates in favor of moves at the lowest layer. Another benefit of this arrangement is that by making all scrys async, the same interface can be used for both local and remote scrys, since remote scrys are always async. Abstracting over local vs remote removes what would otherwise be an unavoidable branch point in userspace I/O code, and it also matches the rest of Gall's interprocess communication model, which does not distinguish between local and remote. A strong argument can be made that this is also the most "Grug-brained" option, which is not to be undervalued.

There remain concerns about application development patterns if we don't have dotket. Losing synchronous read access to other parts of the system potentially reduces the ergonomics of building up a transaction that naturally grows across applications. That said, move-based scrying can be implemented in a way that is almost synchronous for local scrys, and in particular it is still possible to perform multiple scry requests against the same system snapshot, meaning the requesting agent will see a consistent view of the system without any tearing.

Gall will also need to run the +on-peek arm of a Gall agent in the Seer monad, i.e. that arm will return either [%done scry-result] to indicate it's done or [%scry scry-request-path callback] to ask Gall to perform a scry request, in which case Gall performs the scry, injects the result into the callback gate, and repeats, until a %done is returned. This should not be a problem really, but it's a form of complexity creeping into the Gall interface.
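The %done/%scry driver loop described above can be sketched in Python. Tags and shapes follow the description in the text; everything else is invented for illustration:

```python
# Driver loop for running +on-peek in a Seer-style monad: the arm
# returns either ("done", result) or ("scry", path, callback); the
# driver performs the scry, feeds the value to the callback, and
# repeats until a "done". Illustrative Python model.

def run_on_peek(action, scry):
    """action: initial monadic step; scry: path -> value."""
    while True:
        if action[0] == "done":
            return action[1]               # [%done scry-result]
        _, path, callback = action         # [%scry path callback]
        action = callback(scry(path))      # inject the result, repeat

# Example: a peek that scries two paths and sums their values.
namespace = {("a",): 1, ("b",): 2}
action = ("scry", ("a",),
          lambda x: ("scry", ("b",),
                     lambda y: ("done", x + y)))
total = run_on_peek(action, namespace.__getitem__)
```

Each callback closes over the values scried so far, which is how a chain of dependent scrys threads through the monad.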

Because of these concerns with the proposed move-based replacement for dotket, we will first add it as a new feature alongside dotket before committing to removing dotket. That way if the application patterns end up suffering inordinately, we could decide to keep dotket. That being said, plan A is to deprecate dotket once we've verified that move-based scrying provides a good enough application model, and then eventually remove it.

Quodss commented 9 months ago

If the Seer monadic interface is going to be built on top of moves, I suggest expanding this interface to add a poke+watch-one interaction, to simplify handling of non-referentially-transparent data from agents.