Serializing the state of the Starlark program/REPL

Is there a way to serialize the state of a Starlark execution at a certain point of execution, and sometime later load from the serialized state to continue execution?

Context: I am implementing formal methods system that uses a Python'ish language. To simplify the implementation, I am using Starlark (with Go), so execute individual statements. It is a model checker, so whenever there is a choice of two possible transitions, the model checker will explore both the possibilities. For example, randomly choosing between head/tail will be model checked as, the state of the system will be cloned. In one world, the coin would have come out head and we will explore that path. In the other world, the coin would have landed tail and we will explore that path.

To implement this, I see only a few options.

Execute one statement at a time: This is the current implementation, but slow. After each statement, the state is just the global variables. That is, loops, function calls, conditionals etc have to be reimplemented. After each statement execution, the global variables are frozen. So, these cannot be used as it is to execute the next statement. Therefore, all the global variables should be manually cloned recursively. This has two major issues.
- This is slow
- maintaining the reference structure is harder :( For the initial launch and small models, this is okay, but the performance issues show up badly as the users model larger systems.
Use ExecREPLChunk: Almost the same as the previous approach, but using REPL related exec methods prevent freezing the variables. This reduces the need to deep clone the state. This also has two issues.
- This is slightly better than before, but the API doc says, the API compatibility might be broken in the future release. I understand this comment was added in the source code long long ago, but curious if this is still the case.
- Maintaining the reference structure is still hard

I am looking for any alternate options available to clone or serialize/load the state of the execution.

If it is possible to serialize the REPL state and be able to restore, this would simplify the implementation significantly. Is this possible?

I'll share here my responses to another user who contacted me privately asking for a very similar feature. I've paraphrased their questions and edited my answers a little. To be clear, this feature is complex and invasive and I am not convinced it is worth supporting (or even fully feasible) in this repo; my advice below should be thought of as how to prepare of a fork of the interpreter that supports it.

...

How can I halt execution at a point, save the entire execution state in a persistent manner (on disk or db, for example), and then continue execution at a further point in time?

Does the state in your case involve running threads, or just the state of the heap after all threads have finished? The latter is a strictly simpler problem because it doesn't require you to get into the guts of the interpreter to the same degree; you just need to implement a GC-like marking phase over the heap and, for each object, serialize it. Of course it requires that you know how to serialize every type of object you encounter, so it needs the "closed world" assumption: it can be implemented if you control the entire application, but not as a library linked against unknown new types of starlark.Value.

Given the closed-world assumption, it seems like it should be relatively easy for you to fork starlark-go and add the hooks you need; you shouldn't need to change the original code very much to do this, which should make it easy for you to keep up with patches (which are in any case infrequent). Do you expose any Go APIs that mention starlark-go? If so, this would of course make the problem harder.

We completely control the types of values available.

You would need to serialize the state of every thread. That means every Starlark frame (the operand stack, and all values reachable from it; the iterator stack, and all iterators) and every Go frame, including the local state of functions like sorted, min, and max, which all make Go->Starlark calls. Min and max make more than one, so you would need to remember the logical program counter too; and you'd need to do this for every built-in you've defined that can make callbacks (and for every future one that you add). If any of them hold locks, you'd need to record that. And then you need to arrange for both Starlark and Go frames to be resumable at a given logical program counter that makes a starlark.Call. And of course you'll also need to make sure that both ends of the channel agree on the Starlark version: you can't suspend in one version and resume in another.

Why do sorted, max, min, etc, have a local state?

If your task is to save the stack of a Starlark thread so that it can be resumed later, then you need to save the state of any Starlark function implemented in Go that happens to be active too. Functions like sorted, min, and max make callbacks, so can appear in the middle of the stack, not just as a leaf. Therefore you will need to redesign those functions so that they can be suspended and resumed. Of course, it's highly unlikely that any of these three particular functions will make a callback that does more than compare two values, far less trigger a thread suspension. But that's the discipline required by the model; and perhaps in future you will need to add more important Go functions that make Starlark callbacks.

what if we restrict ourselves to suspending stacks of pure Starlark functions?

It does make it simpler. You would need the act of executing the suspend operation to cause every active frame on the stack to record its state into a suspension (serializable continuation). You could do that by handling ErrSuspend after each CALL operation at each frame, or you could record the necessary information beforehand, similar to the way we update frame.pc for each program counter increment. Either way, you would need to ensure the operand stack and iter stack were saved in the frame or continuation. That's the serialization part. For deserialization, you would need to change Call and Function.CallInternal so that, if called on a Thread with an associated continuation to resume, it would rehydrate the operand stack, set the program counter, and enter the loop at the resumed PC ("in the middle"), immediately triggering another Call, which would recursively restore the rest of the old stack. The thread's "resume" continuation would be discarded and execution would continue as normal.

Thanks @adonovan for the detailed answer.

In my case, it must be a lot simpler.

I always run with a single thread. The state needs to be captured only after that thread completes. No need to pause a running thread.
Closed world assumption. The users can define a new type using a custom language, that would be translated to a specific go type that the application manages. (They all instantiate a single go type Role, with different name/fields). It does not expose a go library for users to create or directly manage starlark-go types.

So the solution still is to manually go over all the objects in heap and construct the object graph like GC's Mark phase.

Thanks @adonovan for the detailed answer.

In my case, it must be a lot simpler.

I always run with a single thread. The state needs to be captured only after that thread completes. No need to pause a running thread.

Closed world assumption. The users can define a new type using a custom language, that would be translated to a specific go type that the application manages. (They all instantiate a single go type Role, with different name/fields). It does not expose a go library for users to create or directly manage starlark-go types.

So the solution still is to manually go over all the objects in heap and construct the object graph like GC's Mark phase.

I agree, those two assumptions do make things a great deal simpler as they mean the only types of Value you need to deal with are your Protean "Role" type, plus those defined by the interpreter itself. The main types--string, list, dict, and so on--are all API-complete, so you can make a perfect clone of a dict by making queries on the public API of the original one. But that's not the case for a handful of opaque values such as that returned by the range function: for those, you'll need to create new API to access their representation. I suggest you start with a quick prototype to see how much new API is needed. If it's just the range type, we could make it API-complete too, but I suspect there are others (e.g. string iterators).

Actually, range is immutable, so I don't even need to clone. For now, the main reason I was asking for serialization is it has a higher likelihood of being already present or to be implemented than a clone functionality (like CompiledProgram type to store and load compiled code for repeated execution). That is, if I could have a way to serialize REPL session and reload from a saved state, I thought I could use it for cloning. And coincidentally, https://github.com/google/starlark-go/blob/master/repl/repl.go had LoadModule function, that looked similar to Python library that can save/reload the interpreter session https://github.com/uqfoundation/dill/blob/e09908f4ba480ee326f538aa05715c1cc919bdf2/dill/session.py#L328

In any case, I will implement it the way you suggested.

google / starlark-go

Serializing the state of the Starlark program/REPL #557