When can we specify the order in which multiple code cells run by using Edges?

Woomou commented 1 year ago

https://about.cocalc.com/2022/09/08/all-about-computational-whiteboard/ describes a feature: specifying the order in which multiple code cells run by using edges (directional arrows) between Jupyter cells. How is this feature progressing? I would like to contribute to the development of this useful feature. 😘 @swenson

williamstein commented 1 year ago

Good question. It should be easy for me to implement this with a little design help. Can you tell me what your personal preference is regarding exactly how the edges influence or determine the run order? That would help a lot, since I got kind of stuck on deciding that.

Thanks!

Woomou commented 1 year ago

@williamstein Here is a simple model based on my own needs; please take a look at this small design. The run order of cells that support branching (multiple edges) is an extension of the run order of linear cells, so the execution environment of linear cells has to be preserved. For the IPython kernel, that means maintaining an IPython kernel context.

In addition, for me, the point of supporting a branching run order is to serve the needs of flexible scientific exploration and to create a new way of collaborating on code.

I mean, if branching cells are represented as a dynamically growing graph of code, isn't that a kind of Git? The difference is that this Git operates in a dynamic, interactive environment. So we at least have a reference point: a Git inside an interactive Python environment, fully interactive and internalized, with no calls to an external Git, depending only on the IPython kernel and the CoCalc manager.

The best mathematical form to support branching is probably a directed acyclic graph (DAG). Because the dynamic interactive environment has to be maintained, it should be an extended DAG.

In what follows I assume that a cell is the smallest unit of code and cannot be modified. Any modification of a cell is treated as making two cells and obtaining two branches. This makes branching the most fine-grained operation and ensures that no other operation can break the context of a branch. (If cell modification has to be allowed, I think it can be achieved through a more flexible branch design.)

Consider a simple linear execution from cell A to cell B: the whole kernel starts at A, and some code is imported at B:

graph TB
    1[A started] --> 2[B import something]

To be specific, suppose the numpy library is imported at B. Then A can only use simple (native Python) numerical operations, while B supports advanced numerical operations obtained by object combination (higher-level classes from numpy):

graph TB
    1[A use primitive numerical] -- object combination --> 2[B use compound complex numerical]

Now two scientists each decide to open a branch to explore a big problem. The simplest way is to open the second branch from the same cell where the first branch was created (that is, B):

graph TB
    1[A] -- object combination --> 2[B]
    2[B] -- Kernel 1 --> 3[C]
    2[B] -- Kernel 2 --> 4[D]

The multi-kernel approach (here, "kernel" is just another way of describing a linear run order) gives each of these two Python users a linear interactive Python environment, and the two scientists share all the context up to the beginning of the branch.

graph TB
    1[A] -- object combination --> 2[B]
    2[B] -- Kernel 1 --> 3[C]
    2[B] -- Kernel 2 --> 4[D]
    3[C] -- succeed --> 5[C1]
    3[C] -- failed --> 6[C2]
    5[C1] -- create new package --> 2[B]
    6[C2] -- end kernel --> 7((X))
    5[C1] -- not allowed --> 4[D]
    6[C2] -- not allowed --> 4[D]
    3[C] -- not allowed --> 4[D]

Then the scientist on Kernel 1 decides to open two smaller branches, C1 and C2, to solve their task. Since new branches are specified, new run orders are generated. Suppose C1 succeeds and C2 fails. Kernel 1 would then merge the C1 branch and create an arrow pointing to B. For every other branch whose run order contains B, C1 should be available as a callable package. Kernel 1 terminates the C2 kernel to save computing resources.

Can anyone do a cross-kernel operation, that is, let an arrow from the C series point directly to the D series? This is destructive and may cause conflicts because the contexts are unknown.

The following is a hand-drawn sketch: [image: sketch]

This is interactive Git 1.0, which has the following necessary components:

A monitor process that saves the code of each cell and its position in the graph structure. When any user requests a new branch, the monitor can assemble the correct linear sequence of cell code up to the cell where the branch starts, and create a real kernel with that linear run order.
Optional kernel termination, which recycles unused kernel branches and asks the user whether to terminate some kernels after a branch merge.
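
A minimal sketch of what the monitor's replay step could look like, in Python. It assumes that every cell has a single predecessor on its linear path (the `parent` map), and it uses a plain `dict` namespace with `exec` as a stand-in for a real kernel. The names `linear_prefix` and `start_branch_kernel` are hypothetical, not CoCalc or IPython APIs, and the standard `math` module replaces numpy so the example has no dependencies.

```python
def linear_prefix(parent, cell_id):
    """Return the cell ids from the root down to cell_id, in run order."""
    path = []
    while cell_id is not None:
        path.append(cell_id)
        cell_id = parent[cell_id]
    path.reverse()
    return path


def start_branch_kernel(parent, source, branch_point):
    """Replay the linear prefix into a fresh namespace (stand-in for a kernel)."""
    namespace = {}
    for cell_id in linear_prefix(parent, branch_point):
        exec(source[cell_id], namespace)
    return namespace


# Example graph from above: A -> B, then B -> C (Kernel 1) and B -> D (Kernel 2).
parent = {"A": None, "B": "A", "C": "B", "D": "B"}
source = {
    "A": "x = 2",
    "B": "import math\ny = math.sqrt(x * 8)",
    "C": "result = y + 1",
    "D": "result = y * 10",
}

kernel_1 = start_branch_kernel(parent, source, "B")
kernel_2 = start_branch_kernel(parent, source, "B")
exec(source["C"], kernel_1)  # C runs only in Kernel 1
exec(source["D"], kernel_2)  # D runs only in Kernel 2
```

Both namespaces share the context built by A and B, but `result` differs between them, which is the isolation the two-branch picture asks for. Replaying the prefix is only one possible way to seed a branch kernel; it re-executes code rather than copying live state.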

Git 1.0 still has some problems. In the example above, C1 creates a new package. How and when does D learn that it can use this package? I think all the kernels sharing B should have the right to decide whether to use it. A notification service would be in charge of broadcasting the package update to all of these kernels, and the kernel users would then import the C1 package as they need it.

Git 1.0 is a dynamic graph structure. As for how to determine the order in which edges are created, I think it doesn't matter, because the kernels run in parallel. What does matter is where edges are created, that is, the management work of adding edges to cell nodes. I recommend using the graph model for this: the number of branches at a node can be measured by its out-degree, and the number of packages available to it is represented by its in-degree. The management of cells and edges can then be organized well with an existing graph algebra package.
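
As a small illustration of the degree idea, here is a sketch in Python that counts in- and out-degrees for the example graph using only the standard library. The edge list is my reading of the diagram above (the "end kernel" and "not allowed" arrows are left out), and no particular graph package is implied.

```python
from collections import Counter

# Edges of the example graph: run-order edges plus the C1 -> B "package" edge.
edges = [
    ("A", "B"),
    ("B", "C"),
    ("B", "D"),
    ("C", "C1"),
    ("C", "C2"),
    ("C1", "B"),
]

out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

branches_at_b = out_degree["B"]  # edges leaving B: to C and to D
incoming_at_b = in_degree["B"]   # edges entering B: from A and from C1
```

One caveat under this reading: the in-degree of B also counts the ordinary run-order edge from A, so the number of packages is the in-degree minus that edge, unless package edges are stored separately.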

This is an initial design of edge-wise cell Git 1.0.

williamstein commented 1 year ago

@Woomou I greatly appreciate the thought you put into this. In particular, I appreciate your suggestion to think about the problem as being analogous to git branches, since that's also a familiar model for collaboration and problem solving.

A key thing that git has is that branches are always named. That suggests that I add a feature for naming branches in the whiteboard as well. For example, you create a new compute cell [A] and it gets a default branch name of "main". You then create a new cell [B] and an edge [A]-->[B] and this is still part of the main branch. Next when you create a new cell [C] and an edge from [A]->[C], you'll be prompted for a new "branch name" that you have to make up, and let's say you call it "test".
Then you do numerous additional steps from [B] and numerous from [C]. Also, at some point you branch something off several steps below [C], are again prompted for a new branch name, and call that one "algebra" (say).

Now you come back a day later, look at [A] and want to "evaluate all". The button to evaluate has a searchable dropdown and it lists three named branches: "main", "test", and "algebra". You click to select "algebra" and the cells along that line in the graph are evaluated in order, but nothing else.

The above approach is reasonably predictable and familiar. It also suggests constraints that will be imposed when creating new edges between compute cells, e.g., no cycles of any kind (the directed graph should in fact be a tree, like with git commits), and there's a requirement to name the branches, like with git.
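
To make the named-branch idea concrete, here is a rough Python sketch under my own assumptions: the class `BranchTree` and its methods are hypothetical, each cell gets exactly one incoming edge (which keeps the graph a tree), the first child of a cell continues its parent's branch, and any further child must supply a new branch name.

```python
class BranchTree:
    """Compute cells arranged as a tree; every cell belongs to a named branch."""

    def __init__(self):
        self.parent = {}  # cell id -> parent cell id (None for the root)
        self.branch = {}  # cell id -> branch name
        self.tip = {}     # branch name -> last cell added on that branch

    def add_cell(self, cell_id, parent=None, branch_name=None):
        """Add a cell with one incoming edge parent -> cell_id."""
        if cell_id in self.parent:
            # A second incoming edge would create a merge or a cycle.
            raise ValueError(f"cell {cell_id!r} already exists")
        if parent is None:
            if self.parent:
                raise ValueError("the tree already has a root")
            name = "main"
        elif parent not in self.parent:
            raise ValueError(f"unknown parent {parent!r}")
        elif parent in self.parent.values():
            # The parent already has a child, so this edge forks a new branch.
            if branch_name is None or branch_name in self.tip:
                raise ValueError("a fork needs a new, unused branch name")
            name = branch_name
        else:
            name = self.branch[parent]  # continue the parent's branch
        self.parent[cell_id] = parent
        self.branch[cell_id] = name
        self.tip[name] = cell_id

    def run_order(self, branch_name):
        """Cells from the root down to the tip of the given branch."""
        path, cell_id = [], self.tip[branch_name]
        while cell_id is not None:
            path.append(cell_id)
            cell_id = self.parent[cell_id]
        return path[::-1]


tree = BranchTree()
tree.add_cell("A")                               # root, branch "main"
tree.add_cell("B", "A")                          # continues "main"
tree.add_cell("C", "A", branch_name="test")      # fork at A
tree.add_cell("C2", "C")
tree.add_cell("C3", "C2")
tree.add_cell("E", "C2", branch_name="algebra")  # fork below C
tree.add_cell("E2", "E")
algebra_cells = tree.run_order("algebra")
```

Selecting "algebra" in the dropdown would then correspond to evaluating `algebra_cells`, that is A, C, C2, E, E2 in that order, and nothing else.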

williamstein commented 1 year ago

(Aside) I just want to record here that it would also be interesting to extend the whiteboard to allow it to view the git commit history of any git repo. I.e., create an autogenerated whiteboard from a git repository, with info in each node about that commit, and maybe a link from there to github (for now) for more details about the commit, etc. Some git repo histories are huge, with tens of thousands of commits, so this would be good motivation for making the whiteboard more scalable in certain ways. I've put a ton of work into using windowing/virtualization to make all the other editors in cocalc scale their rendering well, but haven't done this for the whiteboard.

Woomou commented 1 year ago

@williamstein 😊 I'm glad you see the value of this model and the possibility of extending this edgewise-jupytercell model into a powerful collaboration tool. If you don't mind, I'd be very happy to participate in the development of this feature. I'm an algorithm engineer, and Python and C++ are my strongest languages. Of course, CoCalc has a lot of TypeScript code, but I don't mind working on it as a full-stack developer. This is mainly because I personally believe that Git (imagine that we no longer need to start programs repeatedly, and that data and operations all live in the runtime state) has great potential for collaboration in such an interactive environment. Thorough interaction is better than incomplete interaction.

Woomou commented 1 year ago

The idea of whiteboard+Git can establish an effective order for a course, an experiment or any flexible small project. If a project has thousands of commits, we obviously need to design a flexible memory allocation system to reduce the growing resource burden of a whiteboard instance. Another problem may be the merging of cells into packages, which I mentioned in the Git 1.0 model above. For a relatively stable series of cells, do we need to merge them into one very large cell (which would take less screen space when the whiteboard is rendered)? After all, we always have to allocate more space to uncertain problems than to settled ones.

haraldschilly commented 1 year ago

I tried to follow the explanation, but it somehow blows my mind right now. My main concern is that this might become too complex for a user to build a mental model around, so they would not be able to use it effectively without too many surprises.

Here is my idea, something completely different:

The main objective is to evaluate more than one cell. Related to that, what is right now "one single cell" could in the future also be an ordered list of cells, like a mini-notebook. In the context of what I write here, "evaluating a cell" means running the one or more cells inside a node.
There are arbitrary edges connecting nodes that contain such cells. Any outgoing edge to another type of node is ignored (e.g. cell → text).
In the action bar of a Jupyter node there is a "Run" button right now. I suggest introducing another button called "Run strand", just like the usual notebook has "Run all". ["Run chain" also comes to mind … i.e. my point is just to add another button with a descriptive name.]
So the main question is: what happens if someone clicks on "Run strand"?
1. make sure the current kernel is started. There is just one kernel overall as it is right now.
2. Initialize an empty set of evaluated nodes. Each time a node is evaluated, we check if that node is in that set. If so, don't evaluate – otherwise add it to the set.
3. Run all the cells in the node.
4. Check the outgoing edges of the current node for Jupyter nodes it points to. If there is more than one, sort them vertically (top to bottom) and, if there is a tie, left to right. Pick the first/next node and start evaluating it, i.e. this is more or less a recursive logic.
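
The four steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions: `Node`, `run_strand` and the `run_node` callback are hypothetical names, whiteboard positions are plain `x`/`y` numbers with smaller `y` meaning higher up, and actually running cells in the single shared kernel is left to the callback.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: str
    x: float
    y: float            # smaller y means higher up on the whiteboard
    kind: str = "code"  # only "code" (Jupyter) nodes are evaluated


def run_strand(start, nodes, edges, run_node):
    """Evaluate start and everything reachable from it, each node at most once."""
    evaluated = set()  # step 2: nodes that were already evaluated
    order = []

    def visit(node_id):
        if node_id in evaluated:
            return
        evaluated.add(node_id)
        run_node(nodes[node_id])  # step 3: run all the cells in the node
        order.append(node_id)
        # Step 4: follow edges to Jupyter nodes, top to bottom, then left to right.
        targets = [nodes[dst] for src, dst in edges
                   if src == node_id and nodes[dst].kind == "code"]
        targets.sort(key=lambda node: (node.y, node.x))
        for target in targets:
            visit(target.id)

    visit(start)
    return order


# A fork that merges again, one edge to a text node, and one edge closing a circle.
nodes = {node.id: node for node in [
    Node("A", 0, 0), Node("B", -1, 1), Node("C", 1, 1),
    Node("D", 0, 2), Node("T", 2, 1, kind="text"),
]}
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "T"), ("D", "A")]
order = run_strand("A", nodes, edges, run_node=lambda node: None)
```

With this layout the order is A, B, D, C: the left strand runs to the end before the side node C, the text node T is ignored, and the edge from D back to A does not trigger a second pass.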

Behavior:

A linear chain of cells runs along the edges.
If there is one cell at the top and two or more "strands" of nodes originate from it, evaluation happens in each strand, one strand after the other. If you run the common ancestor cell, all strands are evaluated. Or you could evaluate just one strand without the others being run (like in the linear case above). I call this the "star pattern", because there is a center with rays originating from it.
Another outcome is kind of the other way around: several strands are at the top, and at the bottom is a common child cell. The cells above could e.g. load different models, different functions, etc. (they define the same variable names, but with different code/data). The common child at the bottom could be a visualization (using these objects with the same variable names). That way, you could rather quickly build alternative chains of setup cells as precursors for the "actual calculation" you want to check out. Let's call this the "attractor pattern", since several strands end up in the center.
A fork of nodes which merge later (still a DAG) doesn't make much sense, though. First one strand runs to the end, then the "side nodes", where the fork is, will run. I don't think this will be used, but I'm still adding it to clarify how the algorithm above will play out.
A circle of nodes runs around once, then stops.

Woomou commented 1 year ago

That surprised me! In your very modular scheme, a single kernel can meet the demand of maintaining an interactive environment, but I want to know whether the creation and removal of any number of nodes can be handled in a logically consistent way within that single kernel, especially when some nodes contain at least one self-loop. @haraldschilly

williamstein commented 1 year ago

We had a brainstorming discussion and decided one safe way to do this would be to simply require the user to check a box to resolve ambiguity:

sagemathinc / cocalc

When can we specify the order in which multiple code cells run by using Edges? #6168