martinvonz / jj

A Git-compatible VCS that is both simple and powerful
https://martinvonz.github.io/jj/
Apache License 2.0
7.66k stars 256 forks source link

Clarify what topological order means for commits #3197

Open emesterhazy opened 5 months ago

emesterhazy commented 5 months ago

@martinvonz what do you consider to be the "canonical" topological order for commits in the repo? I would expect that it is parents before children, where the DAG is written with edges from parents to their children. That definition matches the one used in Index::topo_order here:

https://github.com/martinvonz/jj/blob/96bf19023475e6a208524f20af7ed0a28effe547/lib/src/index.rs#L90-L91

But we also have conflicting explanations in Revset:

https://github.com/martinvonz/jj/blob/96bf19023475e6a208524f20af7ed0a28effe547/lib/src/revset.rs#L2402-L2407

I am pretty sure that iter works in reverse topological order since it seems like the positions are reversed based on a quick survey of some of the implementations:

https://github.com/martinvonz/jj/blob/96bf19023475e6a208524f20af7ed0a28effe547/lib/src/default_index/revset_engine.rs#L369-L381

https://github.com/martinvonz/jj/blob/96bf19023475e6a208524f20af7ed0a28effe547/lib/src/default_index/revset_engine.rs#L475-L487

And we can also see that the order of the positions is reversed before they're baked into EagerRevset.

https://github.com/martinvonz/jj/blob/96bf19023475e6a208524f20af7ed0a28effe547/lib/src/default_index/revset_engine.rs#L929-L934

I'm curious why we use reverse order for Revset. Either way, I think we should clearly label one of these orderings as "reverse topological order" to avoid confusion.


I think it might also be better to add a fn iter_topo to the Revset trait and implement fn iter by reversing the iterator returned by iter_topo. Assuming the implementations aren't relying on the reversed order internally, it might be good to eliminate all of the reversing in the revset engine since it seems easy to make a mistake and forget to reverse the order. There are 7 implementations already, so doing the reversing at the top would give better confidence that it's being done consistently. This isn't super important obviously, but it might be good for code health.

yuja commented 4 months ago

I think we should clearly label one of these orderings as "reverse topological order" to avoid confusion.

+1

In DVCS context, I expect (unqualified) "topological order" is reversed, but explicit "reverse topological order"/"forward topological order" should be better.

I think it might also be better to add a fn iter_topo to the Revset trait and implement fn iter by reversing the iterator returned by iter_topo.

The underlying data model has self->parents links, so it's no go to reverse the source ordering.

FWIW, I think Index::topo_order() can be removed. There's only one caller, and it can be converted to a revset.

emesterhazy commented 4 months ago

Thanks Yuya ~

In DVCS context, I expect (unqualified) "topological order" is reversed, but explicit "reverse topological order"/"forward topological order" should be better.

So just to be clear, you're saying that "topological order" should put the children before the parents, where the graph has edges from the children to the parents. I guess that makes sense because the parent commits don't know about their future children when they're written, so it's natural to think of the edges being from the children to the parents, even if the resulting ordering is reverse chronological.

The underlying data model has self->parents links, so it's no go to reverse the source ordering.

Got it. The thing that confused me is that it looks like we're reversing the ordering in the various revset implementations, but maybe that is to build the underlying data model as you say.

FWIW, I think Index::topo_order() can be removed. There's only one caller, and it can be converted to a revset.

That's probably a good idea as well then in addition to clarifying what topo order means for Revset::iter.

yuja commented 4 months ago

So just to be clear, you're saying that "topological order" should put the children before the parents, where the graph has edges from the children to the parents.

My feeling is that:

FWIW, log --reversed means "forward topological order". :)

martinvonz commented 4 months ago

So just to be clear, you're saying that "topological order" should put the children before the parents, where the graph has edges from the children to the parents.

My feeling is that:

  • "topological order": maybe children before the parents?, but look in the implementation to be sure
  • "reverse topological order": children before the parents (= natural order in DVCS context)
  • "forward topological order": parents before the children (= reversed order implementation-wise)

FWIW, log --reversed means "forward topological order". :)

OTOH, the default jj log order seems to be the regular (forward) order according to https://en.wikipedia.org/wiki/Topological_sorting. So maybe we should just follow that definition consistently?

yuja commented 4 months ago

Perhaps, things are confusing because roots and heads are defined differently?

martinvonz commented 4 months ago

Perhaps, things are confusing because roots and heads are defined differently?

Hmm, good point. I'm not sure I had realized that there's a standard definition of those terms.

emesterhazy commented 4 months ago

I am a little confused by this discussion. Topological sort is defined for a directed graph, and I'm not sure certain which commit (parent or child) you are saying is the source and which is the target.

Intuitively, I would expect a DAG where the parent commits have edges pointing to the children, and the repository root is the root of the DAG. In this conception the source of an edge is the parent and the target is the child (Parent -> Child).

However, our data structures work the opposite way where child commits have pointers to their parents, and in that case there may be many roots of the DAG and only one head, which is the repository root. In this conception the source of an edge is the child and the target is the parent (Child -> Parent).

I think we can pick either. Just because the Commit struct doesn't have pointers to its children doesn't mean that there isn't an edge in the graph conceptually. For what it's worth, I think that defining the DAG to have a single root (the repo root) with edges from parents to children is most intuitive. But maybe there is precedence in the DVCS world for it to be defined as the inverse of that.

Which of the two makes the most sense to you?

martinvonz commented 4 months ago

I think we can pick either. Just because the Commit struct doesn't have pointers to its children doesn't mean that there isn't an edge in the graph conceptually.

That SGTM.

jj log --reversed is then a bit confusing, but probably just a tiny bit because it doesn't say that it's reverse topological order. Speaking of that maybe jj log should specify the order it prints nodes (i.e. in reverse topological order if we go with the Parent -> Child definition).

emesterhazy commented 4 months ago

Wouldn't jj log be printed in topological order if go with (Parent -> Child) since the root is at the bottom and all of the lines point upwards towards the children? I guess it depends on your frame of reference (and really, which way the edges point), and I know that normally trees are written with the root at the top and jj log is the opposite.

But at least to me, this looks like the edges point upwards from the root:

jj log -T ""
@
◉
│ ◉
├─╯
◉
╷ ◉
╭─╯
◉
╷ ◉
╷ ◉
╭─╯
◉
│

And this looks like they point downwards:

jj log --reversed -T ""
◉
├─╮
╷ ◉
╷ ◉
◉
├─╮
╷ ◉
◉
├─╮
│ ◉
◉
@

I guess this would be going against convention a bit, but maybe that's ok if it makes the most sense in this context.

joyously commented 4 months ago

This all sounds vague and implementation dependent. If the graph is viewed as the evolution of changes, it's obvious that whatever came first flows toward what it became next. (the arrow of time and dependency) As stated previously, parents can't point to children because the child doesn't exist at the time the parent is created. Showing the most recent first by default makes sense for the most use cases. This also meshes with how the pointers would work (child pointing to parent). The --reverse is logically the opposite of the default (oldest first).

emesterhazy commented 4 months ago

Sorry Joy, can you clarify what you're advocating for? The arrow of time flows from the parent to the child typically, but the arrow of dependency always flows from the child to the parent since you cannot have a commit without a parent other than the special repo root.

The issue is that when we order the graph from child to parent, the repository root is no longer a root of the DAG (node with indegree zero), and is instead a leaf/head. I think that is the fundamental conflict in terminology.

So are you saying that you think the edges should be (child -> parent) or the opposite?

joyously commented 4 months ago

What I'm advocating for is to quit mixing the graph presentation with description of the code. There are implementation details, and that should be in the comments. Outside of that, the graph the user sees should accurately represent the changes as they were made in time (or later positioning). If the "topological order" phrase doesn't make sense, find a different phrase that does. It doesn't really matter how it is implemented, as long as it works and the comments are clear on what it is doing.