
LES3: Faithful storage of acyclic graphs? #75

Closed: qwertie closed this issue 4 years ago

qwertie commented 5 years ago

Loyc trees aren't really trees, they are directed acyclic graphs (DAGs). Should the text format be able to represent such graphs faithfully? I recall defining a mechanism with the syntax (@@ x + y) to mark an expression in which the parentheses perform grouping but do not produce trivia for the parentheses. One simple implementation would use this as a hint to deduplicate shared subtrees, like so:

(@@ x+y) blah blah(z, (@@ x+y))

The parser would have to add trees marked with @@ to a hashtable to detect duplication. Note that source code position information would be discarded for one of the subtrees.
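
Roughly, the parser side could look like the following C# sketch. Node here is a hypothetical stand-in for LNode, and a real implementation would need an equality test that ignores trivia as well as source positions:

using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for LNode: just a name plus child nodes.
sealed class Node
{
    public readonly string Name;
    public readonly IReadOnlyList<Node> Args;
    public Node(string name, params Node[] args) { Name = name; Args = args; }
}

// Structural (deep) equality that ignores source position information.
sealed class StructuralComparer : IEqualityComparer<Node>
{
    public bool Equals(Node a, Node b) =>
        ReferenceEquals(a, b) ||
        (a != null && b != null && a.Name == b.Name
                   && a.Args.SequenceEqual(b.Args, this));
    public int GetHashCode(Node n) =>
        n.Args.Aggregate(n.Name.GetHashCode(), (h, c) => h * 31 + GetHashCode(c));
}

// The parser adds each subtree marked with @@ to this table; a repeated
// subtree comes back as a reference to the first instance, so the
// in-memory result is a DAG rather than a tree (and the duplicate's
// source positions are discarded, as noted above).
sealed class DedupTable
{
    readonly Dictionary<Node, Node> _seen = new(new StructuralComparer());
    public Node Intern(Node marked) =>
        _seen.TryGetValue(marked, out var first) ? first : _seen[marked] = marked;
}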

To scale this up to large shared subtrees, we can introduce unique ID codes with a syntax like this:

(@@hash large subtree here) blah blah(z, @@hash)

where hash represents a hashcode. I thought of requiring a valid identifier after @@ and using base64 codes to represent IDs, with A-Za-z0-9, # and _ as the 64 digits. But that doesn't yield a normal identifier when the code starts with 0-9. I guess any identifier could be allowed (e.g. @@`123`), and a parser need not care whether it is base64 or not.
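
For illustration, encoding a hashcode with that digit set might look like this (just a sketch - the digit order is an assumption, not a spec):

static class RefIds
{
    // The 64 proposed digits: A-Za-z0-9 plus '#' and '_' (order assumed).
    const string Digits =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789#_";

    // Renders a hashcode as an @@ reference name. If the name happens to
    // start with 0-9 it isn't a normal identifier, so emit the backquoted
    // form (e.g. @@`123x`); the parser doesn't care which form it gets.
    public static string ToRefName(ulong hash)
    {
        var sb = new System.Text.StringBuilder();
        do { sb.Insert(0, Digits[(int)(hash & 63)]); hash >>= 6; } while (hash != 0);
        var id = sb.ToString();
        return char.IsDigit(id[0]) ? "`" + id + "`" : id;
    }
}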

Supporting this feature in the printer would probably reduce performance and it certainly reduces the file's readability, particularly when hashes are used, so it'll need to be configurable in the printer. If the shared trees are very small, this particular syntax is cumbersome, but if we're sticking to ASCII, I don't see an alternative.

// A decidedly cumbersome example
blah1(@@ @(@@ x) foo((@@ x)) )
blah2(@@ @(@@ x) bar((@@ x)) )
jonathanvdc commented 5 years ago

What would the use case for this feature be? Do you want YAML-like references to support things like config DSLs or do you want to more efficiently represent syntax trees by turning them into DAGs?

The reason I'm asking is that I've experimented with the latter, when I was working on a potential successor to my binary Loyc tree format. (I lost interest after a while, though.) My hunch at the time was that Loyc trees could be stored efficiently by first de-duplicating subtrees, effectively turning the Loyc tree into a DAG, and then storing the DAG in a binary format.

I guess this brings me to a different but sort of related point: what are Loyc's rules for node deduplication/duplication? More generally: what constitutes a faithful representation of a Loyc tree? Is discarding trivia okay? Because that's what all my binary formats do. AFAICT, there are no such rules at the moment, but having them would be super useful for defining formats for Loyc trees.

qwertie commented 5 years ago

What would the use case for this feature be?

Well, it's not so much about use case as ideology - I have an ideology that if something is possible, and not overly difficult, it ought to be done/supported. You can see that in my AList data structure, which can do basically everything a B+tree-like data structure can realistically do. And... I've got practically no users for it, but I think it's a great data structure, and if I had a job coding C# I'd be using it often. It's just nice to have flexibility, so that whatever you happen to want to do, you can do it.

Generally I want round-tripping to text to be as faithful as possible (a binary format could probably do even better, by preserving source filenames and ranges, which seems too unwieldy to support in LES, particularly if it's limited to ASCII 32-127).

Admittedly there was a use case that brought this idea to mind: projectional editing. Now, I'm not sure, maybe it only really makes sense if the Loyc tree implementation supports mutability in some way, but I figured that maybe in a projectional editor, the usual "tree" structure becomes a Directed Acyclic Graph, because the editor might have reason to replicate parts of the tree in multiple places to provide more detailed "views".

I don't know exactly how it might work - I haven't given it much thought - but perhaps instead of a traditional "traits" implementation, in which you define a trait function T.f inherited by multiple classes, the function f simply happens to exist in both classes at once, and when you edit one, you're also editing the other. And maybe in such an editor, the base class and interface "declarations" actually just contain the definitions themselves.

Deduplication can be used for "compression", as you mentioned. And by the way, once a compressor exists, you can use it as a code quality metric: measure the smallest possible output text size after deduplication, and the smaller the ratio between output size and original code size, the lower the code quality. (Why text size? Because the ability to deduplicate small nodes is irrelevant to code quality, and the proposed LES deduplication syntax is heavyweight enough that it tends to make small nodes bigger after deduplication.)
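
In code form the metric is just a ratio of printed sizes, where both strings would come from the LES printer with deduplication off and on (sketch only; the printer calls are elided):

// Closer to 1.0: little shared structure. Far below 1.0: heavy duplication,
// i.e. lower code quality by this metric.
static double QualityRatio(string originalText, string dedupedText)
    => (double)dedupedText.Length / originalText.Length;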

what are Loyc's rules for node deduplication/duplication?

We kind of make up the rules as we go along, eh? My intuition says that for the binary format, you should by default preserve as much information as possible, so don't deduplicate unless asked (as it would change the structure), and only discard trivia if requested. I would think a binary format is what I would use if I needed the most perfect round-tripping (while LES tries to preserve trivia, it may do so imperfectly.)

qwertie commented 5 years ago

Modification to proposal

As I was exploring an idea this morning I decided that it would be better to have a separate syntax for creating references vs using references so that using a previously-defined reference can be shorter. In the above proposal you could write nonfood(@@foo "f"+"oo"); food((@@foo) + "d") where (@@foo) retrieves the previously defined syntax tree called "foo", but parentheses are required to indicate that a new tree is not being created.

I'd like to modify the proposal so that re-using a reference is easier: @@@ creates a reference and @@ refers back to it: nonfood(@@@foo "f"+"oo"); food(@@foo + "d").

Three @ signs is a bit clumsy, but no obvious alternative (that is neither ambiguous nor ugly) comes to mind at the moment. (One possibility is nonfood(@=foo "f"+"oo"); food(@@foo + "d"), which would be unambiguous because there is no prefix operator called =, but it looks a little odd to me...) Edit: another possibility is @.foo, which is unambiguous since a dot-keyword can't be used in an attribute context. This syntax seems good because I think it's more likely that we'd someday add a prefix-= operator than that we'd want dot-keyword statements as attributes (a dot-keyword expression requires spaces, while attribute-expressions are compact and prohibit spaces). On the other hand, I guess someone might misinterpret @.foo x = y as @(.foo) x = y, or even as @(.foo x = y).

Wait a minute... do we really need SSA?

The inspiration for the above change was that I was looking at Cranelift IR this morning and I thought... wait a minute, isn't this "Static Single Assignment" business actually just making something simple look more complicated than it is?

The Cranelift page has this example:

float average(const float *array, size_t count) {
    double sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += array[i];
    return sum / count;
}

which produces this long hunk of code:

function %average(i32, i32) -> f32 system_v {
    ss0 = explicit_slot 8         ; Stack slot for ``sum``.

ebb1(v0: i32, v1: i32):
    v2 = f64const 0x0.0
    stack_store v2, ss0
    brz v1, ebb3                  ; Handle count == 0.
    v3 = iconst.i32 0
    jump ebb2(v3)

ebb2(v4: i32):
    v5 = imul_imm v4, 4
    v6 = iadd v0, v5
    v7 = load.f32 v6              ; array[i]
    v8 = fpromote.f64 v7
    v9 = stack_load.f64 ss0
    v10 = fadd v8, v9
    stack_store v10, ss0
    v11 = iadd_imm v4, 1
    v12 = icmp ult v11, v1
    brnz v12, ebb2(v11)           ; Loop backedge.
    v13 = stack_load.f64 ss0
    v14 = fcvt_from_uint.f64 v1
    v15 = fdiv v13, v14
    v16 = fdemote.f32 v15
    return v16

ebb3:
    v100 = f32const +NaN
    return v100
}

And I thought, wait a minute, why do we need SSA variables? They can only be assigned once, which actually means they are like nodes in a DAG - they only have one identity, and you can just have pointers to those nodes instead of referring to them by name. As you can see, all of the variable names here are utterly meaningless, so it wouldn't hurt to elide them. So I tried rewriting this code in LESv3 notation based on this DAG proposal (edit: this was misguided, see below). The result is much more compact and, if I just rename a few variables, it's straightforward to figure out what the code does:

@system_v .function $average(i32, i32) -> f32 {
    sum = explicit_slot(8)         // Stack slot for `sum`.

  ebb1(@@@array array:i32, @@@count count:i32):
    stack_store(0.0, sum)
    brz(@@count, ebb3)
    jump(ebb2(0))

  ebb2(@@@i i:i32):
    stack_store(fpromote_f64(load_f32 (@@array iadd (@@i imul_imm 4))) fadd stack_load_f64(sum), sum)
    brnz(icmp_ult(@@@11 @@i iadd_imm 1, @@count), ebb2(@@11))
    return(fdemote_f32((stack_load_f64(sum)) fdiv (fcvt_from_uint_f64(@@count))))

  ebb3:
    return(f"+NaN")
}

Edit: for the reasons stated below, using the new DAG feature for this is the Wrong Thing To Do. A better representation would look more like traditional SSA syntax, but would only introduce SSA variable names for values that are used more than once (or used across block boundaries):

@system_v .function $average(i32, i32) -> f32 {
    sum = explicit_slot(8)         // Stack slot for `sum`.

  ebb1(array:i32, count:i32):
    stack_store(0.0, sum)
    brz(count, ebb3)
    jump(ebb2(0))

  ebb2(i:i32):
    stack_store(fpromote_f64(load_f32 (array iadd (i imul_imm 4))) fadd stack_load_f64(sum), sum)
    brnz(icmp_ult(v11 = i iadd_imm 1, count), ebb2(v11))
    return(fdemote_f32((stack_load_f64(sum)) fdiv (fcvt_from_uint_f64(count))))

  ebb3:
    return(f"+NaN")
}

This is isomorphic to the original code, but the order of operations is implicit. I guess there is a problem with this representation - the full list of instructions is left unstated, and would have to be reconstructed. On the other hand, in standard SSA notation, the direct pointers between variables are left unstated and would perhaps have to be reconstructed (or maybe not - I've never written a backend; maybe the pointers are irrelevant and the variable labels are actually the important thing?)

Anyway, it seems like a win to me. Edit: and it's a bit more readable if the binary operators use punctuation (combo operators like i+ rather than iadd):

@system_v .function $average(i32, i32) -> f32 {
    sum = explicit_slot(8)         // Stack slot for `sum`.

  ebb1(array:i32, count:i32):
    stack_store(0.0, sum)
    brz(count, ebb3)
    jump(ebb2(0))

  ebb2(i:i32):
    stack_store(fpromote_f64(load_f32 (array i+ (i ii* 4))) f+ stack_load_f64(sum), sum)
    brnz(icmp_ult(v11 = i ii+ 1, count), ebb2(v11))
    return(fdemote_f32((stack_load_f64(sum)) f/ (fcvt_from_uint_f64(count))))

  ebb3:
    return(f"+NaN")
}
jonathanvdc commented 5 years ago

Hi!

What you're proposing is interesting and not entirely unprecedented. It looks an awful lot like a sea-of-nodes IR [1], which is actually used in the JVM [2]. LLVM's selection DAG takes a similar approach [3].

I don't have firsthand experience with the sea-of-nodes representation. It's kind of similar to an SSA IR, which I do have some experience with: I'm using SSA IRs in basically all of my new compiler projects, including my Flame rewrite - SSA simplifies almost every optimization you can think of. Case in point: the Flame rewrite's global value numbering (GVN) implementation is just 68 lines of code, including comments, whitespace, using directives and other formulaic cruft [4]. Well, the transform is 68 lines of code; the logic that determines instruction equivalence is more complicated, because there's no one right way to do that.
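
To give a feel for why the transform can be that short, here's a value-numbering sketch over an invented Instr type - this is not Flame's code, and it's the local (single-block) version of the idea; real GVN extends it across blocks:

using System.Collections.Generic;

// One SSA instruction: Def is the (unique) value it defines.
sealed record Instr(string Def, string Op, string[] Args);

static class ValueNumbering
{
    public static List<Instr> Run(IEnumerable<Instr> block)
    {
        var seen = new Dictionary<string, string>(); // expression key -> defining value
        var result = new List<Instr>();
        foreach (var insn in block)
        {
            string key = insn.Op + "(" + string.Join(",", insn.Args) + ")";
            if (seen.TryGetValue(key, out var equivValue))
                // Same trick as Flame: rewrite the redundant instruction
                // as a copy; copy propagation elides it later.
                result.Add(new Instr(insn.Def, "copy", new[] { equivValue }));
            else
            {
                seen[key] = insn.Def;
                result.Add(insn);
            }
        }
        return result;
    }
}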

Anyway, back to the issue at hand. I suspect that almost all SSA optimizations still work for a sea of nodes IR. My reasoning for going with "standard" SSA (with labels) instead of a sea of nodes IR was that labels make it very easy to replace instructions. Take Flame's GVN for example, which performs the following simplification:

z = intrinsic(@arith.add, System::Int32, #(System::Int32, System::Int32))(x, y);
w = intrinsic(@arith.add, System::Int32, #(System::Int32, System::Int32))(x, y);
-->
z = intrinsic(@arith.add, System::Int32, #(System::Int32, System::Int32))(x, y);
w = copy(System::Int32)(z);

So, to change how a value is computed, all you need to do is update the instruction that defines the value. Since other instructions refer to the value, they aren't affected at all. In this case, copy "copies" z's value, but all copies are elided by the copy propagation pass that runs later on, so there's no real overhead associated with a copy. The source code for the substitution performed by Flame's GVN is

insn.Instruction = Instruction.CreateCopy(insn.Instruction.ResultType, equivValue);

Things would be different in a sea-of-nodes IR (or any other IR where instructions directly reference each other, like LLVM IR). To change how a value is computed in such an IR, you'd have to rewrite all instructions that reference the value, because instructions directly refer to other instructions there. That's a lot of work for a single instruction update, and compilers perform lots of instruction updates. For example, the pattern below is quite common in code that manipulates LLVM [5].

new_call = call!(builder, to, args) # Create a new call instruction.
replace_uses!(call, new_call) # Replace all uses of the old call with the new call. 
unsafe_delete!(LLVM.parent(call), call) # Delete the old call.

I imagine that sea-of-nodes IRs might be genuinely useful for instruction selection - more so than a standard SSA IR. Coincidentally, that also explains the design of LLVM's selection DAG. I don't think that a sea-of-nodes IR is enough to stop an instruction selector from decaying into the unholy mess that so typifies code generation logic, but it might help.

qwertie commented 5 years ago

Hmm, I'm new to backend stuff, but it sounds like maybe what is called for is that instead of instructions referring directly to results of other instructions, as in a * b + c where + refers directly to the result of *, the IR should have indirection, so the + node refers to two heap blocks (which we might call "SSA variables") which, in turn, refer to the left and right subexpressions:

class SSAVar {
  IRExpr expr; // mutable
}
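
To spell out the indirection (these names are illustrative, not an existing API), the operand fields of an expression would hold SSAVar cells rather than other expressions:

abstract class IRExpr { }

class BinOp : IRExpr
{
    public string Op;
    public SSAVar Left, Right; // operands are cells, not expressions
}

// To change how a value v is computed, mutate its cell: v.expr = newExpr;
// every instruction that reads v through its SSAVar then sees the new
// definition, with no "replace all uses" pass required.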

Due to lack of experience I may be wrong about the appropriateness of this.

However, this issue is orthogonal to the issue of how the text format looks - the text could still be written as a * b + c rather than v1 = a * b; v2 = v1 + c.

qwertie commented 5 years ago

I gave it some thought and realized that the LESv3 DAG/deduplication system should not be used for this purpose, because it violates my own intuition about how it ought to work.

First of all, if I write (@@@sum a + b) * @@sum, I would expect a compiler to typically evaluate a + b twice (optionally optimizing the code to eliminate one of the evaluations), whereas the implication in my description above is that it must be evaluated only once.

Second, making the semantics implicitly dependent on graph deduplication is horrible in the presence of mutable state. Consider code like this:

    x = explicit_slot(4)
ebb1:
    stack_store(1, x)
    @@@tmp stack_load_i32(x)
    stack_store(2, x)
    return(@@tmp)

This should return 1, not 2, but the LNode for the last line literally just represents return(stack_load_i32(x)). Clearly this is wrong and a Bad Idea, and traditional SSA form makes a lot more sense here:

    x = explicit_slot(4)
ebb1:
    stack_store(1, x)
    tmp = stack_load_i32(x)
    stack_store(2, x)
    return(tmp)

So I will edit my previous post to reflect this.

jonathanvdc commented 5 years ago

Hmm, I'm new to backend stuff, but it sounds like maybe what is called for is that instead of instructions referring directly to results of other instructions, as in a * b + c where + refers directly to the result of *, the IR should have indirection, so the + node refers to two heap blocks (which we might call "SSA variables") which, in turn, refer to the left and right subexpressions

class SSAVar {
  IRExpr expr; // mutable
}

However, this issue is orthogonal to the issue of how the text format looks - the text could still be written as a * b + c rather than v1 = a * b; v2 = v1 + c.

I agree. But how is that any different from having labels, apart from how you print the text format? Because your definition of SSAVar is exactly how you'd represent a labeled SSA instruction in a mutable SSA IR. Or is your point that a standard SSA IR becomes easier to read when printed using your proposed format, even though the in-memory representation of the IR is unchanged?

qwertie commented 5 years ago

Yes, that's my point - it's not about changing the internal representation of SSA, it's that all these meaningless variable names impede readability and that all those assignments lengthen the code, so it would be better to use nested expressions. And traditional assembly languages could work similarly. Like instead of

mov ecx, [edx]
cmp eax, ecx
jge  end_loop

it could be

eax <=> (ecx = *edx)
.jge end_loop

And with LES it would actually be easier to process this because you don't have to write a parser. (You have to linearize the code but that's pretty easy.)
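
Linearization is basically a post-order walk that introduces a temporary for each nested call - roughly like this sketch, where Node is again a hypothetical stand-in for LNode:

using System.Collections.Generic;

sealed class Node
{
    public readonly string Name;
    public readonly IReadOnlyList<Node> Args;
    public Node(string name, params Node[] args) { Name = name; Args = args; }
}

static class Linearizer
{
    static int _counter;

    // Flattens a nested expression, appending one instruction per call and
    // returning the name of the value that holds the result.
    public static string Flatten(Node node, List<string> output)
    {
        if (node.Args.Count == 0)
            return node.Name; // a leaf (variable/register) is already a value
        var argValues = new List<string>();
        foreach (var arg in node.Args)
            argValues.Add(Flatten(arg, output)); // children first (post-order)
        var temp = "v" + (++_counter);
        output.Add(temp + " = " + node.Name + "(" + string.Join(", ", argValues) + ")");
        return temp;
    }
}

// Flattening +(*(a, b), c) - that is, a * b + c - emits:
//     v1 = *(a, b)
//     v2 = +(v1, c)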

jonathanvdc commented 5 years ago

I could definitely see how that could be useful, especially for manual assembly programming. Having a high-level notation for assembly language instructions might really take some of the pain out of low-level programming.

qwertie commented 4 years ago

This has been implemented for v28.0, but only in the parser.