qwertie / ecsharp

Home of LoycCore, the LES language of Loyc trees, the Enhanced C# parser, the LeMP macro preprocessor, and the LLLPG parser generator.
http://ecsharp.net

Define a binary format for Loyc trees #17

Open qwertie opened 8 years ago

qwertie commented 8 years ago

To redirect discussion from #16

jonathanvdc commented 8 years ago

I'll try to elaborate on my thought process concerning that micro-optimization, which is, arguably, a bit of a wart on the file format.

It's fair to ask if it's the "right" decision to optimize for cases like f(1, 2); f(3, 4); f(5, 6) in which f is repeated, but not optimize for cases like f(1, null); g(x, null); h(g(), null) in which an argument is repeated (in BLT, f is deduplicated while null is not). To be sure, it's not a bad decision; it just makes me wonder if there is a better approach that could optimize more cases.
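To illustrate the kind of "better approach" I have in mind (a sketch only, not how BLT actually works): a single constant pool that interns every atom it sees - call targets and argument literals alike - so that both the repeated f and the repeated null would be written once and then referenced by index.

using System.Collections.Generic;

class ConstantPool
{
    static readonly object Null = new object();   // Dictionary keys cannot be null
    readonly Dictionary<object, int> _index = new Dictionary<object, int>();
    public readonly List<object> Items = new List<object>();

    // Returns the pool index of `atom`, adding it to the pool on first sight.
    public int Intern(object atom)
    {
        object key = atom ?? Null;
        int i;
        if (_index.TryGetValue(key, out i))
            return i;                 // seen before: reuse the existing index
        _index.Add(key, i = Items.Count);
        Items.Add(atom);
        return i;
    }
}

An encoder built on top of something like this would write the pool once up front and then refer to heads and arguments by index; the cost is an extra dictionary lookup per atom at write time, which is probably acceptable even for throwaway files.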

I absolutely agree that the snazzy macro-engine you seem to advocate could reduce file size further. I considered doing something similar when I was designing BLT, but I didn't implement those plans, because these things typically place a heavy burden on the file writer, both in terms of implementation costs and run-time costs.

Specifically, inferring macros (especially non-nullary macros) from a tree-based data-structure of arbitrary depth may turn out to be a non-trivial task. And as you pointed out in your example, poorly executed optimizations may actually turn out to be pessimizations. I've never implemented anything like this before, but, to me, "getting it right" seems like a fairly CPU-intensive process.

And that's perfectly fine if the file - which the writer has so painstakingly constructed - will be used, copied, or downloaded lots of times. It's less than fine, though, if you just want a bunch of temporary files that are only going to be used once. I like to think that the micron standard library exemplifies both scenarios. (micron is a pure functional language that some friends and I created over the course of last semester. Its compiler - muc - is Flame-based, as you might expect.) I'll copy-paste the stdlib build script here:

# Builds the micron standard library
# This requires:
#   1) dsc - the D# compiler (https://github.com/jonathanvdc/Flame)
#   2) muc - the micron compiler

# First, compile the primops, primio and primlist modules
dsc primops.ds -platform ir -runtime clr -o bin/primops.flo -repeat-command $@
dsc primio.ds -platform ir -runtime clr -o bin/primio.flo -repeat-command $@
dsc primlist.ds -platform ir -runtime clr -o bin/primlist.flo -repeat-command $@
# Then, compile the stdlib module, and link it with the primitive modules.
muc stdlib.mu bin/primops.flo bin/primio.flo bin/primlist.flo -platform ir -runtime clr -flower-lambda -o bin/stdlib.flo -repeat-command $@

The resulting bin/stdlib.flo file can then be statically linked with an arbitrary micron program by calling muc program.mu <path>/stdlib/bin/stdlib.flo -platform clr. At this point, the following conflicting use-cases arise:

  1. The intermediate primops/primio/primlist modules are temporary files that get rebuilt constantly, so for them write speed is what matters most.
  2. bin/stdlib.flo is built once and then read by every program that links against it, so for it file size and read speed matter more.

My reasoning for the call-id optimization was that most call nodes are calls to constant identifiers. In fact, all call nodes in Flame IR are of this form. So I figured that adding a call-id template would compress files, without involving any complicated preprocessing steps. This works for both scenarios, as it reduces file size without sacrificing write performance.
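In pseudo-C#, the writer-side logic is roughly the following. (This is a sketch with made-up opcodes, not BLT's actual wire format; the LNode members are real, everything else is illustrative.)

using System.Collections.Generic;
using System.IO;
using Loyc;          // Symbol
using Loyc.Syntax;   // LNode

// Opcode values are invented for this sketch; they are not BLT's real encoding.
enum Op : byte { Id = 0, Literal = 1, Call = 2, CallId = 3 }

class CallIdSketchWriter
{
    readonly BinaryWriter _w;
    readonly Dictionary<Symbol, int> _symbols = new Dictionary<Symbol, int>();

    public CallIdSketchWriter(BinaryWriter w) { _w = w; }

    int Intern(Symbol name)
    {
        int i;
        if (!_symbols.TryGetValue(name, out i))
            _symbols.Add(name, i = _symbols.Count);
        return i;
    }

    public void Write(LNode node)
    {
        if (node.IsId) {
            _w.Write((byte)Op.Id);
            _w.Write(Intern(node.Name));
        } else if (node.IsLiteral) {
            _w.Write((byte)Op.Literal);
            // crude: the sketch serializes literals as strings
            _w.Write(node.Value == null ? "null" : node.Value.ToString());
        } else if (node.Target.IsId) {
            // The common case: a call to a constant identifier is written as a
            // compact "call-id" record (identifier index + argument count).
            _w.Write((byte)Op.CallId);
            _w.Write(Intern(node.Target.Name));
            _w.Write(node.ArgCount);
        } else {
            // The general case: the call target is itself an arbitrary node.
            _w.Write((byte)Op.Call);
            Write(node.Target);
            _w.Write(node.ArgCount);
        }
        foreach (var arg in node.Args)
            Write(arg);
    }
}

Since the test is just "is the target a bare identifier?", the writer stays a single pass with no lookahead or tree analysis, which is why it doesn't hurt the write-once, throw-away scenario.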

> Also, Wasm discussions make me wonder whether the binary format should be canonical and "multi-level", i.e. have a multi-stage encode/decode pipeline.

The multi-layered approach Wasm seems to be taking, in an effort to address what is essentially the same problem, looks promising. I'm not fundamentally opposed to integrating something similar in the "next-gen" binary format for Loyc trees. I do have the following concerns, though:

> This makes sense to me, since their level 0 is the "uncompressed" binary form, and canonicality could allow "directly" comparing binaries produced from different sources.

Yeah, but that won't stop anyone from representing semantically equivalent information as syntactically distinct Loyc trees. For Wasm, the implication is that binaries generated by clang/binaryen and gcc (when it gains Wasm support) will appear to be distinct, even though they do the same thing.

> I'm curious, what do you do for a living?

Currently studying for my bachelor's degree in computer science.

> How did you fix your problem (building on Linux)?

A post-build step included some rem commands. bash didn't know what to do with them, and xbuild did a poor job of explaining that to me. Everything worked fine once the rems had been banished from the realm.


How about the following layout for a binary Loyc tree encoding?

  1. Magic number
  2. String table
  3. Template+macro table (without call-id templates)
  4. Contents

I suppose the advantages of this layout are that it's amenable to efficient, single-pass reads, and it makes layer 0 equivalent to not defining any macros at all (and possibly ordering strings and templates, too). A range of optimizers could later be designed, each making a different write-time/file-size trade-off. In fact, I'm pretty sure the current call-id optimization could be re-implemented as a number of macros instead of call-id templates.
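To make the single-pass claim concrete, a reader for this layout could look something like the sketch below. The magic number, the fixed-width length prefixes, and the way templates/macros are stored are all placeholders; the point is only that each section can be consumed strictly in order.

using System.IO;

class ProposedLayoutReader
{
    const uint Magic = 0x4C4F5943; // ASCII codes for 'L','O','Y','C' -- a placeholder, not a spec value

    public static void Read(BinaryReader r)
    {
        // 1. Magic number
        if (r.ReadUInt32() != Magic)
            throw new InvalidDataException("not a binary Loyc tree file");

        // 2. String table
        var strings = new string[r.ReadInt32()];
        for (int i = 0; i < strings.Length; i++)
            strings[i] = r.ReadString();

        // 3. Template + macro table. If macros may only refer to *earlier*
        //    entries, this stays a single forward pass and cycles are impossible.
        var macros = new byte[r.ReadInt32()][];
        for (int i = 0; i < macros.Length; i++)
            macros[i] = r.ReadBytes(r.ReadInt32());

        // 4. Contents: a stream of node records that reference the two tables
        //    above (node decoding omitted from this sketch).
    }
}

In this reading, a "layer 0" file is simply one whose macro table is empty, so the cheap-to-encode case falls out for free.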

Oh, should we allow macros to use other (previously defined) macros? This could come in handy if I actually do decide that I'd like to re-implement the call-id optimization.

qwertie commented 8 years ago

I think they're trying to design Wasm so that the layers are pipelined, so that there are not multiple passes over the data. If each layer has a simple decoder, the two layers will not be slower than one big layer. Wasm will probably be severely biased to favor decode speed (and small file size) over encode speed (since decoding tends to be done far more often) but layer 0 alone will be easy to encode, so that in cases where encoding is as common as decoding, you could leave out layer 1 and get high-speed encoding. On top of layers 0 and 1, a conventional compression algorithm will be added (called layer 2, I think; compressed size is super important to them because they want to convince the world that Wasm is better than compressed & minified asm.js).

Hmm. I wish I had more insider knowledge of Wasm. It's been hard to learn about it because, well, there aren't any good overviews of Wasm, and so far I haven't seen a single real-world-but-relatively-simple Wasm example. All that seems to exist is unit tests on the one hand, and massive auto-generated modules produced by the likes of Emscripten on the other - I tried a trivial example in Emscripten and got a monstrous 9000-line output file, and wtf, the devs imply there's no way to get smaller output.

I can't be sure without knowing more, but maybe Wasm will be defined in such a way that its file format could be adapted to store Loyc trees instead of Wasm trees with minor modifications. If that were the case, then there would be a chance of merging Wasm and binary Loyc trees into two closely related formats - yeah, they're not inclined to pay attention to a small fry like me, but it's worth considering because

  1. Wasm is potentially the basis for a new-and-improved Loyc project, and
  2. Being a part of something as important as Wasm is very exciting.

Setting Wasm aside, I gave some thought yesterday to how I might design a binary Loyc tree format that was simple to decode but potentially highly compressible... I haven't worked out a satisfying solution yet. So for now I'll just say, I think macros re-using older macros sounds promising.

What am I looking at in those *.ds files? EDIT: Oh I see, ds = D#. I wonder how that's going...

jonathanvdc commented 8 years ago

Oops. Seems like it's been a few days. Sorry about that. I've been keeping busy lately: working simultaneously on proper SSA-based -O3 optimizations for Flame, and compare-test: a simple testing program that runs command-line programs, and then compares their output to each other, and/or to some reference output.

Well, I absolutely agree that the whole "layer 0/layer 1" approach has a number of theoretical advantages. However, I'm also under the (possibly false) impression that layer 0 and layer 1 would actually be alternatives, because - like layer 0 and unlike layer 2 - layer 1 needs to know about the fine details of a Wasm module's layout to perform structural compression. This implies that a layer 1 decoder returns a tree-based structure (it has to construct a tree for macro application purposes anyway, right?), just like the layer 0 decoder. In terms of function types, 0 + 2 and 1 + 2 layer compositions make perfect sense, but layer 0 + 1 compositions don't:

// No compression
readLayer0 : byte list -> Tree
// Structural compression
readLayer1 : byte list -> Tree
// Generic compression (zip)
readLayer2 : byte list -> byte list

// Layer 0 + 2 (zipped Wasm)
readLayer2 >> readLayer0 : byte list -> Tree
// Layer 1 + 2 (zipped and structurally compressed Wasm)
readLayer2 >> readLayer1 : byte list -> Tree

readLayer1 >> readLayer0 : ???

Now, I agree that the layered approach has its merits, because layer 1 + 2 files would be ultra-compressed, and layer 0 files would be really fast to encode. But wouldn't a poorly compressed layer 1 file be more or less equally easy to encode? So why keep the extra layer 0 around, then? Sure, layer 0 files may be slightly easier to encode (though I expect coding a naive layer 1 writer to be a more or less manageable task), but this comes at the steep price of forcing Wasm implementors/toolchain developers to maintain two separate parsers for (slightly) incompatible formats that encode the same type of data.

Or, maybe I just don't fully understand the inner workings of layers 0 and 1, and how they interact. Do you, by any chance, happen to know of a prototype that features both layers? I've looked at js-astcompressor-prototype, which seems very layer 1-ish - and, by the looks of it, is bypassing layer 0 entirely. v8-native-prototype and the binary encoding design document are layer 0-ish, but neither seem to specify exactly how a hypothetical layer 1 actually works.

Wasm also seems to be moving toward defining a fairly rigid set of opcodes. Whatever happened to the idea of extensible "opcode tables?" Doesn't this new trend compromise post-MVP Wasm a little bit? What happens when the Wasm group decides to add more and more operators? I'd hate to see x86-worthy opcode lengths in a portable assembly language that's supposed to last "forever."


(A subset of) the extensible binary encoding that kg initially proposed seems like something that could be used to encode Loyc trees. I'm not so sure about the format detailed in the current binary encoding design document, though. Granted, Wasm is a really cool project, but I'm not sure if it's evolving in the right direction for a Wasm/Loyc binary format synthesis to happen.


Can't say that I've ever used Emscripten before, despite the fact that it looks like an awesome project. What you're describing seems like a classic dead function elimination problem, though that may not be as trivial as it sounds, depending on the circumstances.


Oh, and D# is doing fine, I guess. I've mostly been working on Flame's middle-end lately. The D# front-end may be in need of a rewrite at some point to streamline the language, refine diagnostics, and catch up with improvements in the middle-end. But that's probably going to be a lot of work. So, for now, it'll just have to limp along.


Implementing a macro-based binary Loyc tree format sounds like a fun challenge. By the way, do you think it's worthwhile to take things like billion laughs attacks into account when designing such a format?
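For concreteness, here's why I ask (a toy model, not a proposed encoding): even if macro definitions may only reference earlier macros, so there are no cycles, a 31-entry table can still expand to a ~10^9-node tree, just like the XML entity version of the attack.

using System;
using System.Collections.Generic;

class BillionLaughsSketch
{
    static void Main()
    {
        // Toy macro table: each entry is a token list; -1 means "expand the
        // previous macro here", any other value is a plain leaf node.
        var table = new List<int[]>();
        table.Add(new[] { 0 });             // m0 = a single leaf ("lol")
        for (int i = 1; i <= 30; i++)
            table.Add(new[] { -1, -1 });    // m_i = m_{i-1} m_{i-1}

        // Compute the expanded size without materializing the tree.
        var size = new long[table.Count];
        for (int i = 0; i < table.Count; i++)
        {
            long n = 0;
            foreach (int tok in table[i])
                n += tok == -1 ? size[i - 1] : 1;
            size[i] = n;
        }
        Console.WriteLine(size[30]);        // 1073741824 leaves from 31 tiny entries
    }
}

That makes me think a decoder would want some kind of expansion budget, or at least a cap on output size.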

qwertie commented 8 years ago

Hi Jonathan, I too am sorry to have taken so long to respond. But at least for part of that time I was productive, as I finally made that pattern-matching macro for LeMP. I think I mentioned that I had been planning to implement this for months but was blocked by a bug in LLLPG. I have not fixed the bug, and the EC# parser is still slightly broken, but it turns out that I could ... sort of ... use the bug in a certain way so that most source code still compiles successfully. I hope to write an article about it today, but in the meantime one can see how it works by looking at the source code.

In Wasm, layer 1 indeed knows about the fine details of a Wasm module's layout, but the plan is that it will produce layer 0 code as a byte stream. Perhaps you're right that going directly from a layer-1 stream to a data structure would be more efficient, but that can't be how layer 1 is standardized (since Wasm's internal data structure is not standardized, it can't serve as layer 1's output format). So... point taken.

And I probably won't have much time to think more about a Loyc binary format soon, as I'll be working one last time to promote LeMP and then ... let's see,

Whew! Lots of work. So... billion laughs attack? It seems natural to design a format in which such an attack is possible - ideally the runtime environment could defend against it.

I've always thought .NET should provide better sandboxing support, and in particular, sandboxing of memory allocation and opt-in sandboxing, i.e. code that says "I want to live in a sandbox". Some people at MS seem to think about sandboxing the wrong way: they see it as a restriction that reduces what your code can do, one that you should be happy to be rid of. (Example: I reported a vulnerability in dynamic methods - if my code itself has full permissions and bypasses CIL verification, then any dynamic methods I produce also bypass CIL verification, and this was undocumented. I explained that this was a security vulnerability as well as an annoyance, but they seemed unconcerned.) I see sandboxing as a great feature that allows you to process unknown, possibly malicious data without worrying that a bug in your code will allow the malicious data to do evil deeds.

In the case of a BLT or XML decoder, if the decoder can put a limit on its own memory use, the problem is contained. It should be able to ask its environment to "throw an exception if I allocate more than 1000x the file size!" Importantly, if this is a feature of a sandbox rather than the parser, then the code that invokes the parser can use it even if the parser itself doesn't defend against the problem. Of course, this is not possible in .NET.
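Absent real sandbox support, the best a decoder can do today is approximate the idea itself - something like the following sketch, where every allocation is charged against a budget derived from the input size. (The class and the numbers here are just for illustration.)

using System.IO;

class DecodeBudget
{
    long _remaining;

    // e.g. new DecodeBudget(fileSize) means "at most 1000x the file size".
    public DecodeBudget(long fileSize, int factor = 1000)
    {
        _remaining = fileSize * factor;
    }

    // The decoder calls this before materializing `bytes` worth of nodes/strings.
    public void Charge(long bytes)
    {
        _remaining -= bytes;
        if (_remaining < 0)
            throw new InvalidDataException(
                "decoded output would exceed the allowed expansion of the input");
    }
}

The trouble, of course, is that this only works if the decoder remembers to call Charge everywhere, which is exactly the discipline a real allocation sandbox would make unnecessary.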

qwertie commented 8 years ago

Ping. Any objection if I rename RVList<LNode> to VList<LNode>? It's bad enough that there is a special list type... but I dread explaining its name. (May as well ping here, you're the only confirmed user) EDIT: I think I should also rename LineAndPos to LineAndCol.

jonathanvdc commented 8 years ago

Those name changes seem harmless. LineAndCol is also more intuitive than LineAndPos in my opinion. So no objection here.

Also, wow. Seems like you've got a lot on your plate. Fascinating stuff, though. Pattern matching in EC# as a macro looks like an awesome feature. I glanced at your MPL report, too. It does appear... interesting, to say the least. Printing C++ code for the parser sounds cool, but perhaps a hand-written parser would be more useful for readability and educational purposes?

Oh. I've recently created a Wasm back-end for Flame (-platform wasm). I got i32 scalars to work, and I've also implemented some basic struct support. This example compiles right now. (I have yet to emit a memory size specification and to figure out how to insert assertions into the generated .wast in a responsible manner, so there are no automated tests at this time.)

What kind of Wasm interoperability do you think JSStats is interested in, by the way? Do you mean JS-Wasm interop, inter-module interop, or is it more of a calling convention thing? I'm actually pretty interested in the latter right now. So far, I've resorted to creating my own: binaryen's implementation stores the stack pointer in linear memory, which implies that accessing a variable on the stack requires not one but two levels of pointer indirection. I prefer to prepend the stack pointer to my functions' parameter lists and hope that it eventually gets register-allocated by the VM. My stack grows upward right now, too, though I may change that later. Thoughts?

qwertie commented 8 years ago

JSStats seems interested in making sure that wasm modules can contain rich metadata, and he is prepared to build something on top of the official wasm spec if the core team isn't willing to work on it. Perhaps he's also interested in cross-language interoperability, but I'm uncertain. I know he wrote a LISP (ClojureScript?) project related to Wasm, but now I cannot find it on GitHub.

Yup, there are now macros to make algebraic data types and do pattern-matching on them in LeMP. Even more interesting, I designed pattern-matching to be useful even on standard POCOs that are not designed for pattern matching. EDIT: Now I just have to write the article about that. But first I'm writing an article about generating C# code and pattern-matching on C# code.

Sending SP would make a lot of sense if you had to manage it manually, but the majority of functions don't need an explicit stack pointer, since the hidden, real stack (with non-addressable local variables and the return address) is all they need. IIRC, storing an auxiliary SP in linear memory is the standard approach. Stacks that grow upward are better theoretically, btw. The x86 downward-growing stack is a design flaw that makes buffer overflow vulnerabilities much easier to exploit, although this issue does not really apply to wasm.

Let me know when you have a compiler that can produce small, runnable wasm modules. Do you have a blog? It would be nice to read an outsider's perspective about how to use the wasm tools. Specifically I would like to know the following: what does a simple *.wast file look like, how do you transform it into a binary, how do you run it, and is it possible yet to call wasm from Javascript?