qwertie / ecsharp

Home of LoycCore, the LES language of Loyc trees, the Enhanced C# parser, the LeMP macro preprocessor, and the LLLPG parser generator.
http://ecsharp.net

JSON encoding of Loyc trees #104

Open qwertie opened 4 years ago

qwertie commented 4 years ago

@vladimir-vg mentioned he wanted to store syntax as "JSON in text file... file tree in Git or tree in IPFS". I'm not sure about those last two, but it would make sense to standardize a JSON representation of Loyc trees, so I'm sketching out a proposal. This proposal will mainly take the form of a series of examples showing how a given bit of LES3 code is represented in JSON.

Identifiers

| LES3 | JSON | Comment |
| --- | --- | --- |
| Hello | "Hello" | Strings = identifiers in JSON |
| `` | "" | The empty identifier is the empty string |
| `\t\0\n\u1234` | "\t\u0000\n\u1234" | Escape sequences are largely the same (JSON has no \0 escape, so NUL is written \u0000) |
| `\u01F4A9.\u10FFFF` | "\uD83D\uDCA9.\uDBFF\uDFFF" | Astral characters are surrogate pairs in JSON |
| `\xFF.\uD800` | "\uDCFF.\uD800" | Invalid UTF-8 bytes are transliterated to 0xDCxx characters. High surrogates (0xD800..0xDBFF) are left alone. |
| # | "+#" | Single-character strings with an ASCII code of 64 or less are reserved for special purposes. Use a + prefix (#98) to define a single-character identifier with one of these values. |
| #if | "#if" | This rule does not affect multi-char identifiers |
| `'+` | "'+" | This rule does not affect normal operators |
| _ | "_" | This rule does not affect normal identifiers such as _ (ASCII 95) |
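
As a rough illustration of the `+` prefix rule in the table above, here is a minimal Python sketch. The function names `encode_identifier` and `decode_identifier` are made up for this example and are not part of any Loyc API. The proposal does not say how a two-character identifier that itself looks like `+#` should be stored, so the decoder below simply assumes such strings are always escaped single-character identifiers.

```python
import json

def encode_identifier(name: str) -> str:
    """Return the JSON string for a Loyc identifier (hypothetical helper)."""
    # Single-character names with a code point of 64 or less are reserved,
    # so they are stored with a "+" prefix.
    if len(name) == 1 and ord(name) <= 64:
        return "+" + name
    return name

def decode_identifier(text: str) -> str:
    """Inverse of encode_identifier (hypothetical helper)."""
    # ASSUMPTION: any two-character string "+c" with c <= ASCII 64 is an
    # escaped single-character identifier.
    if len(text) == 2 and text[0] == "+" and ord(text[1]) <= 64:
        return text[1]
    # A bare reserved character is a special marker, not an identifier.
    if len(text) == 1 and ord(text) <= 64:
        raise ValueError(f"{text!r} is reserved and is not an identifier")
    return text

# Python's json module escapes astral characters as surrogate pairs by default,
# matching the table (the hex digits come out in lowercase).
print(json.dumps(encode_identifier("\U0001F4A9")))  # "\ud83d\udca9"
```

Multi-character names such as `#if` and `'+`, and single characters above ASCII 64 such as `_`, pass through both functions unchanged.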

Literals

| LES3 | JSON | Comment |
| --- | --- | --- |
| x"hi!" | {"x": "hi!"} | In general, literals become objects with one prop; the key is a "type marker" |
| "hi!" | {"": "hi!"} | The empty type marker represents a string |
| `@`"hi!" | {"@": "hi!"} | As usual, unusual type markers are allowed |
| 123 | 123 | JSON number => assume type marker is "_" |
| 123.0 | {"_": "123.0"} | JSON parsers may ignore the difference between "123" and "123.0". If a floating-point number is an integer, it should be stored in string form |
| 1234f | {"_f":"1234"} | The type marker starts with _ for all numbers |
| 1234f | {"_f":1234} | In JSON, the value can also be a number |
| 123 | {"_":123} | As usual, it can be stored as a key-value pair instead |
| true | true | True and false are themselves (type marker bool) |
| true | {"bool":"true"} | Same thing in cumbersome form |
| null | null | Null is itself |
| null | {"null":""} | Null in cumbersome form |
| json"{\"x\":123}" | {"json":{ "x": 123 }} | Special case: object as JSON string |
| json"[\"x\", 123]" | {"json":["x", 123]} | Special case: array as JSON string |
| json"{\"x\":123}" | {"json":"{\"x\":123}"} | Using special cases is optional |
| json"{x:123}" | {"json":"{x:123}"} | This cannot be stored in object form |

Note that general JSON objects like { "x":1, "y":2 } have no interpretation above, and serve as an indicator that the JSON file does not represent a Loyc tree.
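
To make the literal rules concrete, here is a small Python sketch of how a reader might map an already-parsed JSON value to a (type marker, value) pair. The name `decode_literal` is hypothetical. The sketch passes values through without validating them, and it assumes that the `json` special case should be re-serialized in compact form, which the proposal does not spell out.

```python
import json

def decode_literal(value):
    """Return (type_marker, value) for a JSON value that encodes a literal,
    or None if the value is an identifier (string) or a call (array)."""
    if value is None:
        return ("null", "")
    if isinstance(value, bool):          # test bool before int: bool is an int subclass
        return ("bool", "true" if value else "false")
    if isinstance(value, (int, float)):
        return ("_", value)              # bare JSON number => type marker "_"
    if isinstance(value, dict):
        if len(value) != 1:
            # General objects such as {"x":1, "y":2} have no interpretation.
            raise ValueError("JSON object does not represent a Loyc literal")
        marker, payload = next(iter(value.items()))
        if marker == "json" and isinstance(payload, (dict, list)):
            # Special case: the payload stands for its own JSON text.
            # ASSUMPTION: compact separators; original whitespace is not preserved.
            payload = json.dumps(payload, separators=(",", ":"))
        return (marker, payload)
    return None

print(decode_literal(json.loads('{"json": {"x": 123}}')))  # ('json', '{"x":123}')
```

Because whitespace inside the `json` special case is lost, a writer that needs byte-exact round-tripping would presumably use the plain string form instead.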

Calls

| LES3 | JSON | Comment |
| --- | --- | --- |
| foo() | ["foo"] | Calls are arrays |
| 1234(z) | [1234, "z"] | As usual, literals can be called |
| x"hi!"(z) | [{"x":"hi!"}, "z"] | As usual, literals can be called |
| foo(x, 2, null) | ["foo", "x", 2, null] | Call with 3 arguments |
| x + 2 | ["'+", "x", 2] | As usual, operators are identifiers with an apostrophe prefix |
| { } | ["'{}"] | As usual, braced block is a call to '{} |
| #foo(42) | ["#foo", 42] | As usual, there's nothing special about # |
| .foo 42 | ["#foo", 42] | Remember, LES3's dot-notation means # |
| { "x": 123 } | ["'{}", ["':", {"":"x"}, 123]] | JSON stored in a Loyc tree is ugly when saved in JSON |
| ["x", 123] | ["'[]", {"":"x"}, 123] | JSON stored in a Loyc tree is ugly when saved in JSON |
| foo(x)(y) | [["foo", "x"], "y"] | As usual, complex targets are possible |
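
One way to sanity-check the call encoding is to print a JSON tree back out in plain prefix notation. The sketch below is only a debugging aid: `to_prefix` is a hypothetical name, its output is not guaranteed to be valid LES3 (operators are not un-sugared, identifiers are printed as stored), and it treats every array as an ordinary call, so arrays with a special first element are not given any special handling.

```python
import json

def to_prefix(node):
    """Render a parsed JSON-encoded Loyc tree as target(arg, ...) text."""
    if isinstance(node, list):
        if not node:
            raise ValueError("a call needs a target as its first element")
        target, *args = node
        return to_prefix(target) + "(" + ", ".join(to_prefix(a) for a in args) + ")"
    if isinstance(node, str):
        return node                      # identifier, printed as stored
    if isinstance(node, dict):
        if len(node) != 1:
            raise ValueError("JSON object does not represent a Loyc literal")
        marker, payload = next(iter(node.items()))
        return marker + json.dumps(payload)   # e.g. x"hi!"
    return json.dumps(node)              # numbers, true, false, null

print(to_prefix(json.loads('[["foo", "x"], "y"]')))        # foo(x)(y)
print(to_prefix(json.loads('["\'+", "x", 2]')))            # '+(x, 2)
```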

Attributes

| LES3 | JSON | Comment |
| --- | --- | --- |
| @Foo X | ["@","Foo","X"] | In general, attributes are attached via arrays that start with the magic string "@" |
| @x foo() | ["@","x",["foo"]] | (which, as mentioned before, is not an identifier) |
| @x @y(z) foo | ["@","x",["y", "z"],"foo"] | There can be multiple attributes. The final item is the tree to which the attributes are attached. |
| @123 X | ["@",123,"X"] | As usual, attributes can be any Loyc tree including literals. |
| /*comment*/ X | ["@",["%MLComment","comment"],"X"] | Trivia are attached in the standard way |
| foo | ["@","foo"] | This is legal, but pointless |
| N/A | ["@"] | Meaningless and illegal |
| N/A | "@" | Meaningless and illegal |

Edited Jan 21, 2021: since no one has reported interest in using the JSON encoding, I've changed parts of the proposal without notice. Most notably, backreferences and attributes now use a more compact encoding. Previously, @a @b foo() would be represented as {"@":["a","b"], "":["foo"]}, but now it's ["@","a","b", ["foo"]].
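
For readers who want to see the compact attribute encoding in action, here is a minimal Python sketch that separates the attribute list from the tree it is attached to. `split_attributes` is a hypothetical helper name, and the error handling just mirrors the "illegal" rows of the table.

```python
def split_attributes(node):
    """Return (attributes, target) for a parsed JSON-encoded Loyc tree."""
    if node == "@":
        raise ValueError('a bare "@" string is illegal')
    if isinstance(node, list) and node and node[0] == "@":
        if len(node) < 2:
            raise ValueError('["@"] has no tree to attach attributes to')
        # Everything between the "@" marker and the final item is an attribute;
        # the final item is the tree the attributes belong to.
        return node[1:-1], node[-1]
    return [], node                      # no attribute wrapper at all

attrs, target = split_attributes(["@", "x", ["y", "z"], "foo"])
print(attrs, target)   # [['x'... see note below
```

The `print` call is only there to show usage; for the `@x @y(z) foo` example, `attrs` holds `"x"` and `["y", "z"]`, and `target` is `"foo"`.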

qwertie commented 4 years ago

In general, Loyc trees are DAGs (directed acyclic graphs), so I would also propose the following JSON representation for subtree definitions and backreferences.

| LES3 | JSON | Comment |
| --- | --- | --- |
| @.id tree(a.b.c) | ["*","id", ["tree",["'.",["'.","a","b"],"c"]]] | A subtree that also has a name. |
| @@id | ["*","id"] | Backreference to a previously defined subtree. |
| @.id2 @x tree2() | ["*","id2", ["@","x", ["tree2"]]] | Define a subtree with an attribute. |
| @x @@id2 | ["@","x", ["*","id2"]] | Refer to a previously defined subtree and attach an attribute. |
| `*`(x) | ["+*","x"] | As mentioned above, certain one-character identifiers such as * must have + prepended to avoid ambiguity. |

Edited Jan 2021 to make the notation more brief. The representation of @.id tree(a.b) changed from {"@@":"id","":["tree",["'.","a","b"]]} to ["*","id", ["tree",["'.","a","b"]]]; the representation of @@id changed from {"@@":"id","":[]} to ["*","id"]. The name * is intended to remind you of pointer notation in C/C#/Rust, since shared subtrees involve duplicate pointers.

Just as in LES3, a subtree definition must appear lexically before any references to it.
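
The definition-before-reference rule means a reader can resolve backreferences in a single left-to-right pass. The following Python sketch illustrates that idea under a few assumptions: `resolve` is a hypothetical name, ids are assumed to be strings, and the proposal does not say what happens when an id is defined twice, so this version simply lets the later definition win.

```python
def resolve(node, defs=None):
    """Replace ["*", id, tree] definitions and ["*", id] backreferences
    with the trees themselves, visiting children in lexical order."""
    if defs is None:
        defs = {}
    if isinstance(node, list):
        if node and node[0] == "*":
            if len(node) == 3:           # definition: ["*", id, tree]
                tree = resolve(node[2], defs)
                # Registered only after the body is resolved, so a subtree
                # cannot refer to itself (the graph stays acyclic).
                defs[node[1]] = tree
                return tree
            if len(node) == 2:           # backreference: ["*", id]
                if node[1] not in defs:
                    raise KeyError(f"subtree {node[1]!r} used before its definition")
                return defs[node[1]]     # same object => sharing is preserved
            raise ValueError('malformed "*" array')
        return [resolve(child, defs) for child in node]
    return node                          # identifiers and literals are left as-is

doc = ["foo", ["*", "id", ["tree", "a"]], ["*", "id"]]
out = resolve(doc)
print(out[1] is out[2])   # True: both arguments are the same shared subtree
```

Literal objects are not descended into, so an array inside a `{"json": [...]}` literal is never mistaken for a backreference, and the identifier `"+*"` is untouched because only the bare `"*"` string is treated as the marker.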