
DSL and a framework for building DSLs upon other languages #173

Open KOLANICH opened 5 years ago

KOLANICH commented 5 years ago

Project description

Background

Let's call a formal language a child language (please correct me if I'm wrong, I have not found any established terminology for what I am going to discuss here) if its parse tree is specified in terms of relations between AST nodes of another formal parent language. So, when you want to parse such a child language, you parse its parent language first, and then process the parsed AST.

There are some live examples (usually DSLs) of such languages:

As you can guess, the hierarchy can be arbitrarily deep (though having more than 2 levels will likely not be practical because of the overhead).

The benefit of such languages is that they inherit the ecosystem of existing tools and libraries built for their parent languages.

Problems

Solution

The proposed solution is inspired by my experience with child languages based on JSON.

DSL

It is clear that the DSL should be similar to JSON Schema, but it cannot extend it, because it is not only about JSON; so it should reason in terms of AST nodes, not types.
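For illustration only (the keys below are invented, not a proposal for the concrete syntax), compare a type-oriented JSON Schema rule with a rule phrased over AST nodes of the parent language:

# Purely illustrative sketch, written as Python literals. "Dictionary", "DictElement"
# and "String" name AST node kinds of the parent language, not JSON types.

# JSON-Schema-style, type-oriented:
type_rule = {"type": "object", "properties": {"id": {"type": "string"}}}

# AST-node-oriented, the way the proposed DSL would reason:
node_rule = {
    "node": "Dictionary",
    "elements": [
        {"node": "DictElement", "key": {"node": "String", "value": "id"}},
    ],
}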

Relevant Technology

Complexity and required time

Complexity

Required time (ETA)

ricardosilvaperes commented 5 years ago

I think there are already implemented language servers for both languages, JSON and YAML.

KOLANICH commented 5 years ago

Yes, they surely can be utilised, it's just that there is no framework for that in them. I mean that not all the languages built above JSON/YAML can be specified as a JSON Schema; we need completions computed by custom code.

If we have a JSON Schema, we need to attach to each property a function computing the lookup, then have the language server check for that function and, if it is present, use it, passing it all the needed info.

I am not sure what the best way to achieve that is, because the new schema should be readable from all the languages, so no functions.

One variant is to encode not functions in the JS sense, but to implement a DSL above YAML/JSON describing the ways to look up the suggestions (usually we want either the content of another property matching some criteria (which can also be encoded in YAML), or a path in some subtree (with adjustable delimiters), or a YAML ref of a specific type, or ...).

Then we can either transform that DSL into the target language or interpret it on the fly.
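For instance, the "content of another property matching some criteria" case could be encoded as plain data and interpreted on the fly. A minimal Python sketch (the descriptor keys and the helper function are invented for illustration, not an existing format):

# Hypothetical lookup descriptor: collect the values of another property found at
# a given path in the already-parsed document tree. It is plain data, so it stays
# readable from any language.
completion_spec = {
    "kind": "property-values",
    "path": ["prods", "*", "id"],
}

def interpret_lookup(spec, document):
    """Walk the parsed document and collect completion candidates."""
    if spec["kind"] != "property-values":
        raise NotImplementedError(spec["kind"])
    nodes = [document]
    for step in spec["path"]:
        nxt = []
        for node in nodes:
            if step == "*" and isinstance(node, list):
                nxt.extend(node)
            elif isinstance(node, dict) and step in node:
                nxt.append(node[step])
        nodes = nxt
    return [n for n in nodes if isinstance(n, str)]

doc = {"prods": [{"id": "a", "ref": "b"}, {"id": "b"}]}
print(interpret_lookup(completion_spec, doc))  # ['a', 'b'], candidates for a `ref:` value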

KOLANICH commented 3 years ago

@ericprud, how well is your jsg aligned with this idea? Does it allow me to rewrite the parser https://github.com/UniGrammar/UniGrammar.py/tree/master/UniGrammar/ownGrammarFormat as a jsongrammar (yes, I know that jsg is in JS; if needed it can be rewritten into another language) and get the generated code doing the same (it may have its own objects for AST nodes, but it must do all the parsing (such as splitting a single dict into multiple AST nodes, nested in each other in the correct order) and validation itself)?

ericprud commented 3 years ago

Let's see if I understand well enough to answer.

If you have an abstract syntax (AS), you can create a greenfield DSL to express it and/or take some generic format (XML, RDF, JSON) and use a format-specific schema language to constrain the structures to those that are sensible in your AS.

For the greenfield DSL, you would specify it in terms of a grammar that defines the set of permissible documents, tied to AS structures (like the beige boxes in http://shex.io/shex-semantics/#shexc) and execute it with a home-grown parser or a parser generator output (yacc, Antlr...).

For the generic formats, your schema language would ideally be expressive enough to adequately define the permissible documents. (http://shex.io/shex-semantics/#shexj performs well in that regard.) If the mapping from the parsed data structure to the AS isn't obvious, you'd probably want to add some glue like the aforementioned beige boxes. I don't know that I've seen that. There are certainly plenty of examples where the mapping is assumed to be obvious, like the mapping from FHIR/XML, FHIR/JSON or FHIR/RDF to the documented model.

When you say "DSLs upon other languages", I think you're talking about what I called the format-specific schema (FSS). For "DSL" (all by itself), I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON. ShExJ doesn't do that. If you write a broken internal reference, JSG won't notice. (IIRC, DTDs offered that validation for XML, and I don't know if XML Schema or RelaxNG validators do.)
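For example (illustrative data, not taken from any of the schemas mentioned above), a purely structural validator sees nothing wrong here even though the reference target does not exist:

# A tree-shaped document whose `ref` points at a non-existent `id`; a schema that
# only constrains types and nesting cannot flag the dangling reference.
doc = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "edges": [{"ref": "c"}],  # broken: there is no node with id "c"
}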

My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar, but I see little hints that give me some sense of what a BNF for the AS would look like. I'd like to think that a parser-generator could consume some JSG and emit that Python, but there are too many unknown unknowns for me to call that a prediction.

Aside: I've always thought it would be cool to have a parser construction kit, where you can reference productions in other grammars to build e.g. a native SVG parser rather than an SVG schema that operated over the output of a native XML parser. I'd hoped to prototype that in the yacker a couple decades ago, but never got to it.

KOLANICH commented 3 years ago

I don't quite understand what you meant at the beginning of the text. I am not very familiar with RDF and ShEx, but from what I have read it seems to be a kind of logic-programming language for creating databases of facts. I haven't got the terminology used there.

I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON.

I don't fully understand what you mean, and taking into account

My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar but I see little hints that give me some sense of what a BNF for the AS would look like.

, let me give you an example. UniGrammar is a transpiler of EBNF-like grammars for different parser generators. The grammars are essentially the same across parser generators; only some nuances differ. UG has its own DSL for grammars, but for the syntax we use YAML. Not quite YAML: JSON would fit too, since when YAML is parsed, it results in the same structure as JSON.

Here is the example:

prods:
  - id: a  # this production name is `a`
    ref: b  # terminal or nonterminal under name `b` is used
    min: 1  # one or more repetitions, in regexp syntax it is `+`
    cap: A  # make this element available in the parsed AST under the name `A`

This item is "flat": id, ref, min and cap are siblings in the YAML AST, which can look like the following in pseudocode:

Dictionary(elements=[
  DictElement(key=String("prods"), value=Array([
    Dictionary(elements=[
      DictElement(key=String("id"), value=String("a")),
      DictElement(key=String("ref"), value=String("b")),
      DictElement(key=String("min"), value=Number(1)),
      DictElement(key=String("cap"), value=String("A")),
    ]),
  ])),
])

But our language AST should be the following:

Grammar(
  productions=Productions([
    Name("a",
      Cap("A",
        Iter(
          min = 1,
          child = Ref(child=<here a reference/pointer to the object with `id: b`>)
        )
      )
    )
  ])
)

Here the types of the nodes are not the same as in the YAML AST. In the YAML AST the node types only describe what is written in the YAML, while in our AST the node types are semantic and domain-specific. The reference to b is recognized, looked up in the index and populated. Also, in the YAML AST the nodes within the dict are all at the same level, but in our AST there are multiple nested nodes, and the way they are nested cannot be changed: if one swaps the nesting, one gets nonsense. So the parser has to correctly determine which syntactic features belong to which node and in which order the nodes must be nested. In my handcoded parser this is done via a series of functions applied to each dict, and the order in which the functions are applied determines which type of node ends up closer to the root of the resulting AST subtree.
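To make that concrete, here is a minimal Python sketch of the "series of functions" approach (the class and function names are invented for illustration; this is not the actual UniGrammar code):

class Node:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.__dict__.items())
        return f"{type(self).__name__}({args})"

class Ref(Node): pass
class Iter(Node): pass
class Cap(Node): pass
class Name(Node): pass

# Each wrapper looks at one key of the flat dict and, if the key is present,
# nests whatever was built so far inside a new domain-specific node.
def wrap_ref(d, node):
    # resolving the reference (finding the object with `id: b`) would be a later pass
    return Ref(child=d["ref"]) if "ref" in d else node

def wrap_iter(d, node):
    return Iter(min=d["min"], child=node) if "min" in d else node

def wrap_cap(d, node):
    return Cap(name=d["cap"], child=node) if "cap" in d else node

def wrap_name(d, node):
    return Name(id=d["id"], child=node) if "id" in d else node

# Innermost first; swapping the order would swap the nesting and produce nonsense.
WRAPPERS = [wrap_ref, wrap_iter, wrap_cap, wrap_name]

def parse_production(d):
    node = None
    for wrap in WRAPPERS:
        node = wrap(d, node)
    return node

print(parse_production({"id": "a", "ref": "b", "min": 1, "cap": "A"}))
# Name(id='a', child=Cap(name='A', child=Iter(min=1, child=Ref(child='b'))))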

But I don't like handcoded parsers: they are not easy to read and not easy to port.

Basically, my grammar above YAML can be described in natural language by the following statements:

Basically the idea described in this issue is to have

YAML is just an example. JSON-like markup languages, such as MsgPack, CBOR, UBJSON and Rison, have almost the same AST (with minor differences, e.g. float and int instead of just Number, i.e. there may be multiple types for numbers). For other languages, like CSV, similar logic applies.