
DSL and a framework for building DSLs upon other languages #173

Open KOLANICH opened 5 years ago

KOLANICH commented 5 years ago

Project description

Background

Let's call a formal language a child language (please correct me if I'm wrong, I have not found any established terminology for what I am going to discuss here) if its parse tree is specified in terms of relations between AST nodes of another formal parent language. So, when you want to parse such a child language, you parse its parent language first, and then process the parsed AST.

There are some live examples (usually DSLs) of such languages:

As you can guess, the hierarchy can be arbitrarily deep (though having more than 2 levels will likely not be practical because of the overhead).

The benefit of such languages is that they inherit the ecosystem of existing tools and libraries built for their parent languages.

Problems

Solution

The proposed solution is inspired by my experience with child languages based on JSON.

DSL

It is clear that the DSL should be similar to JSON Schema, but it cannot extend it, because it is not only about JSON; so it should reason in terms of AST nodes, not types.
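For illustration only (the keys below are invented, not a proposal for the concrete syntax), compare a type-oriented JSON Schema rule with a rule phrased over AST nodes of the parent language:

# Purely illustrative sketch, written as Python literals. "Dictionary", "DictElement"
# and "String" name AST node kinds of the parent language, not JSON types.

# JSON-Schema-style, type-oriented:
type_rule = {"type": "object", "properties": {"id": {"type": "string"}}}

# AST-node-oriented, the way the proposed DSL would reason:
node_rule = {
    "node": "Dictionary",
    "elements": [
        {"node": "DictElement", "key": {"node": "String", "value": "id"}},
    ],
}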

Relevant Technology

Complexity and required time

Complexity

Required time (ETA)

ricardosilvaperes commented 5 years ago

I think there are already implemented language servers for both languages, JSON and YAML.

KOLANICH commented 5 years ago

Yes, they surely can be utilised, it's just that there is no framework for that in them. I mean that not all the languages built above JSON/YAML can be specified as a JSON Schema; we need completions computed by custom code.

If we have a JSON Schema, we need to attach to each property a function computing the lookup, then have the language server check for that function and, if it is present, use it, passing it all the needed info.

I am not sure what the best way to achieve that is, because the new schema should be readable from all the languages, so no functions.

One variant is to encode not functions in the JS sense, but to implement a DSL above YAML/JSON describing the ways to look up the suggestions (usually we want either the content of another property matching some criteria (which can also be encoded in YAML), or a path in some subtree (with adjustable delimiters), or a YAML ref of a specific type, or ...).

Then we can either transform that DSL into the target language or interpret it on the fly.
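For instance, the "content of another property matching some criteria" case could be encoded as plain data and interpreted on the fly. A minimal Python sketch (the descriptor keys and the helper function are invented for illustration, not an existing format):

# Hypothetical lookup descriptor: collect the values of another property found at
# a given path in the already-parsed document tree. It is plain data, so it stays
# readable from any language.
completion_spec = {
    "kind": "property-values",
    "path": ["prods", "*", "id"],
}

def interpret_lookup(spec, document):
    """Walk the parsed document and collect completion candidates."""
    if spec["kind"] != "property-values":
        raise NotImplementedError(spec["kind"])
    nodes = [document]
    for step in spec["path"]:
        nxt = []
        for node in nodes:
            if step == "*" and isinstance(node, list):
                nxt.extend(node)
            elif isinstance(node, dict) and step in node:
                nxt.append(node[step])
        nodes = nxt
    return [n for n in nodes if isinstance(n, str)]

doc = {"prods": [{"id": "a", "ref": "b"}, {"id": "b"}]}
print(interpret_lookup(completion_spec, doc))  # ['a', 'b'], candidates for a `ref:` value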

KOLANICH commented 3 years ago

@ericprud, how well is your jsg aligned with this idea? Does it allow me to rewrite the parser https://github.com/UniGrammar/UniGrammar.py/tree/master/UniGrammar/ownGrammarFormat as a jsongrammar (yes, I know that jsg is in JS; if needed it can be rewritten into another language) and get the generated code doing the same (it may have its own objects for AST nodes, but it must do all the parsing (such as splitting a single dict into multiple AST nodes, nested in each other in the correct order) and validation itself)?

ericprud commented 3 years ago

Let's see if I understand well enough to answer.

If you have an abstract syntax (AS), you can create a greenfield DSL to express it and/or take some generic format (XML, RDF, JSON) and use a format-specific schema language to constrain the structures to those that are sensible in your AS.

For the greenfield DSL, you would specify it in terms of a grammar that defines the set of permissible documents, tied to AS structures (like the beige boxes in http://shex.io/shex-semantics/#shexc) and execute it with a home-grown parser or a parser generator output (yacc, Antlr...).

For the generic formats, your schema language would ideally be expressive enough to adequately define the permissible documents. (http://shex.io/shex-semantics/#shexj performs well in that regard.) If the mapping from the parsed data structure to the AS isn't obvious, you'd probably want to add some glue like the aforementioned beige boxes. I don't know that I've seen that. There are certainly plenty of examples where the mapping is assumed to be obvious, like the mapping from FHIR/XML, FHIR/JSON or FHIR/RDF to the documented model.

When you say "DSLs upon other languages", I think you're talking about what I called the format-specific schema (FSS). For "DSL" (all by itself), I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON. ShExJ doesn't do that. If you write a broken internal reference, JSG won't notice. (IIRC, DTDs offered that validation for XML, and I don't know if XML Schema or RelaxNG validators do.)
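For example (illustrative data, not taken from any of the schemas mentioned above), a purely structural validator sees nothing wrong here even though the reference target does not exist:

# A tree-shaped document whose `ref` points at a non-existent `id`; a schema that
# only constrains types and nesting cannot flag the dangling reference.
doc = {
    "nodes": [{"id": "a"}, {"id": "b"}],
    "edges": [{"ref": "c"}],  # broken: there is no node with id "c"
}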

My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar, but I see little hints that give me some sense of what a BNF for the AS would look like. I'd like to think that a parser-generator could consume some JSG and emit that Python, but there are too many unknown unknowns for me to call that a prediction.

Aside: I've always thought it would be cool to have a parser construction kit, where you can reference productions in other grammars to build e.g. a native SVG parser rather than an SVG schema that operated over the output of a native XML parser. I'd hoped to prototype that in the yacker a couple decades ago, but never got to it.

KOLANICH commented 3 years ago

I don't quite understand what you meant at the beginning of the text. I am not very familiar with RDF and ShEx, but from what I have read it seems to be a kind of logic-programming language for creating databases of facts. I haven't got the terminology used there.

I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON.

I don't fully understand what you mean, and taking into account

My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar but I see little hints that give me some sense of what a BNF for the AS would look like.

, let me give you an example. UniGrammar is a transpiler of EBNF-like grammars for different parser generators. The grammars are essentially the same across parser generators; only some nuances differ. UG has its own DSL for grammars, but for the syntax we use YAML. Not quite YAML: JSON would fit too, since when YAML is parsed, it results in the same structure as JSON.

Here is the example:

prods:
  - id: a  # this production name is `a`
    ref: b  # terminal or nonterminal under name `b` is used
    min: 1  # one or more repetitions, in regexp syntax it is `+`
    cap: A  # make this element available in the parsed AST under the name `A`

This item is "flat": id, ref, min and cap are siblings in the YAML AST, which can look like the following in pseudocode:

Dictionary(elements=[
  DictElement(key=String("prods"), value=Array([
    Dictionary(elements=[
      DictElement(key=String("id"), value=String("a")),
      DictElement(key=String("ref"), value=String("b")),
      DictElement(key=String("min"), value=Number(1)),
      DictElement(key=String("cap"), value=String("A")),
    ]),
  ])),
])

But our language AST should be the following:

Grammar(
  productions=Productions([
    Name("a",
      Cap("A",
        Iter(
          min = 1,
          child = Ref(child=<here a reference/pointer to the object with `id: b`>)
        )
      )
    )
  ])
)

Here the types of the nodes are not the same as in the YAML AST. In the YAML AST the node types only describe what is written in the YAML, while in our AST the node types are semantic and domain-specific. The reference to b is recognized, looked up in the index and populated. Also, in the YAML AST the nodes within the dict are all at the same level, but in our AST there are multiple nested nodes, and the way they are nested cannot be changed: if one swaps the nesting, one gets nonsense. So the parser has to correctly determine which syntactic features belong to which node and in which order the nodes must be nested. In my handcoded parser this is done via a series of functions applied to each dict, and the order in which the functions are applied determines which type of node ends up closer to the root of the resulting AST subtree.
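To make that concrete, here is a minimal Python sketch of the "series of functions" approach (the class and function names are invented for illustration; this is not the actual UniGrammar code):

class Node:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        args = ", ".join(f"{k}={v!r}" for k, v in self.__dict__.items())
        return f"{type(self).__name__}({args})"

class Ref(Node): pass
class Iter(Node): pass
class Cap(Node): pass
class Name(Node): pass

# Each wrapper looks at one key of the flat dict and, if the key is present,
# nests whatever was built so far inside a new domain-specific node.
def wrap_ref(d, node):
    # resolving the reference (finding the object with `id: b`) would be a later pass
    return Ref(child=d["ref"]) if "ref" in d else node

def wrap_iter(d, node):
    return Iter(min=d["min"], child=node) if "min" in d else node

def wrap_cap(d, node):
    return Cap(name=d["cap"], child=node) if "cap" in d else node

def wrap_name(d, node):
    return Name(id=d["id"], child=node) if "id" in d else node

# Innermost first; swapping the order would swap the nesting and produce nonsense.
WRAPPERS = [wrap_ref, wrap_iter, wrap_cap, wrap_name]

def parse_production(d):
    node = None
    for wrap in WRAPPERS:
        node = wrap(d, node)
    return node

print(parse_production({"id": "a", "ref": "b", "min": 1, "cap": "A"}))
# Name(id='a', child=Cap(name='A', child=Iter(min=1, child=Ref(child='b'))))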

But I don't like handcoded parsers: they are not easy to read and not easy to port.

Basically, my grammar above YAML can be described in natural language by the following statements:

Basically the idea described in this issue is to have

YAML is just an example. JSON-like markup languages, such as MsgPack, CBOR, UBJSON and Rison, have almost the same AST (with minor differences, e.g. float and int instead of just Number, i.e. there may be multiple types for numbers). For other languages, like CSV, similar logic applies.