Open KOLANICH opened 5 years ago
Yes, they surely can be utilised; there is just no framework for that in them. I mean that not all the languages above JSON/YAML can be specified as a JSON Schema; we need completions backed by custom code.
If we have a JSONSchema, we need to add to each property a function computing the lookup, then in a langserver check for that function, and if it is present, use it, passing it all the needed info.
I am not sure what is the best way to achieve that because the new schema should be readable from all the languages, so no functions.
One variant is to encode not functions in the JS sense, but to implement a DSL above YAML/JSON describing the ways to look up the suggestions (usually we want either the content of another property matching some criteria (which can also be encoded in YAML), or a path in some subtree (with adjustable delimiters), or a YAML ref of a specific type, or ...).
Then we can either transform that DSL into target language or interpret it on the fly.
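As a sketch of the "interpret it on the fly" option: a completion rule could be encoded as plain data next to a property in the schema, and a tiny interpreter in the language server could evaluate it. All names here (`x-suggestions`, `valuesOfProperty`) are hypothetical, not part of any real JSON Schema vocabulary:

```python
# Hypothetical schema fragment: a property whose completions are the values
# of another property found elsewhere in the document.
schema = {
    "properties": {
        "ref": {
            "type": "string",
            # hypothetical extension keyword encoding the lookup as data, not code
            "x-suggestions": {"kind": "valuesOfProperty", "property": "id"},
        }
    }
}

def suggest(rule, document):
    """Tiny interpreter for the hypothetical lookup DSL."""
    if rule["kind"] == "valuesOfProperty":
        key = rule["property"]
        out = []
        # walk the parsed document tree, collecting values of the named property
        def walk(node):
            if isinstance(node, dict):
                for k, v in node.items():
                    if k == key:
                        out.append(v)
                    walk(v)
            elif isinstance(node, list):
                for item in node:
                    walk(item)
        walk(document)
        return out
    raise ValueError("unknown rule kind: " + rule["kind"])

doc = {"prods": [{"id": "a", "ref": "b"}, {"id": "b"}]}
rule = schema["properties"]["ref"]["x-suggestions"]
print(suggest(rule, doc))  # ['a', 'b']
```

Since the rule is plain data, the same description could equally be compiled into target-language code instead of being interpreted.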
@ericprud, how much is your jsg aligned with this idea? Does it allow me to rewrite the parser https://github.com/UniGrammar/UniGrammar.py/tree/master/UniGrammar/ownGrammarFormat as a jsongrammar (yes, I know that jsg is in JS; if needed it can be rewritten into another language) and get the generated code doing the same (it may have its own objects for AST nodes, but it must do all the parsing, such as splitting a single dict into multiple AST nodes nested in each other in the correct order, and the validation itself)?
Let's see if I understand well enough to answer.
If you have an abstract syntax (AS), you can create a greenfield DSL to express it and/or take some generic format (XML, RDF, JSON) and use a format-specific schema language to constrain the structures to those that are sensible in your AS.
For the greenfield DSL, you would specify it in terms of a grammar that defines the set of permissible documents, tied to AS structures (like the beige boxes in http://shex.io/shex-semantics/#shexc) and execute it with a home-grown parser or a parser generator output (yacc, Antlr...).
For the generic formats, your schema language would ideally be expressive enough to adequately define the permissible documents. (http://shex.io/shex-semantics/#shexj performs well in that regard.) If the mapping from the parsed data structure to the AS isn't obvious, you'd probably want to add some glue like the aforementioned beige boxes. I don't know that I've seen that. There are certainly plenty of examples where the mapping is assumed to be obvious, like the mapping from FHIR/XML, FHIR/JSON or FHIR/RDF to the documented model.
When you say "DSLs upon other languages", I think you're talking about what I called the format-specific schema (FSS). For "DSL" (all by itself), I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON. ShExJ doesn't do that. If you write a broken internal reference, JSG won't notice. (IIRC, DTDs offered that validation for XML, and I don't know if XML Schema or RelaxNG validators do.)
My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar, but I see little hints that give me some sense of what a BNF for the AS would look like. I'd like to think that a parser-generator could consume some JSG and emit that python, but there are too many unknown unknowns for me to call that a prediction.
Aside: I've always thought it would be cool to have a parser construction kit, where you can reference productions in other grammars to build e.g. a native SVG parser rather than an SVG schema that operated over the output of a native XML parser. I'd hoped to prototype that in the yacker a couple decades ago, but never got to it.
I don't quite understand what you meant in the beginning of the text. I am not very familiar with RDF and ShEx, but from what I have read it seems to be a kind of logic-programming language for creating databases of facts. I haven't grasped the terminology used there.
> I think you mean either a meta language that can be translated into an FSS, or an anointed FSS that can be translated to others. In either case, some formats will require additional constraints to e.g. decide between elements and attributes in XML. You may also want to capture graph references (href->id, IDREF->ID) in tree languages like XML or JSON.
I don't fully understand what you mean, and taking into account

> My py-fu is too weak for me to really interpret ownGrammarFormat as a grammar but I see little hints that give me some sense of what a BNF for the AS would look like.

let me give you an example. UniGrammar is a transpiler of EBNF-like grammars between different parser generators. The grammars are essentially the same for different parser generators; only some nuances differ. UG has its own DSL for grammars, but for the syntax we use YAML. Not quite YAML: JSON would fit too, since when YAML is parsed, it results in the same structure as JSON.
Here is the example:
```yaml
prods:
- id: a   # this production's name is `a`
  ref: b  # the terminal or nonterminal named `b` is used
  min: 1  # one or more repetitions; in regexp syntax it is `+`
  cap: A  # make this element available in the parsed AST under the name `A`
```
This item is "flat": `id`, `min` and `ref` are siblings in the YAML AST, which can look like the following in pseudocode:
```
Dictionary(elements=[
    DictElement(key=String("prods"), value=Array([
        Dictionary(elements=[
            DictElement(key=String("id"), value=String("a")),
            DictElement(key=String("ref"), value=String("b")),
            DictElement(key=String("min"), value=Number(1)),
            DictElement(key=String("cap"), value=String("A")),
        ]),
    ])),
])
```
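Since (as noted above) JSON would fit just as well as YAML, the flatness of this structure can be observed with the stdlib `json` module alone (a small illustration using the key names from the example above):

```python
import json

# JSON equivalent of the YAML snippet above
doc = json.loads('{"prods": [{"id": "a", "ref": "b", "min": 1, "cap": "A"}]}')

prod = doc["prods"][0]
# all four keys are siblings at the same level -- the structure is "flat"
assert set(prod) == {"id", "ref", "min", "cap"}
print(sorted(prod))  # ['cap', 'id', 'min', 'ref']
```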
But our language AST should be the following:
```
Grammar(
    productions=Productions([
        Name("a",
            Cap("A",
                Iter(
                    min=1,
                    child=Ref(child=<here reference/pointer to object with `id: b`>)
                )
            )
        )
    ])
)
```
Here the types of nodes are not the same as in the YAML AST. In the YAML AST the node types say only what is written in the YAML, but in our AST the node types are semantic and domain-specific. The reference to `b` is recognized, searched for in the index, and populated. Also, in the YAML AST the nodes within the dict are at the same level, but in our AST there are multiple nested nodes. And the way they are nested cannot be changed: if one swaps the nesting, one gets nonsense. So the parser has to correctly determine which syntactic features belong to which node and in which order they must be nested. In my handcoded parser this is done via a series of functions applied to each dict; the order in which the functions are applied determines which type of node ends up closer to the root of the YAML AST subtree.
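The "series of functions applied to each dict" approach can be sketched roughly as follows (a hedged illustration, not the actual ownGrammarFormat code; node names follow the example above, and tuples stand in for real AST node classes):

```python
# Each wrapper consumes one key from the flat dict and nests the node built
# so far inside it; the order of application fixes which node type ends up
# closer to the root.

def wrap_name(d, child):
    return ("Name", d.pop("id"), child) if "id" in d else child

def wrap_cap(d, child):
    return ("Cap", d.pop("cap"), child) if "cap" in d else child

def wrap_iter(d, child):
    return ("Iter", d.pop("min"), child) if "min" in d else child

def wrap_ref(d, child):
    return ("Ref", d.pop("ref"), child) if "ref" in d else child

def parse_prod(d):
    d = dict(d)  # don't mutate the caller's dict
    node = None
    # applied innermost-first: Ref ends up deepest, Name becomes the root
    for wrap in (wrap_ref, wrap_iter, wrap_cap, wrap_name):
        node = wrap(d, node)
    return node

print(parse_prod({"id": "a", "ref": "b", "min": 1, "cap": "A"}))
# ('Name', 'a', ('Cap', 'A', ('Iter', 1, ('Ref', 'b', None))))
```

Reordering the tuple of wrappers would swap the nesting and, as noted above, produce nonsense; the order is part of the grammar.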
But I don't like handcoded parsers, they are not easy to read and not easy to port.
Basically, my grammar above YAML can be described in natural language by the following statements:

- The `Grammar` AST node is made from a `Dictionary` of the YAML AST; its `prods` key maps to our property `productions` of type `Productions`.
- `Productions` is itself a collection, made from a YAML array, each element of which must be either a `Name` or any other node allowed there, such as `Spacer`.
- A `Dict` having a `DictElement` with the key `String('id')` whose value is of type `String` is a `Name`, whose `id` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- A `Dict` having a `DictElement` with the key `String('cap')` whose value is of type `String` is a `Cap`, whose `name` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- A `Dict` having a `DictElement` with the key `String('min')` whose value is of type `Number` greater than or equal to 0 is an `Iter`, whose `min` must be set to the value of the `DictElement`; the rest of the `Dict` is further processed and the result is set to `child`.
- A `Dict` having a `DictElement` with the key `String('ref')` whose value is of type `String` is a `Ref`, whose `child` must be set to the previously created node of type `Name` whose `id` is equal to the value of the `DictElement`.
- An `Iter` must be encapsulated into its own `Name`, or into an element that must itself be encapsulated into its own `Name`.
- A `Cap` must be encapsulated into its own `Name`.
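The `ref` rule above (resolve to a previously created `Name` by its `id`) is essentially a symbol-table lookup. A hedged sketch of such two-pass resolution, with hypothetical minimal node classes:

```python
# Pass 1 indexes all Name nodes by id; pass 2 patches every Ref to point
# at the node it names, failing loudly on a dangling reference.

class Name:
    def __init__(self, id, child=None):
        self.id, self.child = id, child

class Ref:
    def __init__(self, target_id):
        self.target_id = target_id
        self.child = None  # filled in by pass 2

def resolve(names, refs):
    index = {n.id: n for n in names}      # pass 1: build the index
    for r in refs:
        if r.target_id not in index:
            raise KeyError("unresolved ref: " + r.target_id)
        r.child = index[r.target_id]      # pass 2: populate the pointer
    return refs

a = Name("a")
b = Name("b")
(r,) = resolve([a, b], [Ref("b")])
print(r.child is b)  # True
```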
Basically, the idea described in this issue is to have a rule like "`ref` must be a previously-mentioned `id`", meaning that when I activate autocompletion after `ref:`, the language server should make the editor suggest in the drop-down menu only the values that were used in `id`s prior to that item. YAML is just an example: JSON-like markup languages, such as MsgPack, CBOR, UBJSON and Rison, have almost the same AST (almost, i.e. there may be multiple types for numbers, such as float and int, instead of just Number). For other languages, like CSV, similar logic applies.
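That completion behaviour could be sketched as follows (a hypothetical helper, not an actual LSP implementation): given the index of the item under the cursor, offer only the `id`s declared in earlier items.

```python
def complete_ref(prods, cursor_index):
    """Suggest only ids declared before the item currently being edited."""
    return [p["id"] for p in prods[:cursor_index] if "id" in p]

prods = [
    {"id": "a", "ref": "?"},
    {"id": "b"},
    {"ref": ""},  # the cursor is in this item, right after `ref:`
]
print(complete_ref(prods, 2))  # ['a', 'b']
```

A real language server would also need to map the cursor position in the text back to the item index in the parsed AST, which is where the per-format backends come in.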
Project description
Background
Let's call a formal language a child language (please correct me if I'm wrong, I have not found any established terminology for what I am going to discuss here) if its parse tree is specified in terms of relations between AST nodes of another formal parent language. So, when you want to parse such a child language, you parse its parent language first and then process the parsed AST. There are some live examples (usually DSLs) of such:
As you can guess, the hierarchy can be arbitrarily deep (though having more than 2 levels will likely not be practical because of the overhead).
The benefit of such languages is that they inherit the ecosystem of existing tools and libs for their parent languages.

Problems
No support in language servers following the LSP standard.

Solution
This proposed solution is inspired by my experience with child languages based on JSON.

- Create a DSL describing child grammars as graph node features. If the parent language is tokenized plain text, its AST nodes are tokens. There are `terminal`s, `sequence`s and `object`s. `terminal`s correspond to strings, numbers and nulls in JSON; `sequence`s correspond to repeats of `terminal`s and `object`s; `object`s are nodes having multiple subnodes, each of them of a certain type and with a certain `id`. The `id` describes the semantics of a certain subnode. This info is extracted from a formal grammar of a language written in UniGrammar or our language.
- Encapsulate generation of code related to these features into separate objects called backends.
- Generate a parser.
- Specify relations between AST nodes and their possible values in a special DSL.
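The three node kinds from the steps above might be modelled like this (a sketch under my own assumptions; the issue does not specify the actual node model):

```python
from dataclasses import dataclass, field
from typing import Union

# Hypothetical node model for the proposed DSL:
# terminals cover strings/numbers/nulls, sequences cover repeats,
# objects group typed subnodes under semantic ids.

@dataclass
class Terminal:
    kind: str  # e.g. "string", "number", "null"

@dataclass
class Sequence:
    element: Union["Terminal", "Object"]
    min: int = 0  # minimum number of repetitions

@dataclass
class Object:
    subnodes: dict = field(default_factory=dict)  # id -> node; the id carries semantics

# e.g. a `prods` array of objects, each with string-valued `id` and `ref` subnodes
schema = Object({
    "prods": Sequence(Object({"id": Terminal("string"), "ref": Terminal("string")}))
})
print(type(schema.subnodes["prods"]).__name__)  # Sequence
```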
DSL
It is clear that the DSL should be similar to JSON Schema, but it cannot extend it, because it is not only about JSON; it should reason in terms of AST nodes, not types.
Relevant Technology
Complexity and required time
Complexity
Required time (ETA)