mewspring / speak

Experimental compiler construction playground.
The Unlicense

Project aim and design discussions #1

Open mewmew opened 7 years ago

mewmew commented 7 years ago

This is a meta issue to discuss various design decisions related to the speak tool.

NOTE: This section is under construction, and will be revised based on continuous discussions, so feel free to join.

(The intention is to explore a large set of ideas that may or may not be suited for implementation in Gocc. Implementing them from the ground up opens up the design space, and makes it possible to try wild ideas that would be cumbersome to implement in the current code base of Gocc. If the workflow of speak ends up substantially deviating from that of Gocc, it may make sense to keep both tools around.)

Aim and Objectives

The aim of this project is to create an experimental playground for compiler construction.

Objectives:

  1. Generate lexers and parsers from EBNF grammars.
    • Note: making use of Extended Backus-Naur Form has been tried before by Marius, the original author of Gocc, and was implemented in what was supposed to become Gocc version 3. In the end, Marius decided against using EBNF, as it made the production actions more complicated to reason about. The intention here is to explore whether EBNF makes sense once production actions are no longer specified in the EBNF grammar itself but kept in some alternative representation, since objective 2 mandates that the EBNF grammar be language-agnostic. It may very well be the case that using BNF instead of EBNF also makes sense for speak. However, until we have tried both, there is no way of knowing.
  2. Make the EBNF grammars language-agnostic (with regards to the language of the parser).
    • Generate shift/reduce tables in CSV or JSON format (see goccmack/gocc#39).
    • Explore different ways to write/generate lexers and parsers in different programming languages that read the shift/reduce tables (and tables recording the regular expressions of tokens).
  3. Investigate whether it makes sense to separate the speak compiler construction playground into a set of tools, each intended to solve a specific problem of the larger lexer and parser generation task.
    • [x] Step 1: One tool analyzes the input grammar and extracts regular expressions for each terminal of the grammar, recording them as JSON output.
    • [ ] Step 2a: Generate a DFA graph covering the regular expressions of each terminal of the input grammar. This is an experimental step, to see whether the input to step 2b should be a DOT graph or a JSON file containing the regular expressions of each terminal.
    • [ ] Step 2b: A second tool parses the JSON file and is in charge of generating lexer(s) to tokenize the language recognized by the input grammar. This tool may generate lexer libraries in one or more programming languages. Alternatively, there may exist one tool per lexer library, written in the programming language of that lexer library (as programming languages often have mature environments for creating ASTs in their own language).
    • [ ] Step 3: A third tool analyzes the input grammar and generates shift/reduce tables covering each non-terminal production rule of the input grammar. The shift/reduce tables may be stored either as CSV or using JSON encoding (a rough sketch of one possible encoding follows this list).
    • [ ] Step 4: A fourth tool parses the shift/reduce tables and creates parser libraries in one or more programming languages. Alternatively, there may exist one tool per parser library, written in the programming language of that parser library.
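
To make step 3 slightly more concrete, here is a rough sketch of how such shift/reduce tables could be represented on the Go side before being marshaled to JSON. The field names and layout are assumptions made for the sake of discussion, not a settled format.

// Action encodes a single LR parser action for a (state, terminal) pair.
type Action struct {
   Kind   string `json:"kind"`             // "shift", "reduce", "accept" or "error"
   Target int    `json:"target,omitempty"` // state to shift to, or production to reduce by
}

// Table holds the ACTION and GOTO tables of the generated parser, indexed by state.
type Table struct {
   Actions []map[string]Action `json:"actions"` // Actions[state][terminal]
   Gotos   []map[string]int    `json:"gotos"`   // Gotos[state][non-terminal]
}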

Open questions: Where should production actions be stored, and what might they look like? The main issue is that the EBNF grammar should remain language-agnostic, and may therefore only contain information about the source/input programming language, not about the library/target programming languages.

The idea is to create one EBNF grammar per source/input programming language, and a set of associated production action files, one per library/target programming language.

Any ideas on how to do this in a clean fashion, one which makes the mapping between the EBNF input grammar and the production actions clear, and which facilitates generating lexers and parsers for the source language in several different programming languages?
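
For the sake of discussion, one possible (entirely hypothetical) shape for such a production action file on the Go side is sketched below: actions are looked up by production name when the generated parser reduces, so the grammar itself never embeds any Go code. The production name "BinaryExpr" and the types involved are made up for illustration.

// Attrib is the value produced by a reduction; the generated parser passes
// the values of the production's right-hand-side symbols in order.
type Attrib interface{}

// BinExpr is a hypothetical AST node used by the example action below.
type BinExpr struct {
   X, Y Attrib
   Op   string
}

// ByProduction maps production names of the language-agnostic EBNF grammar to
// their reduce actions for the Go target; other target languages would ship a
// file of their own with the same keys.
var ByProduction = map[string]func(args []Attrib) (Attrib, error){
   "BinaryExpr": func(args []Attrib) (Attrib, error) {
      // args[0] = left operand, args[1] = operator token, args[2] = right operand.
      return BinExpr{X: args[0], Op: string(args[1].([]byte)), Y: args[2]}, nil
   },
}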

As a side note, this effort is also very much intended to help solve the slow compilation and generation times associated with Gocc as the input grammar becomes more complex. The llir/llvm project is currently experiencing lexer and parser generation times of 1.5 minutes, which sometimes makes coding less enjoyable.

The key aim of this compiler construction playground is to make language analysis more fun again.

CC: @awalterschulze Any thoughts, input or ideas? I think you are currently hitting the same walls as I am with Gocc. Would be great to flesh out ideas for how we can improve the situation : )

mewmew commented 7 years ago

@reedkotler Perhaps you have some input with regards to the open question of how production actions may be stored. You mentioned you had previous experience working with tools that separated the production actions from the otherwise language-agnostic input grammar. How did these tools work? Did you write two files, one for the grammar and one for the production actions (i.e. what happens on reduce)?

Any sources you could point us to for inspiration? I'll play around with a few different approaches over the coming months, and would love additional input on approaches that are known to work.

Cheers /u

mewmew commented 7 years ago

To give an idea of what the input and output of these tools may look like, a proof of concept tool for step 1 has been created. The terms command extracts regular expressions for terminals of a given input grammar, and outputs them as JSON.

Example output in JSON from parsing the uC input grammar expressed in EBNF. (Note, under the covers, the terms tool makes use of the golang.org/x/exp/ebnf package, which is why the EBNF syntax looks as it does.)

{
    "names": [
        {
            "id": "ident",
            "reg": "[A-Z_a-z][0-9A-Z_a-z]*"
        },
        {
            "id": "int_lit",
            "reg": "[0-9][0-9]*"
        }
    ],
    "tokens": [
        "!",
        "!=",
        "\u0026\u0026",
        "(",
        ")",
        "*",
        "+",
        ",",
        "-",
        "/",
        ";",
        "\u003c",
        "\u003c=",
        "=",
        "==",
        "\u003e",
        "\u003e=",
        "[",
        "]",
        "else",
        "if",
        "return",
        "while",
        "{",
        "}"
    ]
}
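
As a hedged illustration of how a downstream tool (e.g. the lexer generator of step 2b) might consume this output, the following minimal Go program decodes the JSON above; the struct simply mirrors the "names" and "tokens" keys, and the file name terms.json is a made-up placeholder.

package main

import (
   "encoding/json"
   "fmt"
   "os"
)

// Terms mirrors the JSON output of the terms tool.
type Terms struct {
   Names []struct {
      ID  string `json:"id"`  // terminal name, e.g. "ident"
      Reg string `json:"reg"` // regular expression recognizing the terminal
   } `json:"names"`
   Tokens []string `json:"tokens"` // literal tokens, e.g. "!=", "if", "while"
}

func main() {
   data, err := os.ReadFile("terms.json") // hypothetical output file of the terms tool
   if err != nil {
      panic(err)
   }
   var t Terms
   if err := json.Unmarshal(data, &t); err != nil {
      panic(err)
   }
   for _, n := range t.Names {
      fmt.Printf("%s = /%s/\n", n.ID, n.Reg)
   }
   fmt.Println("literal tokens:", t.Tokens)
}
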
awalterschulze commented 7 years ago

I think you might be able to push a lot of these ideas into gocc :)

gocc code generation speed

It would be really cool to speed this up, but I don't know if it's possible. Code generation time is not low, but I have always assumed that most of it goes into working out the parser tables, which I guessed has a high computational complexity. Maybe I am wrong, because I have never actually looked.

exporting regexes

I think exporting the regular expressions is easy, but I think gocc does a bit more than simply checking these regexes when conflicting regexes exist. Maybe it's again better to export a table? But I am only partially recalling conversations from years ago, so I might be wrong again.
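
(To make the kind of conflict concrete: in the uC example above, the keyword "if" also matches the ident regex, so a lexer built purely from the exported regexes needs an extra rule to decide between them, and an exported table would have to record that choice. A tiny sketch of the usual tie-break, assuming keywords simply take precedence over identifiers, with names made up for illustration:)

// keywords would be derived from the "tokens" list of the terms output.
var keywords = map[string]bool{"else": true, "if": true, "return": true, "while": true}

// classify resolves an ident/keyword conflict by giving keywords priority.
func classify(lit string) string {
   if keywords[lit] {
      return lit // keyword token, e.g. "if"
   }
   return "ident"
}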

language specific ast creation

You mentioned:

programming language often have mature environments for creating ASTs in their own language

Do you have some examples?

language agnostic ast creation

Some of my ideas on this topic have been towards specifying a BNF with production rules that are target-language-agnostic, using an AST specified in some language-agnostic way (for example protobuf), together with some standard functions, like tokenToString, parseInt, etc. Then the tool you described in Step 4 can generate the tables to parse the input into a struct generated from the "protobuf" source. This way each parser, no matter the language, produces the same AST.
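
(As a rough sketch of what such standard functions could look like on the Go side; the names come from the paragraph above, but the signatures and package name are assumptions:)

package speakrt // hypothetical per-target runtime support package

import "strconv"

// tokenToString converts a raw token into the target language's string type.
func tokenToString(tok []byte) string { return string(tok) }

// parseInt converts a raw token into an integer value.
func parseInt(tok []byte) (int64, error) {
   return strconv.ParseInt(string(tok), 10, 64)
}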

dream

My personal dream would be to be able to spec a language once in BNF (or whatever works) and have a parser be produced in any* language of my choice. This way the language spec itself can become a contract, like protobufs (or XSchema) are now contracts between languages.

mewmew commented 7 years ago

Hi @awalterschulze,

Thanks for getting back to me! I think I share your dream. I would also very much like to specify the grammar of a language in one BNF (or whatever works..) and generate lexers, parsers and such for a set of languages. The intention of this project is definitely to experiment with various designs that may bring about such a workflow.

Using protobuf or something similar may be interesting to look into, or some form of LISP dialect to represent ASTs in a language-agnostic fashion. For the time being, I'll focus my efforts on trying to generate language-agnostic tables for lexers (DFA transition and action tables) and parsers (shift/reduce tables), and then look into how one may tie into and make use of the shift/reduce tables in a good fashion. Experimenting is fun!
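
To show roughly what I have in mind for the lexer side, below is a minimal sketch of a table-driven lexer loop, assuming the DFA is exported as a transition table (state x character class -> next state) and an accept table (state -> token ID, or -1 for non-accepting states). The table layout and names are assumptions, not a settled speak format.

// DFA holds the exported lexer tables.
type DFA struct {
   Trans  [][]int  // Trans[state][class] = next state, or -1 on error
   Accept []int    // Accept[state] = token ID, or -1 if the state is non-accepting
   Class  [256]int // maps each input byte to its character equivalence class
}

// Token is a recognized lexeme.
type Token struct {
   ID  int
   Lit []byte
}

// Next scans the longest match starting at pos by walking the transition table.
func (d *DFA) Next(src []byte, pos int) (Token, int, bool) {
   state, lastID, end := 0, -1, pos
   for i := pos; i < len(src); i++ {
      next := d.Trans[state][d.Class[src[i]]]
      if next < 0 {
         break
      }
      state = next
      if d.Accept[state] >= 0 {
         lastID, end = d.Accept[state], i+1
      }
   }
   if lastID < 0 {
      return Token{}, pos, false // no token matches at pos
   }
   return Token{ID: lastID, Lit: src[pos:end]}, end, true
}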

Regarding your question on parsing X in X (Go in Go, Python in Python, etc...)

You mentioned:

programming language often have mature environments for creating ASTs in their own language

Do you have some examples?

I think exporting the regular expressions is easy, but I think gocc does a small bit more than simply checking these regexes, when conflicting regexes exist. Maybe its again better to export a table?

Definitely, I'm playing around with different ways to export this table right now. Should hopefully have something up and running within the next few days.

Cheers /u

awalterschulze commented 7 years ago

On Wed, 15 Feb 2017 at 02:48 Robin Eklind notifications@github.com wrote:

Hi @awalterschulze https://github.com/awalterschulze,

Thanks for getting back to me! I think I share your dream. I would also very much like to specify the grammar of a language in one BNF (or whatever works..) and generate lexers, parsers and such for a set of languages. The intention of this project is definitely to experiment with various designs that may bring about such a workflow.

For me the AST is an integral part of this dream. Actually it does not have to be an AST, but it has to be a parse tree or some structure that is the same independent of target language.

Using protobuf or something similar may be interesting to look into, or some for of LISP dialect to represent ASTs in a language-agnostic fashion. For the time being, I'll focus my efforts in trying to generate language-agnostic tables for lexers (DFA transition and action tables) and parsers (shift/reduce tables). And then look into how one may tie into and make use of the shift/reduce tables in a good fashion. Experimenting is fun!

Maybe a ParsedTree struct (class) can be generated from the EBNF as well?

Regarding your question on parsing X in X (Go in Go, Python in Python, etc...)

You mentioned:

programming language often have mature environments for creating ASTs in their own language

Do you have some examples?

Here we were talking past each other. I was asking what types of tools the target languages (like Go, Ruby, etc.) have that can transform a parsed tree into an AST.

I think exporting the regular expressions is easy, but I think gocc does a small bit more than simply checking these regexes, when conflicting regexes exist. Maybe its again better to export a table?

Definitely, I'm playing around with different ways to export this table right now. Should hopefully have something up and running within the next few days.

Cool.

Cheers /u


awalterschulze commented 7 years ago

Proposal: generate a ParsedTree instead of an AST and Protobufs

Given a grammar

A: "Hello" B* C
B: string_lit
C: int_lit | id

We could generate a ParsedTree struct:

// A corresponds to the production A: "Hello" B* C.
type A struct {
   Hello string // the literal "Hello" token
   B []B        // zero or more B
   C C
}

// B corresponds to the production B: string_lit.
type B []byte

// C corresponds to the production C: int_lit | id; only one of the two fields
// is set, depending on which alternative was parsed.
type C struct {
   int_lit []byte
   id []byte
}

Obviously there might be some harder examples and we would have to get this right in every language. But it's an idea.
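
(For illustration, a hedged sketch of how a generated, table-driven parser's reduce step could fill these types, assuming the B* repetition has already been collected into a []B by an earlier reduction; the symbols slice and its layout are assumptions:)

// reduceA builds the parsed tree node for production A: "Hello" B* C.
func reduceA(symbols []interface{}) A {
   return A{
      Hello: string(symbols[0].([]byte)), // the "Hello" literal token
      B:     symbols[1].([]B),            // the collected repetition of B
      C:     symbols[2].(C),              // the trailing C
   }
}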