[Suggest] Features for AST-like results

I like Parsimmon very much.

I've finished the beginner phase and have learnt enough about the basics of Parsimmon, so I think it's time to share my suggestions.

Problem

I made a simple CSON parser quickly and with few difficulties, thanks to the neat and powerful APIs of Parsimmon.

Everything is good, except the format of parsing results which leaves much to be desired.

The usual results are just the literal text of the input consumed by the rules.

What I want are objects with the following structure:

name: the rule name
index:
- offset
- line
- column
literal: the old value of the result
...: additional properties that vary from rule to rule

For example, when parsing "height: 177.7" against the rule "assignment", I get the result:

{
    "name": "assignment",
    "index": {"offset": 0, "line": 1, "column": 1},
    "id": {
        "name": "id-struct",
        "literal": "height",
        "index": {"offset": 0, "line": 1, "column": 1}
    },
    "v": {
        "name": "lit-n",
        "literal": "177.7",
        "index": {"offset": 8, "line": 1, "column": 9}
    }
}

The format is my version of AST-like node trees.

Other users may want their own versions, but these would be largely similar.

An AST is very useful.

For example, a single source AST can be compiled to many target languages, together with source maps.
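To make the source-map point concrete, here is a minimal sketch that walks the "assignment" node shown above and emits JavaScript text while recording where each child came from. The function name emitAssignment and the shape of the mapping records are assumptions made for illustration; a real source map would use the standard encoded format rather than plain objects.

```javascript
// The node from the "height: 177.7" example above.
const assignmentNode = {
  name: 'assignment',
  index: { offset: 0, line: 1, column: 1 },
  id: { name: 'id-struct', literal: 'height', index: { offset: 0, line: 1, column: 1 } },
  v: { name: 'lit-n', literal: '177.7', index: { offset: 8, line: 1, column: 9 } },
};

// Hypothetical emitter: produces `var <id> = <value>;` and, for each child,
// records the generated column together with the source line and column.
function emitAssignment(ast) {
  let code = '';
  const mappings = [];
  const emit = (child) => {
    mappings.push({
      name: child.name,
      generatedColumn: code.length + 1, // 1-based, like the source index
      sourceLine: child.index.line,
      sourceColumn: child.index.column,
    });
    code += child.literal;
  };
  code += 'var ';
  emit(ast.id);
  code += ' = ';
  emit(ast.v);
  code += ';';
  return { code, mappings };
}

const output = emitAssignment(assignmentNode);
console.log(output.code);
console.log(output.mappings);
```

Because every node carries its own index, the emitter never has to look at the original input again.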

Workaround

My program works, but is not neat.

I've built a parsing/converting toolkit to ease my writing of grammar rules.

Three of its features are relevant:

Named rules system.
- A dictionary rules, in which the keys are the names and the values are the rules;
- A function convert-rule to define and add new rules. (It does quite a bit of magic here.)
- A function rule-of to check for and get a rule by name. (If the rule does not exist yet, it uses P.lazy, because that usually means a recursive rule reference is happening.)
For example:
```
    convert-rule('lit-s1', /'(?:\\'|[^'])*?'/);
    convert-rule('lit-s2', /"(?:\\"|[^"])*?"/);
    convert-rule('lit-s', ['alt', 'lit-s1', 'lit-s2']);
```
"lit-s1" and "lit-s2" are the rules for single-quoted and double-quoted literals. "lit-s" is the rule P.alt( rule-of('lit-s1', rules), rule-of('lit-s2', rules) ).
In convert-rule, the old literal result is replaced with an object with the properties name and literal.
- The key code is rule = rule.map(mapping)
- mapping is a closure in which the variable rule-name is preserved and hence assigned to the property name whenever the rule is executed.
Then, rule = P.seqMap( P.index, rule, (index, r)=>{r.index = index; return r} ) adds the property index.
- I can understand why using P.index requires P.seqMap (or P.seq). Still, I think it is counterintuitive.
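The three features above can be sketched without any library. The following is not Parsimmon code and not my actual toolkit: a parser here is just a function (input, offset) that returns { value, end } or null, and the helpers regexp, alt, ruleOf, convertRule and indexAt are hypothetical stand-ins. The sketch only shows the idea of a named-rule dictionary with lookup deferred to parse time, plus wrapping each result as { name, literal, index }.

```javascript
// Hypothetical stand-in, not Parsimmon's API:
// a parser is a function (input, offset) => { value, end } | null.
const rules = {};

// Compute a 1-based line/column pair for a 0-based offset.
function indexAt(input, offset) {
  const lines = input.slice(0, offset).split('\n');
  return { offset, line: lines.length, column: lines[lines.length - 1].length + 1 };
}

// Match a regular expression exactly at the current offset (sticky flag).
function regexp(re) {
  const sticky = new RegExp(re.source, 'y');
  return (input, offset) => {
    sticky.lastIndex = offset;
    const m = sticky.exec(input);
    return m ? { value: m[0], end: offset + m[0].length } : null;
  };
}

// Try each alternative in order and return the first success.
function alt(...parsers) {
  return (input, offset) => {
    for (const p of parsers) {
      const r = p(input, offset);
      if (r !== null) return r;
    }
    return null;
  };
}

// Look a rule up by name only when parsing, so forward and recursive
// references work (the role P.lazy plays in the workaround).
function ruleOf(name) {
  return (input, offset) => {
    if (!rules[name]) throw new Error('Unknown rule: ' + name);
    return rules[name](input, offset);
  };
}

// Register a rule and replace its result with { name, literal, index }.
function convertRule(name, parser) {
  rules[name] = (input, offset) => {
    const r = parser(input, offset);
    if (r === null) return null;
    const node = { name, literal: input.slice(offset, r.end), index: indexAt(input, offset) };
    return { value: node, end: r.end };
  };
}

// 'lit-s' is defined before the rules it refers to; lookup happens later.
convertRule('lit-s', alt(ruleOf('lit-s1'), ruleOf('lit-s2')));
convertRule('lit-s1', regexp(/'(?:\\'|[^'])*?'/));
convertRule('lit-s2', regexp(/"(?:\\"|[^"])*?"/));

const parsed = rules['lit-s']('x = "hi"', 4);
console.log(parsed.value);
```

Note that this simplified wrapper discards the inner node produced by "lit-s2"; a fuller version would have to decide how to keep child results.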

Solution

It would be much better if Parsimmon provided...

A method to give a rule a name
- And a dictionary of the rules.
A transformer/interceptor/post-processor (whatever we call it) that lets users define their own version of the parsing result, with the transformation happening internally and implicitly.
- Users can focus on what the format looks like, without bothering about how to gather and mix data from here and there.

The transformer should be a callback such as this version of mine:

(name, literal, seqResults, index)=>{
    return {
        name: name, // the rule name
        literal: literal, // the literal consumed
        index: index,
        seqResults: seqResults
    }
}

Someone else may want:

(name, literal, seqResults, index)=>{
    var newResult = {
        name: name,
        literal: literal,
        seqResults: seqResults
    }
    Object.assign(newResult, index)
    return newResult;
}

seqResults is the array of partial results produced by P.seq and similar combinators.

For example, my rule "comment-single-line" is P.seq( P.regexp(/\/\/\s+/), P.regexp(/[^\n]+/) ). The result of parsing '// a single-line comment' is:

{
    "name": "comment-single-line",
    "literal": "// a single-line comment",
    "index": {"offset": 0, "line": 1, "column": 1},
    "seqResults": [
        "// ",
        "a single-line comment"
    ]
}

Note: P.regexp(/\/\/\s+/) and P.regexp(/[^\n]+/) don't have rule names, so their results are just literals.

The transformation should run before the usual P.map, so that the latter works on the result of the former.

After that, I add a custom mapping function with P.map to the rule "comment-single-line" in order to keep only the second member (the literal "a single-line comment") of seqResults.

The final result is:

{
    "name": "comment-single-line",
    "literal": "// a single-line comment",
    "index": {"offset": 0, "line": 1, "column": 1},
    "body": "a single-line comment"
}
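The two stages described above (the common transformer first, then a rule-specific mapping) can be sketched as plain functions. The name keepCommentBody is an assumption made for illustration; in practice that step would be the callback passed to .map on the rule.

```javascript
// Stage 1: the common transformer, in the shape proposed above.
const transformer = (name, literal, seqResults, index) => ({
  name,
  literal,
  index,
  seqResults,
});

// Stage 2 (hypothetical name): a rule-specific mapping that runs afterwards.
// It drops seqResults and keeps only its second member as "body".
const keepCommentBody = (node) => {
  const { seqResults, ...rest } = node;
  return { ...rest, body: seqResults[1] };
};

const commentNode = keepCommentBody(
  transformer(
    'comment-single-line',
    '// a single-line comment',
    ['// ', 'a single-line comment'],
    { offset: 0, line: 1, column: 1 }
  )
);
console.log(JSON.stringify(commentNode, null, 4));
```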

Further Concerns

Parsimmon is a global, stateless service.

Therefore, the transformation cannot be attached to Parsimmon itself; instead, it should be an optional config property of an "instance" of Parsimmon.

Also, a project may need to parse more than one language (mine does).

It would be highly preferable for users to be able to create different "instances", each with its own dictionary of rules and config properties.

Maybe a factory function could generate child services that expose almost all of Parsimmon's API but work on their own rules and configs.
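A rough sketch of what such a factory could look like follows. Everything in it is hypothetical (createInstance, define, parse and the transform config property are invented names, and a "matcher" is simplified to a function returning the end offset or -1). It only illustrates that each instance owns its rule dictionary and its optional transformer, and that without a transformer the result stays a plain literal.

```javascript
// Hypothetical factory: each instance has its own rules and config.
function createInstance(config = {}) {
  // Default transformation: return the literal, i.e. today's behaviour.
  const transform = config.transform || ((name, literal) => literal);
  const rules = {};
  return {
    // A matcher is (input, offset) => end offset, or -1 on failure.
    define(name, matcher) {
      rules[name] = matcher;
    },
    parse(name, input, offset = 0) {
      if (!rules[name]) throw new Error('Unknown rule: ' + name);
      const end = rules[name](input, offset);
      if (end < 0) return null;
      return transform(name, input.slice(offset, end), offset);
    },
  };
}

// A tiny matcher shared by both instances below.
const word = (input, offset) => {
  const m = /^[a-z]+/.exec(input.slice(offset));
  return m ? offset + m[0].length : -1;
};

const plainInstance = createInstance();
const astInstance = createInstance({
  transform: (name, literal, offset) => ({ name, literal, offset }),
});

plainInstance.define('id', word);
astInstance.define('id', word);

console.log(plainInstance.parse('id', 'height')); // the bare literal
console.log(astInstance.parse('id', 'height'));   // a node object
```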

Hi @grandsong!

Thank you for your suggestion. I know exactly what you're talking about when it comes to having a dictionary of parsing rules wrapped in lazy.

See: https://github.com/squiggle-lang/squiggle-lang/blob/master/src/parsimmon-salad.js and: https://github.com/squiggle-lang/squiggle-lang/blob/master/src/parser.js

That being said, I think something like that would probably be best suited as a library on top of Parsimmon.

As for ASTs, I think that's probably also a case where there's enough disagreement about how to represent them that it would be well suited to an additional library.

Frankly I think if you implemented AST node helpers and a lazy dictionary of rules as a wrapper library that exposed Parsimmon for the actual parsing bits, that would be a pretty great library, but people could still depend on Parsimmon directly.

Hi.

I am not proposing a change to existing APIs.

If users don't set their own transformation as an optional config property, the default transformation will return the literal, so the parsing results will remain as they are now.
Nor will anything be different if users don't use an "instance".

It is feasible to do all kinds of things like my workarounds on top of Parsimmon. The current Parsimmon is already very good.

But I think it would be much better if Parsimmon enabled users to build an organized parser system quickly and easily, without bothering (and, very likely, struggling) to implement a lot of low-level chores.

So far, my impression is that Parsimmon is good at parsing small DSLs into JS values so that some logic can run on them. It seems to be meant for runtime evaluation.

My intention, however, is to use Parsimmon for compiling languages to JS. I don't need to evaluate anything (for example, calling parseInt via .map for number literals). I need AST-like node trees to generate the text of the target JS code and source maps.

You create Parsimmon, so you know best about how the inside of Parsimmon works and how to improve it.

In contrast, on the outside, users like me will write (and then debug, debug, debug...) ill-considered, half-baked and error-prone code in various ungraceful styles.

For example, data like index exists on the inside and is used when an error occurs, but users on the outside have to write code like rule = P.seqMap( P.index, rule, (index, r)=>{r.index = index; return r} ), which is not easy to figure out and may have unexpected and undesirable effects. Or maybe this code is actually bad practice, but hell if I know.
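One concrete way the callback above can misbehave: it mutates whatever object the rule returned, so if a rule hands back the same object on every match (for instance a constant result), every node ends up sharing the last index written. The sketch below shows only the callbacks, outside of any parser; attachIndexInPlace and attachIndex are hypothetical names, and the second one is an assumed alternative that copies instead of mutating.

```javascript
// Mutating version, as in the workaround: writes onto the object it receives.
const attachIndexInPlace = (index, r) => {
  r.index = index;
  return r;
};

// Non-mutating alternative (an assumption, not an established practice):
// copy plain-object results, and box anything else (strings, arrays).
const isPlainObject = (r) =>
  r !== null && typeof r === 'object' && !Array.isArray(r);
const attachIndex = (index, r) =>
  isPlainObject(r) ? { ...r, index } : { literal: r, index };

// A result object that is reused across two matches.
const shared = { name: 'empty' };
const first = attachIndexInPlace({ offset: 0, line: 1, column: 1 }, shared);
const second = attachIndexInPlace({ offset: 5, line: 1, column: 6 }, shared);
console.log(first === second, first.index.offset); // same object, last index wins

// The copying version keeps the two nodes separate.
const base = { name: 'empty' };
const nodeA = attachIndex({ offset: 0, line: 1, column: 1 }, base);
const nodeB = attachIndex({ offset: 5, line: 1, column: 6 }, base);
console.log(nodeA.index.offset, nodeB.index.offset);

// A bare literal is boxed rather than having a property assigned to a string.
console.log(attachIndex({ offset: 8, line: 1, column: 9 }, '177.7'));
```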

It cost me quite a lot to complete my workarounds. Yes, they work. So far. Still, I dislike my code and worry that some bugs await me in the future.

That will be a big practical issue for people to fully enjoy the power of Parsimmon.

Moreover, if other users and I talk about Parsimmon, it will be hard to find common ground to start from, let alone invite and teach more people.

Parsimmon is neat and "pure". It's safe for it to stay that way, without helping users. Some users will give up; some, like me, will survive and go on.

However, the sooner people can get a truly usable middle-scale parser (system) up and running, the more they will accept, love and advocate Parsimmon.

The most valuable part of my proposal might be the optional transformer.

I believe that whatever formats people need, as long as they are more than the literal, people always need this feature.

It would be most preferable if users just define the common format of results so that they can happily move on to start using Parsimmon for real.

I think something like my parsimmon-salad module could cover the dictionary case. Basically all of the parsers are wrapped in lazy and injected into a common dependency-injected namespace, which allows you to easily test things or spread your parser over multiple files, and easily support recursive rules.

var ListParser = Parsimmon.createLanguage({
  _: function(lang) {
    return Parsimmon.whitespace;
  },
  Expr: function(lang) {
    return lang.Number.or(lang.List);
  },
  List: function(lang) {
    return Parsimmon.seqMap(
      Parsimmon.string('(').skip(lang._),
      lang.Expr.skip(lang._).many(),
      Parsimmon.string(')'),
      function(leftParen, items, rightParen) {
        return {
          type: 'List',
          items: items,
        };
      }
    )
  },
  Number: function(lang) {
    return Parsimmon
      .regexp(/[0-9]+/)
      .map(function(n) {
        return {
          type: 'Number',
          value: Number(n),
        };
      });
  }
});

I see your point about automatically creating objects that contain the source code location, because currently it's a little bit awkward to remember to add that in. Maybe such a createLanguage function could also augment your AST nodes with a location property automatically? What do you think of something like that?

Your parsimmon-salad is very simple. It is only one step further from your basic tutorials (or should just become a part of tutorials).

Any learner can figure out their own versions of "salad" after reading your basic tutorials and my Common Functions section.

Such versions will take a few minutes and vary little. But more real, practical difficulties and differences await (or should I say, will ambush) people in the future.

I already have my workarounds, which work well so far. But this is just my version of a solution, and it is neither neat nor general, let alone standard or best practice.

Imagine users re-inventing their own versions again and again. None can be as good as the solution that could be provided by you.

So new users will often choose to make their own, whether or not they learn from and borrow other people's code.

Only a few users will have patience or luck to find, recognize and follow a largely good practice among possible solution patterns.

Few will move smoothly to the real rule design phase, while the rest will suffer more or less frustration before they really create value based on Parsimmon.

Parsimmon right now is like an engine for "motorcycles". It's powerful, versatile and easy to use for some short and light-weight rides, or even performing shows.

However, for long-distance travel, it requires users to be good "riders" who can fight against wind, rain, danger and fatigue.

That's why four-wheel vehicles are better choices (at least, for most people who don't love challenges).

I am using Parsimmon for a more complex and systematic purpose. My workarounds are like an effort to build a "truck" which can run reliably and go far with tons of "cargo" (all boxed, therefore easy to load and safe to carry).

If a user wants to parse something as simple as JSON, the current Parsimmon is enough for them.

However, it also means that most parsers in the "market" are equally enough, so there is no strong demand to favour one over others.

Yes, Parsimmon has a very good taste indeed. People may like this, but that's a different story.

Let me talk about the niche of Parsimmon.

Some students use a parser as an exercise when studying computer science courses. They are not serious users, because their toys are not solutions to real-world problems and demands. And they are more interested in "how most parsers work" (mostly low-level details) than in "how to make use of a certain parser". I guess Parsimmon is not really a good learning tool for such educational purposes.

Some people use a parser to parse existing languages. If the language is simple and the input is short, they can pick a random parser or write their own. If the language is very complex and/or the input is long or frequent, they will look for a very mature and optimized parser specific to it, and usually they can find one.

So, the people left are new language designers/translators who have good reasons both to handcraft their own parsers and to base their work on a general-purpose parser engine.

They don't want to bother with low-level things. They want to fiddle with the grammar of the given language and get useful parsing results with friendly data for debugging.

Here is the niche of Parsimmon, I guess. And that's why I think it is not good enough yet.

I'm afraid this topic has been drifting in the wrong direction.

Let's focus on what common confusion and chores (wow, "CCC" for short) people will have to face and solve when using Parsimmon for middle-scale language parsing/compiling, and discuss whether it is preferable and reasonable for you (not a third party) to save users from these problems by providing an out-of-the-box solution integrated into (not outside of) Parsimmon.

How to implement such a solution should be discussed later.

This issue is too long a post.

I will write another issue as a continuation and restart of this one.

Closing this as discussion is now in https://github.com/jneen/parsimmon/issues/155
