kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Code review/feedback: parsing dictionary formats with Nearley #278

Closed: fauxneticien closed this issue 7 years ago

fauxneticien commented 7 years ago

Hi all,

I was hoping to be pointed in the right direction of using Nearley for my particular use case (in order to grammar even more goodly).

Use case

I'm using Nearley to validate and parse a "sort of" domain-specific format called Toolbox format or "backslash-coded lexicons". I say "sort of", because there is a fair bit of project-to-project variation. I'm quite sure I'll have to write project-specific grammars, so I'm hoping to make the process as streamlined as possible.

What each dictionary project has in common, though, is that the dictionaries are written as lines of key-value pairs (the keys vary from project to project: \word, \headword, \lx for lexeme, etc.). A simple example of a two-word dictionary in such a format might look like:

\word rouge
\partOfSpeech Adjective
\definition red

\word bonjour
\partOfSpeech Exclamation
\definition hello
\exampleFrench Il a dit 'bonjour'
\translationEnglish He said 'hello'

A [very hacky] Nearley grammar for this might look like:

dictionary  -> record:+                     # {% prettify_records %}

record      -> word partOfSpeech definition example:*

word        -> _NL:*    "\\word"                      _ _ABNL

partOfSpeech -> _NL     "\\partOfSpeech"              _ validPartOfSpeech
    validPartOfSpeech   -> "Adjective" | "Exclamation"

definition  -> _NL      "\\definition"                _ _ABNL

example     -> exampleFrench translationEnglish
    exampleFrench       -> _NL "\\exampleFrench"      _ _ABNL
    translationEnglish  -> _NL "\\translationEnglish" _ _ABNL

_NL         -> "\n"     # New line
_ABNL       -> [^\n]:+  # All but new line
_           -> " "      # Space

Suppose that {% prettify_records %}, once uncommented, then returns a JSON object like:

{
    "dictionary" : [
        {
            "word" :        "rouge",
            "partOfSpeech": "Adjective",
            "definition" :  "red",
            "examples" : []
        },
        {
            "word" :         "bonjour",
            "partOfSpeech" : "Exclamation",
            "definition" :   "hello",
            "examples" : [
                {
                     "exampleFrench":       "Il a dit 'bonjour'",
                     "englishTranslation" : "He said 'hello'"
                }
            ]
        }
    ]
}

With the data structured in JSON, we can run queries, or make derived output formats such as a searchable web version of the dictionary or generate a printable PDF form using LaTeX (both through Hugo).

Question: where to from my basic/hacky grammar?

I think I understand that I should probably be using a lexer of some form to first process the key-value pairs. While I've looked at Moo and its documentation, I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself, as opposed to the example in the docs (https://nearley.js.org/docs/tokenizers):

@{%
const moo = require("moo");

const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  times:  /\*|x/
});
%}

# Pass your lexer object using the @lexer option:
@lexer lexer

# Use %token to match any token of that type instead of "token":
multiplication -> %number %ws %times %ws %number {% ([first, , , , second]) => first * second %}

where the lexer classifies the chunks into the types ws, number, or times. For my use case, I'm guessing I need to derive these types from the backslash codes themselves (e.g. '\partOfSpeech Adjective' => { "type" : "partOfSpeech", "value" : "Adjective" }), which I assume/hope would simplify the grammar(s) I write to:

@lexer someCustomLexer

dictionary  -> record:+                                      # {% prettify_records %}
record      -> %word %partOfSpeech %definition %example:*

# No terminals down here!

Any hints/help towards a more sound way of parsing such formats would be highly appreciated.

Thanks!

tjvr commented 7 years ago

I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself

So, this is the part where you're confused. (I'm not entirely sure where you've got this idea from, so I'm not sure how we can fix the nearley and/or moo documentation! :-))

Let's write you a tokenizer, first:

const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  key: /\\[a-zA-Z]+/,
  space: / +/,
  contents: /.+$/,
})

The backslash-keys will look like {type: 'key', text: '\definition'}.

Perhaps the part you're missing is that in nearley postprocessors, you get the entire Token object? So you can pull out the token's value in a postprocessor.

field -> %key %space %contents {% ([key, _, contents]) => ({key: key.text, contents: contents.text}) %}

It seems to me that just combining these {key, contents} pairs into a single object will get you most of the way there...

record -> field             {% ([{key, contents}]) => ({[key]: contents}) %}
        | record %nl field  {% ([record, _, {key, contents}]) => Object.assign({}, record, {[key]: contents}) %}

(Apologies for my abuse of ES6 features!)

Final note that might help, from the tokenizer docs:

Note that when using a tokenizer, raw strings match full tokens parsed by Moo. This is convenient for matching keywords.

ifStatement -> "if" condition "then" block

Feel free to ask more questions here. Thanks for such a well-written issue, sorry for my confusing response :-)

fauxneticien commented 7 years ago

Thanks for the hints, @tjvr!

I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself

So, this is the part where you're confused. (I'm not entirely sure where you've got this idea from, so I'm not sure how we can fix the nearley and/or moo documentation! :-))

Entirely from my imagination, I assume 😆. I think my issue is somewhat similar to what I've seen on #274. I was/am thinking that what I want is to be able to use certain types of tokens as non-terminals in the Nearley syntax. Using @Hardmath123's response from #274, here's my attempt at doing it:

@lexer lexer

dictionary   -> record:+                {% function(data) {return { "dictionary" : data[0] }} %}

record       -> lexeme partOfSpeech:+   {% prettify_record %}

lexeme       -> %nl:* %lx               {% ([nl, token]) => value_only(token) %} 
partOfSpeech -> %nl   %ps               {% ([nl, token]) => value_only(token) %}

@{%

const moo = require('moo');

const lexer = moo.compile({
    nl:      { match: '\n', lineBreaks: true },
    data:    { match: /\\[a-zA-Z]+\s[^\n]+/, lineBreaks:true }
});

// Strip the leading backslash code (e.g. "\lx ") so only the value remains
const value_only = function(data_string) {
    return data_string.value.replace(/^\\[a-z]+\s/, "")
}

// Build a custom token matcher: nearley calls test() on each token to
// decide whether it counts as this terminal (e.g. %lx, %ps below)
const match_code = function(test_code) {
    return {
        test: function(token) {
            const match_result = token.value.match(/\\([a-z]+)/);

            if(match_result == null) {
                return false
            } else {
                return match_result[1] == test_code
            }
        }
    };
}

const lx  = match_code("lx");
const ps  = match_code("ps");

const prettify_record = function(data) {
    // data[0] is the lexeme value; data[1] is the array of partOfSpeech values
    return {
        "lexeme" : data[0],
        "partOfSpeech" : data[1]
    }
}

%}

Given the data:

\lx hello
\ps exclamation

\lx record
\ps noun
\ps verb

the grammar above produces:

{
    "dictionary": [
        { "lexeme": "hello",    "partOfSpeech": ["exclamation"] },
        { "lexeme": "record",   "partOfSpeech": ["noun", "verb"] }
   ]
}

The main aim is to create/infer hierarchical structures from a flat listing of key-value pairs, and these hierarchical structures (as you can imagine) can get quite complex quite quickly; e.g. a word with multiple senses, like to record (the verb) vs. a record (the noun), each having multiple examples and further embedded elements. I was thinking it's best to do the hierarchy building in the regular Nearley processing bit (as opposed to within the post-processors), as sketched below. I could be entirely wrong.

But like you've said in the other thread @tjvr, there might be "no clever way of doing this" (at least for now?).

Thanks!

tjvr commented 7 years ago

I don't think #274 is relevant here.

Do you care in what order the fields appear? I assumed you didn't, which perhaps is why my initial response wasn't very helpful...

tjvr commented 7 years ago

be able to use certain types of tokens as non-terminals in the Nearley syntax

Right, that's what the part I quoted in the docs is talking about. Clearly we should explain that better! The idea is that you can write "\\lx" in your grammar, to match a token with the text \lx.

lexeme       ->  "\\lx" %space %contents

You should split the contents from the backslash-key in your tokenizer, not in a postprocessor (much like the lexer I wrote earlier). You could always include the space in the key, if you like:

const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  key: /\\[a-zA-Z]+ /,
  contents: /.+$/,
})
lexeme       ->  "\\lx " %contents

tjvr commented 7 years ago

I wouldn't recommend it in this case, but if you're really keen to have the type of a token change based on its value, you can always use Moo's keyword feature.

const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  unknownKey: {match: /\\[a-zA-Z]+ /, keywords: {
    lexeme: '\\lx ',
    partOfSpeech: '\\ps ',
  }},
  contents: /.+$/,
})

...and then you can match against e.g. %lexeme or %partOfSpeech in your grammar.

Keys not defined in the keywords object above will get the type %unknownKey, which will, helpfully, cause a syntax error (unless you match %unknownKey in your grammar, of course).

fauxneticien commented 7 years ago

I don't think #274 is relevant here.

Do you care in what order the fields appear? I assumed you didn't, which perhaps is why my initial response wasn't very helpful...

I do, actually! I should have been more explicit about this, given that it's one of the key aims (to enforce that the project linguists enter the data in a certain order).

be able to use certain types of tokens as non-terminals in the Nearley syntax

Right, that's what the part I quoted in the docs is talking about. Clearly we should explain that better! The idea is that you can write "\lx" in your grammar, to match a token with the text \lx.

Ah, that part makes sense now: it means I can mix literals in "..." and tokenised objects from the lexer (%...) in my grammar. I had thought I was obliged to use items from (and only from) the tokeniser output. Much clearer now!

This worked out really great, thanks! A bonus for us is also that the .ne file itself can be verbose enough to act as a data entry guide for the linguists, i.e. "a record is made up of the lexeme, and one or more parts of speech, each of which must be a valid part of speech: noun, etc.".

If it helps to demo, please feel free to use my current grammar/test data, where lexeme and partOfSpeech make use of raw strings ("\\lx ", "\\ps "), tokens from Moo (%newline), and also other Nearley (non-)terminals (validPartOfSpeech).

I wouldn't recommend it in this case, but if you're really keen to have the type of a token change based on its value, you can always use Moo's keyword feature.

Just out of curiosity/to learn more about Moo, what would be the disadvantage of using the keyword feature (especially if %unknownKey, rightfully, fails with a syntax error and tells the user about it)?

tjvr commented 7 years ago

what would be the disadvantage of using the keyword feature

There isn't one, really; they're just different ways of doing things.

It just seems to me that in this case, it would be more effort to have to define all your keywords up-front in the tokenizer, as well as in the grammar!


Thanks for taking the time to write a detailed description of your problem; it's much easier to help people when they help us. :-) We should definitely document tokenizers better, and your feedback should help with that.

Glad I could help :-)

fauxneticien commented 7 years ago

Just one [more] issue!

The grammar above works just fine with nearley-test in the terminal, but not in the browser. Do you have any idea why this might be? Am I using the feed() function incorrectly (cf. the CodePen demo below)?

Working with nearley-test:

[asciicast: nearley-test session in the terminal]

But not in the browser (CodePen link):

Notice the Unexpected "\\" error in the Chrome JavaScript console:

[screenshot: Chrome console showing the Unexpected "\\" error]

kach commented 7 years ago

Hi @fauxneticien!

The catch is that you need to tell nearley that you're using a lexer. The recommended way to do this is by instantiating a Parser this way:

var p = new nearley.Parser(
    nearley.Grammar.fromCompiled(grammar),
    // {keepHistory: true} if you want
);

In fact, fromCompiled is how we want everyone to be instantiating Parsers. The "grammar.ParserRules, grammar.ParserStart" idiom is a vestige from before @tjvr came and implemented lexer support in nearley. Now it's only supported for the sake of backwards-compatibility; importantly, it does not tell nearley about the existence of the lexer. So, in your CodePen, nearley tries to parse your string scannerlessly and gets confused when it sees a stream of characters instead of moo tokens.

By the way, the reason your example works in nearley-test is that nearley-test uses fromCompiled internally. So I don't think this is a browser-vs.-node issue. I forked your CodePen; is this what you were expecting? https://codepen.io/anon/pen/eEjjzJ?editors=1010

One small note: parser.feed() doesn't return the results, since you can call .feed() multiple times (nearley is a streaming parser!). To actually read the results, look at the "parser.results" property. See here for more information: https://nearley.js.org/docs/parser

Hope this helps! :-)

fauxneticien commented 7 years ago

In fact, fromCompiled is how we want everyone to be instantiating Parsers. The "grammar.ParserRules, grammar.ParserStart" idiom is a vestige from before @tjvr came and implemented lexer support in nearley.

Ah, gotcha—thanks @Hardmath123! Everything's working as expected now.