Closed: fauxneticien closed this issue 7 years ago
I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself
So, this is the part where you're confused. (I'm not entirely sure where you've got this idea from, so I'm not sure how we can fix the nearley and/or moo documentation! :-))
Let's write you a tokenizer, first:
const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  key: /\\[a-zA-Z]+/,
  space: / +/,
  contents: /.+$/,
})
The backslash-keys will look like {type: 'key', text: '\definition'}.
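You can sanity-check this in isolation; a quick sketch, with a made-up sample string:

lexer.reset('\\definition a formal statement of meaning');
console.log(lexer.next()); // → { type: 'key', value: '\\definition', text: '\\definition', ... }
console.log(lexer.next()); // → the 'space' token
console.log(lexer.next()); // → the 'contents' token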
Perhaps the part you're missing is that in nearley postprocessors, you get the entire Token object? So you can pull out the token's value in a postprocessor.
field -> %key %space %contents {% ([key, _, contents]) => ({key: key.text, contents: contents.text}) %}
It seems to me that just combining these {key, contents} pairs into a single object will get you most of the way there...
record -> field {% ([{key, contents}]) => ({[key]: contents}) %}
  | record %nl field {% ([record, _, {key, contents}]) => Object.assign({}, record, {[key]: contents}) %}
(Apologies for my abuse of ES6 features!)
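To sketch how you might drive this end-to-end (file names and the sample input are made up; this assumes the tokenizer above is hooked up with @lexer lexer and that record is the start rule):

const nearley = require('nearley');
const grammar = require('./grammar.js'); // e.g. from: nearleyc grammar.ne -o grammar.js

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar));
parser.feed('\\lx hello\n\\ps exclamation');
console.log(parser.results[0]);
// → { '\\lx': 'hello', '\\ps': 'exclamation' }
// (the keys keep their backslash, since key.text includes it)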
Final note that might help, from the tokenizer docs:
Note that when using a tokenizer, raw strings match full tokens parsed by Moo. This is convenient for matching keywords.
ifStatement -> "if" condition "then" block
Feel free to ask more questions here. Thanks for such a well-written issue, sorry for my confusing response :-)
Thanks for the hints, @tjvr!
I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself
So, this is the part where you're confused. (I'm not entirely sure where you've got this idea from, so I'm not sure how we can fix the nearley and/or moo documentation! :-))
Entirely from my imagination, I assume 😆. I think my issue is somewhat similar to what I've seen on #274. I was/am thinking that what I want is to be able to use certain types of tokens as non-terminals in the Nearley syntax. Using @Hardmath123's response from #274, here's my attempt at doing it:
@lexer lexer
dictionary -> record:+ {% function(data) {return { "dictionary" : data[0] }} %}
record -> lexeme partOfSpeech:+ {% prettify_record %}
lexeme -> %nl:* %lx {% ([nl, token]) => value_only(token) %}
partOfSpeech -> %nl %ps {% ([nl, token]) => value_only(token) %}
@{%
const moo = require('moo');

const lexer = moo.compile({
  nl: { match: '\n', lineBreaks: true },
  data: { match: /\\[a-zA-Z]+\s[^\n]+/, lineBreaks: true }
});

// Strip the backslash code, leaving only the field's contents
const value_only = function(data_string) {
  return data_string.value.replace(/^\\[a-z]+\s/, "")
}

// Build a custom token matcher that tests for a particular backslash code
const match_code = function(test_code) {
  return {
    test: function(token) {
      const match_result = token.value.match(/\\([a-z]+)/);
      if (match_result == null) {
        return false
      } else {
        return match_result[1] == test_code
      }
    }
  };
}

const lx = match_code("lx");
const ps = match_code("ps");

const prettify_record = function(data) {
  return {
    "lexeme": data[0],
    "partOfSpeech": data[1]
  }
}
%}
Given the data:
\lx hello
\ps exclamation
\lx record
\ps noun
\ps verb
the grammar above produces:
{
"dictionary": [
{ "lexeme": "hello", "partOfSpeech": ["exclamation"] },
{ "lexeme": "record", "partOfSpeech": ["noun", "verb"] }
]
}
The main aim is to create/infer hierarchical structures from a flat listing of key-value pairs, and these hierarchical structures (as you can imagine) can get quite complex quite quickly; e.g. a word with multiple senses, like to record (the verb) vs. a record (the noun), each having multiple examples, and further embedded elements. I was thinking it's best to do the hierarchy building in the regular Nearley processing bit (as opposed to within the post-processors). I could be entirely wrong.
But like you've said in the other thread @tjvr, there might be "no clever way of doing this" (at least for now?).
Thanks!
I don't think #274 is relevant here.
Do you care in what order the fields appear? I assumed you didn't, which perhaps is why my initial response wasn't very helpful...
be able to use certain types of tokens as non-terminals in the Nearley syntax
Right, that's what the part I quoted from the docs is talking about. Clearly we should explain that better! The idea is that you can write "\\lx" in your grammar, to match a token with the text \lx.
lexeme -> "\\lx" %space %contents
You should split the contents from the backslash-key in your tokenizer, not in a postprocessor (something like what I did above). You could always include the space in the key, if you like:
const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  key: /\\[a-zA-Z]+ /,
  contents: /.+$/,
})
lexeme -> "\\lx " %contents
I wouldn't recommend it in this case, but if you're really keen to have the type of a token change based on its value, you can always use Moo's keyword feature.
const lexer = moo.compile({
  nl: {match: '\n', lineBreaks: true},
  unknownKey: {match: /\\[a-zA-Z]+ /, keywords: {
    lexeme: '\\lx ',
    partOfSpeech: '\\ps ',
  }},
  contents: /.+$/,
})
...and then you can match against e.g. %lexeme or %partOfSpeech in your grammar. Keys not defined in the keywords object above will get the type %unknownKey, which will, helpfully, cause a syntax error (unless you match %unknownKey in your grammar, of course).
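So, roughly (a sketch, assuming the keyword lookup behaves as described, with made-up input):

lexer.reset('\\lx Red\n\\zz Oops\n');
for (const token of lexer) console.log(token.type, JSON.stringify(token.value));
// lexeme "\\lx "
// contents "Red"
// nl "\n"
// unknownKey "\\zz "
// contents "Oops"
// nl "\n"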
I don't think #274 is relevant here.
Do you care in what order the fields appear? I assumed you didn't, which perhaps is why my initial response wasn't very helpful...
I do, actually! I should have been more explicit about this, given that it's one of the key aims (to enforce that the project linguists enter the data in a certain order).
be able to use certain types of tokens as non-terminals in the Nearley syntax
Right, that's what the part I quoted in the docs is talking about. Clearly we should explain that better! The idea is that you can write "\lx" in your grammar, to match a token with the text \lx.
Ah, that part makes sense now: it means I can mix literals in "..." and tokenised objects from the lexer %... in my grammar. I had thought I was obliged to use items from (and only from) the tokeniser output. Much clearer now!
This worked out really great, thanks! A bonus for us is also that the .ne file itself can be verbose enough for it to act as a data entry guide for the linguists, i.e. "a record is made up of the lexeme, and one or more parts of speech, each of which must be a valid part of speech: noun, etc.".

If it helps to demo, please feel free to use my current grammar/test data, where lexeme and partOfSpeech make use of raw strings ("\\lx ", "\\ps "), tokens from Moo (%newline), and also other Nearley (non-)terminals (validPartOfSpeech).

grammar.ne:
@lexer lexer

dictionary ->
  record:+

record ->
  lexeme
  partOfSpeech:+

lexeme ->
  %newline:* "\\lx " %contents

partOfSpeech ->
  %newline "\\ps " validPartOfSpeech

validPartOfSpeech ->
  "Adverb" |
  "Adjective" |
  "Noun" |
  "Verb"

@{%
const moo = require('moo');

const lexer = moo.compile({
  newline: { match: '\n', lineBreaks: true },
  key: /\\[a-zA-Z]+ /,
  contents: /.+$/
});
%}
dictionary.txt:
\lx Red
\ps Adjective
\lx Record
\ps Noun
\ps Verb
\lx All
\ps Adjective
\ps Adverb
\ps Noun
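For reference, I compile and test with something like the following (a sketch; adjust flags as needed):

nearleyc grammar.ne -o grammar.js
nearley-test grammar.js --input "$(cat dictionary.txt)"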
I wouldn't recommend it in this case, but if you're really keen to have the type of a token change based on its value, you can always use Moo's keyword feature.
Just out of curiosity/to learn more about Moo, what would be the disadvantage of using the keyword feature (especially if %unknownKey, rightfully, fails with a syntax error and tells the user about it)?
what would be the disadvantage of using the keyword feature
Isn't one, really; they're different ways of doing things.
It just seems to me that in this case, it would be more effort to have to define all your keywords up-front in the tokenizer, as well as in the grammar!
Thanks for taking the time to write a detailed description of your problem; it's much easier to help people when they help us. :-) We should definitely document tokenizers better, and your feedback should help with that.
Glad I could help :-)
Just one [more] issue!
The grammar above works just fine with nearley-test in the terminal, but not in the browser. Do you have any idea why this might be? Am I using the feed() function incorrectly (c.f. the CodePen demo link)?

(Screenshots: nearley-test output in the terminal, and the Unexpected "\\" error in the Chrome JavaScript console.)
Hi @fauxneticien!
The catch is that you need to tell nearley that you're using a lexer. The recommended way to do this is by instantiating a Parser this way:
var p = new nearley.Parser(
  nearley.Grammar.fromCompiled(grammar),
  // {keepHistory: true} if you want
);
In fact, fromCompiled is how we want everyone to be instantiating Parsers. The "grammar.ParserRules, grammar.ParserStart" idiom is a vestige from before @tjvr came and implemented lexer support in nearley. Now it's only supported for the sake of backwards compatibility; importantly, it does not tell nearley about the existence of the lexer. So, in your CodePen, nearley tries to parse your string scannerlessly and gets confused when it sees a stream of characters instead of moo tokens.

By the way, the reason your example works in nearley-test is that nearley-test uses fromCompiled internally. So I don't think this is a browser-vs.-node issue. I forked your CodePen; is this what you were expecting? https://codepen.io/anon/pen/eEjjzJ?editors=1010
One small note: parser.feed() doesn't return the results, since you can call .feed() multiple times (nearley is a streaming parser!). To actually read the results, look at the "parser.results" property. See here for more information: https://nearley.js.org/docs/parser
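For example (a minimal sketch, reusing p from above):

p.feed('\\lx Red\n\\ps Adjective');
console.log(p.results); // an array of parses; exactly one element if the grammar is unambiguous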
Hope this helps! :-)
In fact, fromCompiled is how we want everyone to be instantiating Parsers. The "grammar.ParserRules, grammar.ParserStart" idiom is a vestige from before @tjvr came and implemented lexer support in nearley.
Ah, gotcha—thanks @Hardmath123! Everything's working as expected now.
Hi all,
I was hoping to be pointed in the right direction of using Nearley for my particular use case (in order to grammar even more goodly).
Use case
I'm using Nearley to validate and parse a "sort of" domain-specific format called Toolbox format or "backslash-coded lexicons". I say "sort of", because there is a fair bit of project-to-project variation. I'm quite sure I'll have to write project-specific grammars, so I'm hoping to make the process as streamlined as possible.
What each dictionary project has in common, though, is that the dictionaries are written as lines of key-value pairs (the keys vary from project to project: \word, \headword, \lx for lexeme, etc.). A simple example of a two-word dictionary in such a format might look like:

\lx hello
\ps exclamation

\lx record
\ps noun
\ps verb

I've written a [very hacky] Nearley grammar for this.
Suppose an uncommented {% prettify_records %} in that grammar then returns a JSON object. With the data structured in JSON, we can run queries, or make derived output formats such as a searchable web version of the dictionary, or generate a printable PDF form using LaTeX (both through Hugo).
Question: where to from my basic/hacky grammar?
I think I understand that I probably should be using a lexer of some form to first process the key-value pairs. While I've looked at Moo and its documentation, I'm not quite sure how to think about lexing when the type of the token is meant to be derived from the data itself. As opposed to the tokenizer example in the docs (https://nearley.js.org/docs/tokenizers), where the lexer classifies the chunks into types like ws, number, or times, I'm guessing that for my use case I need to be able to derive these types from the backslash codes (e.g. '\partOfSpeech Adjective' => { "type" : "partOfSpeech", "value" : "Adjective" }), which I assume/hope will help simplify the grammar(s) I write. Any hints/help towards a more sound way of parsing such formats would be highly appreciated.
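For instance, a minimal sketch of the kind of classification I have in mind (the helper name is made up):

// Hypothetical helper: derive the token type from the backslash code itself
function classify(line) {
  const m = line.match(/^\\([a-zA-Z]+)\s+(.*)$/);
  return m && { type: m[1], value: m[2] };
}

classify('\\partOfSpeech Adjective');
// → { type: 'partOfSpeech', value: 'Adjective' }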
Thanks!