Ignoring contents of lines that aren't recognized

cmaughan commented 8 years ago

How would you ignore content that isn't recognized by a rule? Suppose I have something like this:

string : /"(\\.|[^"])*"/ ;
lang   : /^/ <string> /$/ ;

This should recognize quoted strings, but would fail on anything else. So how to define the language such that it collects or discards anything that isn't a string?

orangeduck commented 8 years ago

Perhaps you can just OR it with a catch-all and then check after the parse which one got captured?

lang : /^/ (<string> | /.*/) /$/
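Checking which alternative was captured usually comes down to looking at the tag on the resulting node. Here is a minimal sketch, assuming the grammar tags a match of the string rule as something like "string|regex" and the catch-all as plain "regex". The enum and the helper classify_tag are hypothetical names for illustration and are not part of mpc.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical classification of a parsed node. */
enum node_kind { NODE_STRING, NODE_OTHER };

/* Hypothetical helper: decide which alternative matched by looking
 * for the rule name inside the node's tag string.
 * Assumption: a string match carries "string" somewhere in its tag,
 * while the catch-all regex does not. */
enum node_kind classify_tag(const char *tag) {
    assert(tag != NULL);
    return strstr(tag, "string") != NULL ? NODE_STRING : NODE_OTHER;
}
```

With a real tree you would pass the tag of the child node that sits between the start and end anchors.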

cmaughan commented 8 years ago

That works I think, thanks. The combinators from grammar are a fantastic way to build an AST tree, I like it a lot. Two things that I haven't figured out:

How to discard items from the AST? Is that like an mpc_null? Is that something that would be easy to add? For example: num [null] : /[0-9]*/
How to stop skipping newlines for an occasional special rule? Turning on whitespace sensitivity isn't an option because it complicates everything else, but having a way to stop a rule at a newline would be cool. Something like: comment : "//" <any_chars> <match_eol>

cmaughan commented 8 years ago

(perhaps I can define special rules myself and reference them in the grammar?)

orangeduck commented 8 years ago

Hi @cmaughan,

Actually that is a great idea I didn't think of - to reference rules in the grammar specified via the normal parser combinator approach - probably that is the easiest way to do something like discarding contents.

For whitespace - you may be able to specify whitespace in a regex. Your previous example could be given as comment: /\\/\\/[^\\n]*\\n/. Essentially "two slashes, zero or more characters that are not a newline, then a newline".

cmaughan commented 8 years ago

Hi :) I'd wondered about using a regex like that; it seems to work, but doesn't capture 2 comments in my unit test: "// comment 1\n// comment 2". Maybe I'm missing something (or the grammar generation is still being a bit aggressive about discarding newlines behind my back). It also has the side effect of capturing the newline into the AST node. But I'll try making custom rules to catch these special cases - a useful thing to be able to do.

cmaughan commented 8 years ago

Oh, the 2 line thing was the final rule not checking for many ;)
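To illustrate why the rule has to be applied many times, here is a small sketch of the same comment pattern ("two slashes, non-newline characters, then a newline") matched in a loop. It uses POSIX regex.h as a stand-in, since mpc has its own regex engine whose behaviour may differ in details. The function name count_line_comments is made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Count back-to-back line comments at the start of `input`.
 * Each comment is "//", any run of non-newline characters, then a newline.
 * Returns -1 if the pattern fails to compile. */
int count_line_comments(const char *input) {
    regex_t re;
    regmatch_t m;
    int count = 0;

    assert(input != NULL);
    /* The \n escapes are real newline characters inside the pattern. */
    if (regcomp(&re, "^//[^\n]*\n", REG_EXTENDED) != 0) {
        return -1;
    }
    /* Matching once would consume only the first comment; the loop is
     * the equivalent of applying the rule "many" times. */
    while (regexec(&re, input, 1, &m, 0) == 0) {
        count++;
        input += m.rm_eo; /* continue right after the matched newline */
    }
    regfree(&re);
    return count;
}
```

Note that because the pattern requires a closing newline, a final comment with no trailing newline is not matched by it, so an input such as "// comment 1\n// comment 2" yields one match here rather than two.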

cmaughan commented 8 years ago

I tried defining a parser manually and adding it to the grammar, but I get a crash in the function below. I think this is probably because my mpc_new/mpc_define parser somehow doesn't have the AST info. The value of 'a' is an invalid pointer.

mpc_ast_t *mpc_ast_add_tag(mpc_ast_t *a, const char *t)

orangeduck commented 8 years ago

Hmm, perhaps you can post your code so I can see exactly what you tried?

cmaughan commented 8 years ago

Does this help? I just tried creating a recognizer for "#line" and referencing it in the linepragma line...

auto line = mpc_new("line");
mpc_define(line, mpc_oneof("#line"));
mpca_lang(MPCA_LANG_PREDICTIVE,
    R"(
      number        : /[0-9]+/ ;
      quoted_string : /"(\\.|[^"])*"/ ;
      linepragma    : <line> <number> <quoted_string> ;
      parser        : /^/ (<linepragma>)* /$/ ;
    )",
    line, number, quoted_string, linepragma, parser, NULL);

orangeduck commented 8 years ago

Sorry for the late reply. What is up with the R before the string and the () characters in the string? I also noticed auto, are you using C++ or something?

Did you mean to use mpc_oneof("#line")? That parser recognizes any single one of the characters in the string "#line", not the whole word.

Just wondering what your particular use case is as this code and grammar look a little strange.

cmaughan commented 8 years ago

Yes, I'm using C++11; the R prefix makes it a raw string literal (so you don't need to escape stuff, or make multiple lines of ""). auto is just a way for the compiler to deduce the type of 'line'. I guess I used 'oneof' incorrectly, but that doesn't change the fact that this crashes in mpc_ast_add_tag (with a bad pointer IIRC). It should still work, whatever the parser for 'line' does, right?

orangeduck commented 8 years ago

True - let me investigate this in a bit more detail at the weekend.

cmaughan commented 8 years ago

I might get there before you, since I'm getting to the point where I need it to work; will let you know if I have time to figure it out!

orangeduck commented 8 years ago

Hi,

Looks like you were right - the error is because line is a parser which returns a const char* - but mpca_lang expects all the input parsers to be returning mpc_ast_t*.

The fix is to make line into a parser which returns an AST with the thing it parses as the contents. This can be done using the mpcf_str_ast apply function. It is also worthwhile to give this returned tree a tag. Usually string literals (which I think is what you intended to parse) are tagged with string. So the only change here is to change the definition of line to the following:

mpc_define(line, mpca_tag(mpc_apply(mpc_sym("#line"), mpcf_str_ast), "string"));

Here mpc_sym is a string literal with trailing whitespace removed, mpcf_str_ast converts the string output by this parser to an ast, and mpca_tag tags this ast with the tag "string".

It isn't ideal, but in this case I think it was fine for mpc to crash as it is the programmer's responsibility to make sure all the expected parser input / output types match.

I've pushed an update to the repo with a new test in grammar.c if you want to see exactly how I got it working.

Hope this helps,

Dan

cmaughan commented 8 years ago

Thanks for investigating this, I think it makes sense! Is the 'string' tag the same thing that I'd see if I had a grammar statement like this: "string : /\"[a-z]/\" ". i.e. it's just the assigned tree tag? I can see how to apply this, so all good. I'm still trying to figure out if there's an easy way to check for a parse string and discard it from the AST tree automatically. For example:

int foo = 5;

Suppose I don't care about 'int' and '=' but it's part of the language spec. I want a simple parser that checks the grammar but discards the unwanted nodes when it builds the tree. Something like this: parser: "int"% <ident> '='% <num>

The "%" is like saying 'Require this, but don't put it in the AST tree'. I guess parsing the tree afterwards is a way to do that, but it seems like it would be convenient to automatically prune it as it is generated. Maybe with mpc_pass, or something?

The only other comment I have is that the error reporting is a bit vague and hard to follow. I often get something like 'expected ', or ', or '.....''. Which can be tricky! It might be useful to instead print the name of the grammar tags that were tried. Like 'expected <eol>, <string> or <number>'

Anyway, thanks for figuring it out!

orangeduck commented 8 years ago

Hi @cmaughan,

The "string" tag is more or less like that - really it is more like the automatic tags that get added by the grammar, e.g. if a rule was parsed with a regex it will automatically get the tag "regex" in the tags - but basically these are the same concepts.

Using the combinators, discarding some part of the input is typically done in the fold function. For example this parser parses the expression you mentioned (int foo = 5) and returns only the elements you wanted as a mpc_ast_t* and frees those elements not required (warning - I've not actually tested this code).

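static mpc_val_t *custom_fold(int n, mpc_val_t **xs) {
    /* xs holds the four results in order: "int", ident, "=", digits. */
    mpc_ast_t *r = mpc_ast_new("parser|>", "");
    (void) n; /* always 4 for this parser */
    /* Keep only the identifier and the number as child nodes. */
    mpc_ast_add_child(r, mpc_ast_new("ident", xs[1]));
    mpc_ast_add_child(r, mpc_ast_new("num", xs[3]));
    /* This relies on mpc_ast_new copying its contents string. */
    free(xs[0]); free(xs[1]); free(xs[2]); free(xs[3]);
    return r;
}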

mpc_parser_t *p = mpc_and(4, custom_fold,
    mpc_sym("int"),
    mpc_tok(mpc_ident()),
    mpc_sym("="),
    mpc_digits(),
    free, free, free);

So finally this parser p returns a mpc_ast_t* which you can reference directly from mpca_lang.

Probably the normal/natural way to do this is to prune the tree afterwards but I can see the advantage of pruning at parse time so let me think about what might be reasonable syntax to do so. Do you know if YACC/Bison supports this at all?
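For the prune-afterwards route, a sketch might look like the following. It uses a hypothetical minimal tree type (node_t) loosely modelled on the tag, contents and children fields of mpc_ast_t rather than the real mpc type, and it assumes the nodes to drop can be recognised by a substring of their tag. All of the function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal tree; NOT mpc's actual mpc_ast_t.
 * tag and contents are borrowed pointers and are never freed here. */
typedef struct node {
    const char *tag;
    const char *contents;
    int children_num;
    struct node **children;
} node_t;

/* Allocate a node with room for up to `cap` children. */
node_t *node_new(const char *tag, const char *contents, int cap) {
    node_t *n = malloc(sizeof *n);
    assert(n != NULL);
    n->tag = tag;
    n->contents = contents;
    n->children_num = 0;
    n->children = cap > 0 ? malloc((size_t)cap * sizeof *n->children) : NULL;
    assert(cap <= 0 || n->children != NULL);
    return n;
}

/* Append a child; the caller must not exceed the capacity given above. */
void node_add(node_t *parent, node_t *child) {
    parent->children[parent->children_num++] = child;
}

/* Free a node and everything beneath it. */
void node_free(node_t *n) {
    int i;
    if (n == NULL) return;
    for (i = 0; i < n->children_num; i++) node_free(n->children[i]);
    free(n->children);
    free(n);
}

/* Remove and free every descendant whose tag contains `unwanted`,
 * compacting the remaining children in place. */
void prune(node_t *n, const char *unwanted) {
    int i, kept = 0;
    assert(n != NULL && unwanted != NULL);
    for (i = 0; i < n->children_num; i++) {
        node_t *c = n->children[i];
        if (strstr(c->tag, unwanted) != NULL) {
            node_free(c);               /* drop this subtree */
        } else {
            prune(c, unwanted);         /* keep it, but clean below it */
            n->children[kept++] = c;
        }
    }
    n->children_num = kept;
}
```

For the int foo = 5 example, pruning the nodes for the "int" and "=" literals would leave just the identifier and the number under the root. Which tags those literal nodes actually carry depends on the grammar, so check the printed tree first.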

Regarding the error messages: this is actually already supported - you just need to write a human-readable name as a string in between the rule name and the colon, e.g.:

number "Number" : /-?[0-9]+/ ;

Dan

cmaughan commented 8 years ago

Thanks for the tip on error strings - that works well and makes things much clearer. Might be worth updating the samples so people know about it. I don't know if YACC/Bison support pruning the tree - I'd imagine so, but not sure. It's been a long time since I used those tools.

orangeduck / mpc

Ignoring contents of lines that aren't recognized #53