taocpp / PEGTL

Parsing Expression Grammar Template Library
Boost Software License 1.0

Support for custom tokens (e.g. from a lexer)? #250

Closed Raekye closed 3 years ago

Raekye commented 3 years ago

Hello,

I found this old Reddit post https://www.reddit.com/r/cpp/comments/8g920z/pegtl_parsing_expression_grammar_template_library/dya465n/ and was wondering if there's an update on this. I would like to parse Python code with PEGTL, which requires context-sensitive handling of whitespace. I've already got a lexer that properly outputs indents/dedents to denote blocks and ignores other whitespace, so I would like to just plug it into PEGTL.

I didn't find any documentation related to this so I'm guessing it's not implemented yet, but I was also wondering if there were any suggestions on how I should approach this using PEGTL?

Thanks!
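
For readers unfamiliar with the kind of lexer described above, here is a minimal sketch of how indent/dedent tokens can be produced with an indentation stack. It is not Raekye's lexer and does not use the PEGTL; the names `kind`, `token` and `lex_indentation` are hypothetical, and it assumes spaces-only indentation that always returns to a previously seen width.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token kinds for a Python-like language (not part of the PEGTL).
enum class kind { indent, dedent, newline, text };

struct token {
    kind k;
    std::string value;   // only meaningful for kind::text
    std::size_t line;    // 1-based line number
    std::size_t column;  // 1-based column number
};

// Converts leading whitespace into indent/dedent tokens.
// Assumptions: indentation uses spaces only, and a dedent always returns to a
// width that is already on the stack (inconsistent dedents are not diagnosed).
std::vector<token> lex_indentation(const std::string& source)
{
    std::vector<token> out;
    std::vector<std::size_t> stack{ 0 };  // widths of the currently open blocks
    std::istringstream in(source);
    std::string text;
    std::size_t line = 0;
    while (std::getline(in, text)) {
        ++line;
        const std::size_t width = text.find_first_not_of(' ');
        if (width == std::string::npos) {
            continue;  // blank line: no effect on block structure
        }
        if (width > stack.back()) {
            stack.push_back(width);
            out.push_back({ kind::indent, "", line, 1 });
        }
        while (width < stack.back()) {
            stack.pop_back();
            out.push_back({ kind::dedent, "", line, 1 });
        }
        out.push_back({ kind::text, text.substr(width), line, width + 1 });
        out.push_back({ kind::newline, "", line, text.size() + 1 });
    }
    while (stack.size() > 1) {  // close blocks still open at end of input
        stack.pop_back();
        out.push_back({ kind::dedent, "", line, 1 });
    }
    return out;
}
```

Each token carries its own line and column, which is the information a token-based parser would need later to report positions.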

d-frey commented 3 years ago

@ColinH You did some examples with indentation-aware languages/grammars IIRC. Could you elaborate on this?

ColinH commented 3 years ago

In the examples directory of the PEGTL, i.e. src/examples/pegtl, there are two small samples that might be useful to you. One is token_input.cpp, which shows how one can (ab)use the flexibility of the PEGTL to parse sequences of arbitrary objects rather than just bytes; the other is indent_aware.cpp, which shows how to set up simple indent-aware parsing.

Both are rather limited, in particular the first due to the PEGTL not (yet) being flexible enough in all places, most noteworthy being the position information that is always in terms of byte, line and column (though I wonder whether these could be made recoverable if the tokens contain this information), but it might be enough to get you going...

...and please feel free to share any progress and further questions that you might have, we are always interested in seeing use cases that push the boundaries of the current design space in order to better know in which direction and how exactly to continue development.

Raekye commented 3 years ago

I see, thanks! I'll give it a shot as soon as I have time. Are there any other potential challenges that come to mind? (just prodding, no worries if it's hard to say)

Edit: sorry if this is answered elsewhere or if I should create another issue for it, but is there a way to use PEGTL with exceptions disabled? I'm compiling my project with -fno-exceptions

ColinH commented 3 years ago

IIRC if you don't use any form of must, use an input that doesn't throw exceptions, and make sure your actions never throw, then everything should work with exceptions disabled.
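
To illustrate the difference between local failure and a must-style error, the following library-independent sketch reports failure through return values only, so it contains no `throw` and should compile with -fno-exceptions. None of the names (`cursor`, `match_digits`, `match_char`, `result`, `parse_list`) are PEGTL APIs; they are assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical minimal input: the data plus a read offset. Not a PEGTL class.
struct cursor {
    std::string data;
    std::size_t pos = 0;
};

// Local failure: returns false and leaves the cursor untouched.
bool match_digits(cursor& in) noexcept
{
    std::size_t p = in.pos;
    while (p < in.data.size() && std::isdigit(static_cast<unsigned char>(in.data[p]))) {
        ++p;
    }
    if (p == in.pos) {
        return false;
    }
    in.pos = p;
    return true;
}

// Consumes exactly one character c, or fails without consuming anything.
bool match_char(cursor& in, const char c) noexcept
{
    if (in.pos < in.data.size() && in.data[in.pos] == c) {
        ++in.pos;
        return true;
    }
    return false;
}

// Outcome of a parse: success flag plus the offset where parsing stopped.
struct result {
    bool ok;
    std::size_t error_at;
};

// Rule: digits ( ',' digits )* end-of-input.
// After a ',' a must-style rule would raise an error; here the offset of the
// failure is returned instead, so no exception is ever needed.
result parse_list(const std::string& text)
{
    cursor in{ text };
    if (!match_digits(in)) {
        return { false, in.pos };
    }
    while (match_char(in, ',')) {
        if (!match_digits(in)) {
            return { false, in.pos };
        }
    }
    if (in.pos != in.data.size()) {
        return { false, in.pos };
    }
    return { true, in.pos };
}
```

The trade-off is that every caller has to check the returned flag, which is the price of giving up exception-based error reporting.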

As for challenges, there will definitely be some, but you are welcome to keep us in the loop and we will see what we can do to support your adventure :-)

Raekye commented 3 years ago

I found that just including tao/pegtl.hpp pulls in cstream_reader.hpp which throws an exception, preventing me from compiling with -fno-exceptions. Is there a way to get around this?

ColinH commented 3 years ago

Quick fix: instead of including tao/pegtl.hpp, you can manually include every header it includes that does not throw.

@d-frey Since this is not the first time this has come up, we could consider adding a tao/pegtl_no_exceptions.hpp.

ColinH commented 3 years ago

We'll track the -fno-exceptions discussion and development in #251.

Raekye commented 3 years ago

> Both are rather limited, in particular the first due to the PEGTL not (yet) being flexible enough in all places, most noteworthy being the position information that is always in terms of byte, line and column (though I wonder whether these could be made recoverable if the tokens contain this information), but it might be enough to get you going...

I'm following the example token_input.cpp. I'm trying to use standard_trace, which seems to require that position() be defined for token_parse_input. I could get the byte/line/column info from the tokens from my lexer. I might have missed it, but I didn't find any documentation on the position class. Checking position.hpp, it seems I need to pass it a byte, line, column, and source, or an iterator with such fields. But I'm not sure where/when I should be updating it. Should I update it to the (new) current token whenever bump or restart get called? Anywhere else?

Edit: sorry if this isn't really idiomatic, but I'm curious if the dependency on std::filesystem can be removed as well if not using certain features? The install guide says:

> By default the PEGTL uses std::filesystem facilities for filenames...

I would think it would be used for the default input sources, but the Inputs and Parsing page says that the source is by default a string, and file_input uses a FILE*.

ColinH commented 3 years ago

> Should I update it to the (new) current token whenever bump or restart get called?

That should cover most, if not all, cases.
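
As a rough illustration of that approach, the sketch below keeps a cached position that is refreshed from the current token whenever `bump` or `restart` is called. It is a standalone stand-in, not the token_parse_input from token_input.cpp and not the PEGTL position class: `token`, `token_position` and `token_cursor` are hypothetical names, and it assumes that each token records its own byte, line and column and does not span multiple lines.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical lexer token that remembers where it started in the source text.
struct token {
    std::string text;
    std::size_t byte;
    std::size_t line;
    std::size_t column;
};

// Hypothetical position record with the four fields discussed above.
struct token_position {
    std::size_t byte;
    std::size_t line;
    std::size_t column;
    std::string source;
};

class token_cursor {
public:
    token_cursor(std::vector<token> tokens, std::string source)
        : tokens_(std::move(tokens)), source_(std::move(source))
    {
        update_position();
    }

    bool empty() const { return index_ == tokens_.size(); }
    const token& peek() const { return tokens_[index_]; }

    // Consumes count tokens (the caller must not move past the end),
    // then refreshes the cached position.
    void bump(const std::size_t count = 1)
    {
        index_ += count;
        update_position();
    }

    // Rewinds to the first token and refreshes the cached position.
    void restart()
    {
        index_ = 0;
        update_position();
    }

    const token_position& position() const { return position_; }

private:
    void update_position()
    {
        if (index_ < tokens_.size()) {
            const token& t = tokens_[index_];
            position_ = { t.byte, t.line, t.column, source_ };
        }
        else if (!tokens_.empty()) {
            const token& t = tokens_.back();  // end of input: just past the last token
            position_ = { t.byte + t.text.size(), t.line, t.column + t.text.size(), source_ };
        }
        else {
            position_ = { 0, 1, 1, source_ };  // empty input: start of the source
        }
    }

    std::vector<token> tokens_;
    std::string source_;
    std::size_t index_ = 0;
    token_position position_{};
};
```

If the parser can also rewind to an earlier token after a failed match, the cached position would presumably need refreshing there as well; computing the position on demand from the current index avoids that question entirely.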

> the source is by default a string

The source as stored or referenced by the inputs is a string because it is supposed to be more general than just filenames, though we probably need to revisit this and possibly change it to std::filesystem::path when appropriate.