Dynamic matching and parse tree

petersteneteg commented 6 years ago

Hi

I was trying to use the parse_tree functionality together with the dynamic matching. I tried your dynamic match example with parse_tree::parse. but was not able to compile. Firstly the long_literal_mark struct needs a analyze_t type. then I get errors about long_literal_id not having a apply or apply0.

Any suggestions?

ColinH commented 6 years ago

Can you give us the source to see exactly how you are trying to combine these two things?

petersteneteg commented 6 years ago

Sure

#include <tao/pegtl.hpp>
#include <tao/pegtl/contrib/parse_tree.hpp>

#include <cassert>

namespace pegtl = tao::TAO_PEGTL_NAMESPACE;

namespace dynamic {

struct long_literal_id : pegtl::plus<pegtl::not_one<'['> > {};
struct long_literal_open : pegtl::seq<pegtl::one<'['>, long_literal_id, pegtl::one<'['> > {};

struct long_literal_mark {
    using analyze_t = pegtl::analysis::generic<pegtl::analysis::rule_type::SEQ>;

    template <pegtl::apply_mode, pegtl::rewind_mode, template <typename...> class Action,
              template <typename...> class Control, typename Input>
    static bool match(Input& in, const std::string& id, const std::string&) {
        if (in.size(id.size()) >= id.size()) {
            if (std::memcmp(in.current(), id.data(), id.size()) == 0) {
                in.bump(id.size());
                return true;
            }
        }
        return false;
    }
};

struct long_literal_close : pegtl::seq<pegtl::one<']'>, long_literal_mark, pegtl::one<']'> > {};
struct long_literal_body : pegtl::any {};
struct grammar : pegtl::if_must<long_literal_open,
                                pegtl::until<long_literal_close, long_literal_body>, pegtl::eof> {};

template <typename Rule>
struct action : pegtl::nothing<Rule> {};

template <>
struct action<long_literal_id> {
    template <typename Input>
    static void apply(const Input& in, std::string& id, const std::string&) {
        id = in.string();
    }
};

template <>
struct action<long_literal_body> {
    template <typename Input>
    static void apply(const Input& in, const std::string&, std::string& body) {
        body += in.string();
    }
};

template <typename Rule>
using selector =
    pegtl::parse_tree::selector<Rule, 
        pegtl::parse_tree::apply_store_content::to<
            long_literal_open,
            long_literal_body,
            long_literal_close
        >
    >;

}  // namespace dynamic

int main( int argc, char** argv ) {

    std::string id;
    std::string body;

    pegtl::string_input<> in("[foo[test]foo]", "");

    // Fails to compile
    auto root =
        pegtl::parse_tree::parse<dynamic::grammar, dynamic::selector, dynamic::action>(in, id, body);

    assert(root != nullptr)
    assert(id == "foo");
    assert(body == "test");
}

ColinH commented 6 years ago

What I can see is a missing semicolon in the first of the three assert()s at the end of main, and that while within parse_tree::parse() the Action's apply methods are "shielded" from the additional state argument introduced by parse_tree::parse(), the match methods aren't, so long_literal_mark::match() needs to accept and ignore an arbitrary additional first function argument. But that still doesn't make it compile, there's something else going on here...

d-frey commented 6 years ago

That wasn't trivial to diagnose at all, but I found two problems in the parse tree support that I was able to fix.

One problem remains: The apply/apply0-methods will see the states you expect, the match-method will see an additional first state that the parse tree uses internally. I currently don't see any way to hide it. This means that after you update to the current HEAD and you added the missing semicolon after the assert() as Colin mentioned, you still need to make one more change. You need to change the signature of the match method and add a dummy parameter:

template< pegtl::apply_mode,
          pegtl::rewind_mode,
          template <typename...> class Action,
          template <typename...> class Control,
          typename Dummy,
          typename Input >
static bool match( Input& in, const Dummy&, const std::string& id, const std::string& ) {
  // ...
}

petersteneteg commented 6 years ago

Great! Thanks for the quite answer and fixes. This should solve my problems.

But I noticed that if I add the said extra dummy parameter above. Using the regular parse function

pegtl::parse<dynamic::grammar, dynamic::action>(in, id, body);

will no longer work.

d-frey commented 6 years ago

Indeed, this is why I left the issue open and I still hope that I can find a solution for it. But this is not easy, if you have any ideas, please let me know. :)

petersteneteg commented 6 years ago

For now I just added both overload of match, with and without the dummy, which seems to work fine. But a completely non-invasive solution would of course be better...

While playing with this I also tried using intrusive switching https://github.com/taocpp/PEGTL/blob/master/doc/Actions-and-States.md#intrusive-switching To be able to add the needed actions and states to the grammar to do the dynamic matching. Which would be nice since it would encapsulate all the grammar related things. But I am not able to reset the action/states to the previous ones for the "inside of the match" ie the long_literal_body here. Is there a way to do this?

d-frey commented 6 years ago

I've committed a change that solves the first problem, the internal state is now hidden and won't be visible to you in the parameter list.

I'm not sure about the intrusive switching, it might now also work or not - please give it a try and if there is still some problem, a small example would be helpful.

mkrupcale commented 6 years ago

I'm not sure moving the internal state to a raw, type-erased pointer inside of the Input argument rather than as an extra States... argument in order to try to hide it from the Control and Rule function signatures is a good solution. This solution has rather strange interpretations: what is this "internal" state and why is it inside of the Input class? Furthermore, it has consequences outside of just the parse_tree code since it will increase the size of these Input structs by the size of a pointer, regardless of whether or not parse_tree is used or any other notion of "internal" state exists.

Instead, I think it's appropriate for consistency between non-parse_tree and parse_tree code to continue to pass this extra "internal" state parameter in parse_tree (like commit d02c14107992cdc0edb558acb4b609d408cf2e8a and before) and require that custom parse_tree Control< Rule >::match and Rule::match functions be prepared to accept this extra parameter. I would argue that the match functions are special in that internal::duseltronik::match is really central to how all of PEGTL works. Furthermore, customizing the Control< Rule >::match is considered advanced[1] and custom Rules which accept States... are considered complex[2]. Therefore, if a user chooses to customize Control< Rule >::match or Rule::match, such customization is relatively advanced usage of PEGTL and requires understanding by the user that he or she must accept whatever States... are passed to the parse function, including those from the internal implementation details of parse_tree.

For now I just added both overload of match, with and without the dummy, which seems to work fine. But a completely non-invasive solution would of course be better...

Thus, this seems like the cleanest solution in my mind. Yes, it requires you to know that an internal state parameter must be accepted for parse_tree, but this is not so different than how you would use custom Rule::match with parse function alone. It may not be obvious that you need this parameter as well at first with parse_tree, but this might be better resolved with documentation than trying to work around it in the implementation.

Note that any custom Control< Rule >::(start|success|failure|raise) functions need not accept this additional internal state parameter because the parse_tree::internal::make_control methods which extend those of the user-provided Control do not pass this internal parameter down to the user. Also, my PR #127 will ensure that neither external or internal (to the Node) Action< Rule >::(apply|apply0) functions will require this internal state parameter. On that note, I had rebased my work on top of 804d2cb65e1a2f6de47ea7dad915a9d511bfee94, but with the moving target of parse_tree, I haven't had time to complete this.

[1] Advanced Control [2] Complex Rules

d-frey commented 6 years ago

I'm not worried about an extra pointer in the input classes, they are not constructed/destructed during a parsing run (unlike a marker, etc.).

I also think that from a user's point of view building a parse tree should be as consistent and as noninvasive as possible, the parse tree state is an internal detail. Hiding it in the input class leaves a lot of general rules intact that are otherwise broken, which previously lead to all sort of hacks and work arounds.

If you look at the commit where I added the internal state, it also simplifies the control class of the parse tree. And, importantly, it allows a single grammar to be used for both basic parsing (possibly with other actions) without rules and actions knowing about the parse-tree-state, and for a parsing run that builds a parse tree with everything that is connected.

Concerning PR #127: I already suggested that you create a separate implementation so we can compare them properly and you are independent of the current "moving target". Maybe both ways have their pros and cons and could live side-by-side.

mkrupcale commented 6 years ago

I'm not worried about an extra pointer in the input classes, they are not constructed/destructed during a parsing run (unlike a marker, etc.).

Fair enough, but the concept of the Input classes containing an internal state variable does not make much sense since this internal state has nothing to do with the Input class at all, and it (currently) only applies to parse_tree code.

I also think that from a user's point of view building a parse tree should be as consistent and as noninvasive as possible

It is more consistent with previous custom match implementations when one considers that the match function must accept all States... (at least for complex rules--simple rules need not worry about them at all). I'm not entirely sure what noninvasive means here because either way one must implement the custom match function to perform advanced matching, and using a Rule in a parse_tree::parse context is not the same as a default PEGTL parse because the expected output is different. Even so, the parse_tree Rule's match function need only call the non-parse_tree Rule's match function if they do not need access to the parse_tree state, and there will be no duplication of code.

the parse tree state is an internal detail.

I'm not sure it's fair to consider the state an internal detail when implementing custom match, which as always must accept all States.... Sure the implementation of the state is an internal detail to the parse_tree, but the fact that it is passed to match is not so much an internal detail. This is apparent in the implementation of parse_tree::parse and can be simply documented in the usage of parse_tree since it's not expected to change how many internal state parameters are required.

Hiding it in the input class leaves a lot of general rules intact that are otherwise broken, which previously lead to all sort of hacks and work arounds.

I'd consider hiding it in the Input class a hack of its own. What hacks were previously necessary with parse_tree? What general rules are broken without hiding the parse_tree state? I've used parse_tree with many different general rules without issue, and on a quick perusal, I don't see any built-in rules which expect a specific number of States.... Or do you mean custom, general rules external to PEGTL?

If you look at the commit where I added the internal state, it also simplifies the control class of the parse tree.

You mean by removing the implementation of start, success, failure, etc. functions for the make_control specialization when the Rule Node should not be stored and it is a leaf or parent of unselected leaves? This could already be done prior to 0346e101e0b5e101c8295bc6a17e922d3c4d1d2d.

Or do you mean simplified relative to 057db2cec87ceb47bd8fef36257a9e48fc43cd90..d02c14107992cdc0edb558acb4b609d408cf2e8a? I don't really understand why these commits were even necessary because as far as I can tell they only implement the match function just like the normal Control while still passing the internal state argument. Wouldn't this already be achieved by using the normal match function, which is the default? Is not Control2 template template parameter the same as Control? Doesn't this just come from parse,

template< typename Rule,
          template< typename... > class Action = nothing,
          template< typename... > class Control = normal,
          apply_mode A = apply_mode::ACTION,
          rewind_mode M = rewind_mode::REQUIRED,
          typename Input,
          typename... States >
bool parse( Input&& in, States&&... st )
{
   return Control< Rule >::template match< A, M, Action, Control >( in, st... );
}

and thus Control = Control2 = make_control from parse_tree?

And, importantly, it allows a single grammar to be used for both basic parsing (possibly with other actions) without rules and actions knowing about the parse-tree-state, and for a parsing run that builds a parse tree with everything that is connected.

Actions already do not need to know about the parse_tree state, both prior to 057db2cec87ceb47bd8fef36257a9e48fc43cd90 and in my own implementation of actions. It is only the match function which is special, and as I've argued, it's a rather advanced yet integral part of PEGTL which I think requires some awareness by the (advanced) user that tries to write custom Rule or Control match. There is nothing preventing a single grammar from being used for both basic parsing and parsing with parse_tree--all that is required is to overload the match function inside of any custom Rules which will be used in both. For simple Rules, no overload is even necessary, and it will work with both.

Concerning PR #127...

Okay, will do.

Possible alternative solution

One alternative solution which could make this explicit overload unnecessary is to instead move the parse_tree internal state argument to the end, after the user-provided States..., i.e.

template< typename Rule,
          typename Node,
          template< typename... > class Selector = internal::store_all,
          template< typename... > class Action = nothing,
          template< typename... > class Control = normal,
          typename Input,
          typename... States >
std::unique_ptr< Node > parse( Input&& in, States&&... st )
{
   internal::state< Node > state;
   if( !TAO_PEGTL_NAMESPACE::parse< Rule, Action, internal::make_control< Selector, Control >::template type >( in, st..., state ) ) {
      return nullptr;
   }
...
}

Obviously also the order would have to be updated in make_control functions, but this is an internal detail which shouldn't be a problem.

Then the user-defined custom match function which accepts an exact number of arguments need only provide a template parameter ExtraStates... at the end of the match function to take any additional arguments not needed by the user.

So instead of

struct complex_custom_rule
{
   template< pegtl::apply_mode,
             pegtl::rewind_mode,
             template < typename... > class Action,
             template < typename... > class Control,
             typename InternalState,
             typename Input >
   static bool match( Input& in, const InternalState&, const std::string& id, const std::string& ) {...}
}

the user would only write the templated match function

struct complex_custom_rule
{
   template< pegtl::apply_mode,
             pegtl::rewind_mode,
             template < typename... > class Action,
             template < typename... > class Control,
             typename Input,
             typename... ExtraStates >
   static bool match( Input& in, const std::string& id, const std::string&, ExtraStates&&... ) {...}
}

which should work with both basic parse and parse_tree parse. Simple rules again will still work with both.

petersteneteg commented 6 years ago

Hi, interesting discussion. I would have to agree with @mkrupcale I think adding the state to the Input is a, albeit easy, not very nice solution. As a user I think it is reasonable to think that one should be a able to implement the same parse_tree functionality your self using the existing pegtl api. And in that case adding stuff into the input is not really an option. (guess one could do ones own input classes, but that is not very nice either) And even if I could reuse the same void pointer inside of the input, I would never look for a state pointer in the input classes...

On the other hand I think that the use case above with the dynamic matching. in this case for matching [foo[...]foo] is not really a very advanced concept and seems like a common thing to do. So preferable it should work easily out of the box. From this prespective I think the last solution by mkrupcale with the extra state in the end is quite nice and unintrusive.

d-frey commented 6 years ago

I talked to @ColinH about it and we agree with both of you. I'll commit a new version shortly that will append the state at the end and will no longer try to hide it. As a consequence, however, all apply/apply0/match methods (except the "simple" match method) must now accept additional state parameters beyond what the user intends to provide to parse. We think that this is an acceptable price and will allow the rest of the implementation to be straight-forward and consistent.

taocpp / PEGTL

Dynamic matching and parse tree #129

Possible alternative solution