taocpp / PEGTL

Parsing Expression Grammar Template Library
Boost Software License 1.0
1.94k stars 228 forks source link

Transforming siblings with parse_tree #93

Closed engelant closed 5 years ago

engelant commented 6 years ago

I want to have a parser that translates the Syntax introduced in Bryan Ford's paper to translate to PEGTL, as I find this Syntax very readable. To do so I try to utilize the parse_tree.h provided with this library. My approch was to use a cutom node struct, that implements a to_String method like this:

struct node {
        std::vector< std::unique_ptr< node > > children;
        std::string str_before, str_after;

        std::string to_String()
        {
            std::string result = str_before;
            for(auto it = children.begin(); it != children.end();) {
                std::string child_str = it->get()->to_String();
                result += child_str;
                if(++it != children.end() && child_str != "") {
                    result += ", ";
                }
            }
            return result += str_after;
        }
...
};

To make the actual magic happen I thought to use the transform function like so:

    template< typename > struct store : std::false_type {
    };

    template<> struct store< Sequence > : std::true_type {
        static void transform(std::unique_ptr< PEG2PEGTL::node >& n)
        {
            n.get()->str_before = "sor< ";
            n.get()->str_after = " >";
        }
    };

And it works for me, until it gets a little tricky, since I would need to nest sibling inside each other, if I have a rule like this:

struct Element : seq< opt< Prefix >, Primary, opt< Suffix > > {
    };

If only I could access Element's children from Prefix and/or Suffix, and move them inside itself. I suppose I could do this by using the node's id_ and move the logic inside the Element's transform function. Is that the intended usage? How far off the intended usage am I with this approach?

d-frey commented 6 years ago

It's an interesting idea. The conceptual problem is accessing a parent from within a child. I very deliberately did not allow this, as I think it might be too confusing and it is hard to make the generic part of the code properly handle all possible transformations from that.

I'd suggest another approach which might work in the existing system: Concentrate on each node and see what you can do locally with it. If Prefix matches, it will be added as a child to opt< Prefix >. Now the transform-method of opt<> could check if it has children and replace itself with the child (or a sequence of children). If it contains nothing, it could delete itself (by calling n.reset()). That way you either end up with the child/sequence of children added to the parent or nothing being added to the parent. Would that work for you? (Note I haven't tried it out, I'm just speculating)

Also, I might be missing your point completely, I'm not exactly sure I understand what you are trying to do. Do you have a repository on GitHub I might look at to see the actual code?

engelant commented 6 years ago

So I just put it in a git repo https://github.com/engelant/PEG2PEGTL . As I described above, If I have the rule:

struct Element : seq< opt< Prefix >, Primary, opt< Suffix > > {
    };

Since I intend the recursive to_String() function to return just a string, that then is stuffed into it's parent's to_String(), I need to rearrange that Primary is a child of Suffix which again is a child of Prefix (If they exist). To directly bind It to a template transform function, where I automaticly know the Type (Prefix, Suffix, Primary) I would need to access the siblings (or it's parent's children).

On the other hand I could identify the children of Element by id_ in runtime.

Identifing Children by their count wouldn't work, as I can't differenciate between Prefix Primary / Primary Suffix.

d-frey commented 6 years ago

I looked at your code and I think that transforming the parse tree is not the right choice for you. What you'd like to do is generating the string depending on your own node type. Essentially you'd need something like

std::string to_string() const
{
  switch( id_ ) {
    case typeid( Element ):
      return to_string_element();
    //...
  }
}

(the above is pseudo-code, you'd need something that actually works)

Each string generator would than check what children it has and how to produce a reasonable string representation for them, calling their to_string()-methods as needed. You could even use the provided node class from contrib, have no transform()-methods and just one freestanding to_string()(forwarding to the type-dependent to_string()-variants.

In fact, I'll play with this idea to see if I can replace/improve our src/example/pegtl/abnf2pegtl.cpp example.

engelant commented 6 years ago

I sort of guessed to do something like this, since there must be a certain concept of usage behind the given parse_tree concept. Just wanted to make sure I'm on the same page. Hopefully I can just add this to the contrib, when it's done.

d-frey commented 6 years ago

I added a new example abnf2pegtl2 which uses the parse tree and simplifies the existing example abnf2pegtl. It is not yet complete and I think it can be further refactored, but it is a good start already, I like it a lot more than the old example. Thanks for the idea! Let me know what you think of it (it is still a bit hackish, I'm aware of that :D).

engelant commented 6 years ago

So, I had a closer look at your new code and I generally like the idea. Maybe I'll try to explain my usage goal and throw in a few ideas, since at some (admittedly early) point my knowledge with template programming is exhaused.

I imagine using a PEG2PEGTL parser/transpiler: peg2pegtl example.peg --namespace=example --out=example_pegtl_rules.h

which creates a file like this:

/** DO NOT EDIT THIS GENERATED FILE
*** EDIT SOURCE FILE example.peg instead
**/

#ifndef EXAMPLE_PEGTL_RULES_H
#define EXAMPLE_PEGTL_RULES_H

#include <tao/pegtl.hpp>

namespace PEG2PEGTL {
    using namespace tao::TAOCPP_PEGTL_NAMESPACE;

    struct LF;
    struct CR;

    struct LF : utf8::one<0x000A> {};
    struct CR : utf8::one<0x000D> {};
}
#endif

This rule file would then only contain the rules and most importantly can be built automaticly by the Makefile/IDE before the actual build process, without compromising the logic of the parser. This is also a good point to introduce UTF Identifier support (e.g [:Lu:]), which depends on the listas availible from the unicode page, to transform these into PEGTL range<>, but that's not for now, at least for me. How to define unicode ranges by properties/identifiers in c++ for PEGTL

#include <tao/pegtl/contrib/parse_tree.hpp>
#include "example_pegtl_rules.h"

namespace PEG2PEGTL {
    using namespace tao::TAOCPP_PEGTL_NAMESPACE;

    template< typename > struct store : std::false_type {};
    template<> struct store<LF> {};
}

int main() {
    ...
}

I suppose It also wouldn't be to bad of an idea to introduce an additional parse_tree_utils.h (and .cpp), which provides some default tree debug functions like print().

Speaking of helper functions, since you described a "look at the children" approach, I think it might make things easier to introduce some member functions like:

        std::vector< std::unique_ptr< node > >::iterator find_child(const std::type_info *id_, int skip) {
            this_is_a_unique_name = 100;
            std::vector< std::unique_ptr< node > >::iterator it;
            for (it = children.begin(); it != children.end(); ++it) {
                if (it->get()->id_ == id_) {
                    if (skip == 0) {
                        break;
                    } else {
                        --skip;
                    }
                }
            }
            return it;
        }

        std::vector< std::unique_ptr< node > >::iterator find_child(const std::type_info *id_) {
            return find_child(id_, 0);
        }

        bool child_has_type(const std::type_info *id_, std::vector< std::unique_ptr< node > >::size_type n) {
        return children.at(n).get()->id_ == id_;
        }

What gives te possibility to call child_has_type( &typeid( LF ), 1 ); Or maybe even reduce that with some #define lambda magic to child_has_type( LF, 1 );

Last but not least I thought of anonymous struct inheritance, but this definitly is way over my head, maybe it's not possible, maybe you can writ this down from the top of your head. Idea is to add a virtual method to the node class, e.g.

template<typename T>
void generate(T *t) {
  for (auto child in children) { //Forgive me
    child.generate(t);
  }
}

Then, and this I don't know exactly how or if to do, inside the specialisation of the store struct<LF> define something, that if it's set insted of creating the node from PEGTL::node it anonymously inherits from PEGTL::node, overwriting the virtual method and casting the pointer type to the inherited, anonymous node to the PEGTL::node class.

I'm looking very much forward to your oppinion on my suggestions.

d-frey commented 6 years ago

I'll leave aside the SO question about Unicode categories for now, I hope I'll find an answer for that in the next days and I'll answer it on SO.

For the rest, let's go through it point by point:

I imagined the use of abnf2pegtl in such a way that you'd have an input, say foo.abnf. Your build system would contain a rule to convert it from grammar.abnf to grammar.hpp. You'd include it from another header foo.hpp which looks like this:

#ifndef EXAMPLE_PEGTL_RULES_H
#define EXAMPLE_PEGTL_RULES_H

#include <tao/pegtl.hpp>

namespace PEG2PEGTL {
    using namespace tao::TAOCPP_PEGTL_NAMESPACE;
#include "grammar.hpp"
}
#endif

That way your header foo.hpp could contain anything, including (nested) namespaces, using directives, etc.

This is the reason why abnf2pegtl only emits the rules and the forward declarations, everything else can (and IMHO should) be kept separated.

Having some parse tree utility functions available for simple inclusion/usage sounds like a good idea, I'll think about it.

To improve support for accessing the children more directly, I was thinking about adding an index operator to the node. With this, you could use (*n)[1]->is< SomeType >(), or, as we always have nodes wrapped in std::unique_ptr, the easier to use at(), something like n->at(1)->is< SomeType >(). In fact that seems good enough to add it now, I'll commit it in a minute.

Finally, having some kind of callback which is defined per type would be nice, but I'm currently not sure how to do this. Maybe I'll have an idea later, we'll see.

engelant commented 6 years ago

I wasn't able to find a unicode library in c++, where I could query an Identefier, get back a list of its characters. On the other side, I never was a big fan of unicode variable/object/identifier names, so I dropped interest. For strings PEGTL has everything required, and identefiers better be ascii anyway. Just was curious in terms of completeness (e.g. ECMA16 specifies Identifier <- [:ID_Start:][:ID_Continue:]* ).

Your way of only including the generated rules inside the right namespace seems more flexible, I'll go with that.

Looking forward to try out that new bad boy of operator and maybe get a feeling of what else might come in frequently/handy.

By defined per type you mean per Rule, right?

skyrich62 commented 6 years ago

perhaps:

std::map<std::type_info, std::function<void(node*)>>

Then populate the map with your callbacks, which could be lambdas or whatever. The map could even be populated automatically.

d-frey commented 6 years ago

What I would look for is a library where I can check if a Unicode code point is within a category. That would make the rest almost trivial.

And yes, by type I mean a Rule.

Signing off for today, I really need to get some sleep :)

d-frey commented 6 years ago

FYI: I answered your StackOverflow question.

Also, I refactored the abnf2pegtl2 example a bit to support incremental alternations (=/). This lead to an interesting way to define your own node class with an overridden emplace_back-method. The latter has access to both the current node that will be inserted as well as access to the parent node, hence allowing you to modify the parent node (and it's content).

No per-Rule separation yet, though :)

d-frey commented 5 years ago

Some months have passed and I'm wondering if we can close this issue or if there are still some open points/questions. Please let me know, thanks!

engelant commented 5 years ago

Thanks for your effort, at the currrent time beeing I had to abandon my plans, so yeah, closed.