google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0

question: walking the tree #394

Closed magna25 closed 7 years ago

magna25 commented 7 years ago

This might be a silly question, but I can't find any examples of how to walk the output tree.

GumboOutput* output = gumbo_parse("<h1>Hello  <span>World!</span></h1>");

GumboNode* root = output->root;
<body>
    <div>
        Hello <span>World!</span>
        <div>
            <span>I'm a span</span>
        </div>
    </div>
    <div id="test">Second div</div>
</body>
// Trying to turn the HTML above into something like the structure below

Node[0] => {
    tagName: "body",
    text: null,
    Attributes: null,
    children: {
        Node[0] => {
            tagName: "div",
            text: "Hello",
            Attributes: null,
            children: {
                Node[0] => {
                    tagName: "span",
                    text: "World!",
                    Attributes: null,
                    children: null
                }
                Node[1] => {
                    tagName: "div",
                    text: null,
                    Attributes: null,
                    children:{
                        Node[0] => {
                            tagName: "span",
                            text: "I'm a span",
                            Attributes: null,
                            children: null
                        }
                    }
                }
            }
        }
        Node[1] => {
            tagName: "div",
            text: "Second div",
            Attributes: {"id": "test"},
            children: null
        }
    }

}
kevinhendricks commented 7 years ago

See examples/serialize.cc, which uses recursion to walk and rebuild the tree: https://github.com/google/gumbo-parser/blob/master/examples/serialize.cc

Also see https://github.com/google/gumbo-parser/pull/392/commits/7f73b3b836ae75bb40c3ce1bff46c1ac913a2cae, a pull request that includes a non-recursive tree traversal routine.
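
For readers who only want the shape of the recursive walk, here is a minimal sketch. It is not the code from serialize.cc. It uses a simplified stand-in type (`DemoNode`, hypothetical) so that it compiles without the library. In gumbo.h the equivalent fields are `node->type`, `node->v.element.children` (a GumboVector with `data` and `length`), and `node->v.text.text`. The function name `walk` and the output format are also assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for GumboNode: an element has a tag and children,
 * a text node has text and no children. */
typedef enum { DEMO_ELEMENT, DEMO_TEXT } DemoType;

typedef struct DemoNode {
    DemoType type;
    const char *tag;             /* used by elements only */
    const char *text;            /* used by text nodes only */
    struct DemoNode **children;  /* used by elements only */
    size_t num_children;
} DemoNode;

/* Depth-first, pre-order walk: emit the node, then recurse into each child.
 * Appends one indented line per node to the NUL-terminated buffer `out`,
 * which has a total capacity of `cap` bytes. */
static void walk(const DemoNode *node, int depth, char *out, size_t cap) {
    assert(node != NULL && out != NULL);
    size_t used = strlen(out);

    if (node->type == DEMO_TEXT) {
        /* Text nodes are leaves: print the text and stop. */
        snprintf(out + used, cap - used, "%*s\"%s\"\n", depth * 2, "", node->text);
        return;
    }

    snprintf(out + used, cap - used, "%*s<%s>\n", depth * 2, "", node->tag);
    for (size_t i = 0; i < node->num_children; ++i) {
        walk(node->children[i], depth + 1, out, cap);
    }
}
```

With the real library the loop would run over `children.length` and cast each `children.data[i]` to `GumboNode*`. The recursion itself stays the same.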

magna25 commented 7 years ago

Thanks

magna25 commented 7 years ago

@kevinhendricks I found the recursive function very useful, but I was wondering whether there is any way to determine the relationship (parent and child) between the nodes. Is there something like a has_children() method?

kevinhendricks commented 7 years ago

In the recursive routine, serialize_contents processes each child of the node that serialize passes to it. Please examine gumbo.h for information on the node structure, how children are stored, and how the pointer back to the parent is stored.
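
As a rough illustration of the relationships described above, the sketch below defines small helpers on a stand-in type (`DemoNode`, hypothetical) that mirrors the idea of a child list plus a back-pointer. In gumbo.h the corresponding fields are `node->parent`, `node->index_within_parent`, and, for element nodes, `node->v.element.children.length`. As far as I can tell gumbo.h has no has_children() function, so the helper names here (`has_children`, `is_child_of`, `depth_of`) are assumptions and not part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for GumboNode, reduced to the linkage fields. */
typedef struct DemoNode {
    const char *tag;
    struct DemoNode *parent;     /* NULL for the root */
    struct DemoNode **children;
    size_t num_children;
} DemoNode;

/* True when the node has at least one child. */
static bool has_children(const DemoNode *node) {
    assert(node != NULL);
    return node->num_children > 0;
}

/* True when `parent` is the direct parent of `child`. */
static bool is_child_of(const DemoNode *child, const DemoNode *parent) {
    assert(child != NULL);
    return child->parent == parent;
}

/* Number of ancestors, found by following parent pointers up to the root. */
static size_t depth_of(const DemoNode *node) {
    assert(node != NULL);
    size_t depth = 0;
    for (const DemoNode *p = node->parent; p != NULL; p = p->parent) {
        ++depth;
    }
    return depth;
}
```

With the real library, a has_children check would first need to confirm the node is an element (`node->type == GUMBO_NODE_ELEMENT`), since text nodes use a different member of the `v` union.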

data-man commented 7 years ago

Try MyHTML or Modest and be happy.