iabudiab / HTMLKit

An Objective-C framework for your everyday HTML needs.
MIT License

Best way to loop through the HTML contents? #14

Closed Sjoerdjanssenen closed 7 years ago

Sjoerdjanssenen commented 7 years ago

I've created a small testing app with HTMLKit inside it. I want to present a UITableView with all the HTML in it. I load the HTML as follows:

NSString *htmlString = @"<div><div><p>Test!</p></div><p>Test!</p><h1>HTMLKit</h1><p>Hello there!</p></div>";

 // Via parser
HTMLParser *parser = [[HTMLParser alloc] initWithString:htmlString];
HTMLDocument *document = [parser parseDocument];
HTMLElement *head = document.head;
[self.items addObject:[self addElement:head]];
HTMLElement *body = document.body;
[self.items addObject:[self addElement:body]];

And these are the addElement related functions:

- (Entry *)addElement:(HTMLElement *)element {
    Entry *entry = [[Entry alloc] init];

    if (element.childElementsCount > 0) {
        entry.tags = [self enumerate:element];
    }
    entry.tag = element.outerHTML;

    return entry;
}

- (NSArray *)enumerate:(HTMLElement *)element {
    NSMutableArray *items = [@[] mutableCopy];
    for (int i = 0; i < element.childElementsCount; i++) {
        HTMLElement *child = [element childElementAtIndex:i];

        [items addObject:[self addElement:child]];
    }

    return items;
}

This has the screenshot below as a result:

[Image: screen shot 2017-05-15 at 15 06 58]

Ideally, I'd like for <body> to not contain the <div><div>..etc. What's the best way to achieve this with HTMLKit?

iabudiab commented 7 years ago

@Sjoerdjanssenen Hey there. The short answer would be textContent, I guess.

The long answer, however, depends on what exactly you're trying to achieve.

The <body> of the DOM produced by your HTML string <div><div><p>Test!</p></div><p>Test!</p><h1>HTMLKit</h1><p>Hello there!</p></div> would look like this:

                     <div>
                       |
    +------------+-----------+-------------+
    |            |           |             |
  <div>         <p>         <h1>          <p>
    |            |           |             |
   <p>        "Test!"    "HTMLKit"   "Hello there!"
    |
 "Test!"

Now given this DOM, what should each row of the table contain?

Would be glad to help, but I'm not really sure what it is exactly you want 😺

Sjoerdjanssenen commented 7 years ago

Fair enough. Thanks for trying to help though! As you can see, my tableView already supports indentation, so I guess it would need to look something like this:

+<div></div>
+---<div></div>
+------<p>Test!</p>
+---<p>Test!</p>
+---<h1>HTMLKit</h1>
+---<p>Hello there!</p>

iabudiab commented 7 years ago

@Sjoerdjanssenen OK, I think I now know where this is going. Correct me if I'm wrong: you want to iterate the DOM in tree-order, i.e. depth first, and display the DOM in a tree-like way in the table.

Let's take it step by step.

The first problem with the output that you want is the assumption that each element is either empty or contains exactly one text node.

Take this HTML for example: <div><div><p>Test!</p></div>Hello<h1>HTMLKit</h1></div> i.e. the following DOM:

               <div>
                 |
    +------------+-----------+
    |            |           |
  <div>       "Hello"       <h1>
    |                        |
   <p>                   "HTMLKit"
    |
 "Test!"

What should the output be in this case, given that the first <div> element has three child nodes? Something like this:

+<div></div>
+---<div></div>
+------<p>Test!</p>
+---Hello
+---<h1>HTMLKit</h1>

Here is another example for a DOM that can be created programmatically:

               <div>
                 |
    +------------+-----------+
    |            |           |
  <div>       "Hello"    "HTMLKit" 
    |
   <p>
    |
 "Test!"

The <div> contains three child nodes now, two of which are text nodes. So what is the output now? Is this correct:

+<div></div>
+---<div></div>
+------<p>Test!</p>
+---Hello
+---HTMLKit

But then this is not consistent with the <p> element, since its child text node is displayed inline and not in a separate row.

Check the Live DOM Viewer to see what I mean.

outerHTML & textContent

You can't use outerHTML here, because it gives you the serialized HTML of the element including its descendants (see the MDN reference). And you can't use textContent directly either, because it also includes the text of all descendant nodes (see the MDN reference).
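
To illustrate the difference, here is a minimal sketch that does not use HTMLKit at all. The `Node` enum and the `textContent(_:)` and `ownText(_:)` functions are hypothetical stand-ins made up for this example; they only mimic how DOM textContent concatenates the text of every descendant, compared with collecting just the direct text children.

```swift
// Hypothetical stand-in for a DOM tree; these are NOT HTMLKit types.
enum Node {
    case element(tag: String, children: [Node])
    case text(String)
}

/// Text of the node plus all of its descendants (mimics DOM textContent).
func textContent(_ node: Node) -> String {
    switch node {
    case .text(let value):
        return value
    case .element(_, let children):
        return children.map(textContent).joined()
    }
}

/// Text of an element's direct text children only; empty for non-elements.
func ownText(_ node: Node) -> String {
    guard case .element(_, let children) = node else { return "" }
    return children.compactMap { child -> String? in
        if case .text(let value) = child { return value }
        return nil
    }.joined()
}

// Models <div><p>Test!</p>Hello</div>
let div = Node.element(tag: "div", children: [
    .element(tag: "p", children: [.text("Test!")]),
    .text("Hello"),
])

print(textContent(div)) // Test!Hello
print(ownText(div))     // Hello
```

With a real DOM, the same idea would mean looking at an element's child nodes and keeping only the text nodes, instead of asking the element for its full text.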

HTMLNodeIterator

The best way to iterate the DOM in tree-order is to use the HTMLNodeIterator. You can use it like this:

let str = "<div><div><p>Test!</p></div><p>Test!</p><h1>HTMLKit</h1><p>Hello there!</p></div>"
let doc = HTMLDocument(string: str)
let body = doc.body!

let iterator = body.nodeIterator()

for node in iterator {
    print(node)
}

// The output would be something like this:
// <HTMLElement: 0x7fecc85b3b20 <body>>
// <HTMLElement: 0x7fecc85b3d60 <div>>
// <HTMLElement: 0x7fecc85b3a80 <div>>
// <HTMLElement: 0x7fecc85b3f00 <p>>
// <HTMLText: 0x7fecc85b4850 "Test!">
// <HTMLElement: 0x7fecc85b4d30 <p>>
// <HTMLText: 0x7fecc85b4e80 "Test!">
// <HTMLElement: 0x7fecc85b4fa0 <h1>>
// <HTMLText: 0x7fecc85b50f0 "HTMLKit">
// <HTMLElement: 0x7fecc85b5470 <p>>
// <HTMLText: 0x7fecc85b5680 "Hello there!">

However, you lose the depth information along the way, since the iterator simply iterates the DOM as if it were a flat list. Unfortunately, there is no property specified in the HTML DOM for a node's depth.
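
If a flat, tree-order list is still what you want, the depth can be recovered by counting parent links up to the root (the DOM defines parentNode for this kind of walk). The sketch below is self-contained and uses a hypothetical `Node` class with a `parent` link rather than HTMLKit's own types; `treeOrder(_:)` and `depth(of:below:)` are names invented for this example.

```swift
// Hypothetical minimal tree node with a parent link; NOT an HTMLKit type.
final class Node {
    let name: String
    private(set) weak var parent: Node?
    private(set) var children: [Node] = []

    init(_ name: String) { self.name = name }

    /// Appends `child`, sets its parent link, and returns it for chaining.
    @discardableResult
    func append(_ child: Node) -> Node {
        child.parent = self
        children.append(child)
        return child
    }
}

/// Pre-order (tree-order) traversal flattened into a list.
func treeOrder(_ root: Node) -> [Node] {
    return [root] + root.children.flatMap(treeOrder)
}

/// Number of parent links between `node` and `root` (0 if they are the same).
func depth(of node: Node, below root: Node) -> Int {
    var steps = 0
    var current = node
    while current !== root, let parent = current.parent {
        steps += 1
        current = parent
    }
    return steps
}

// Models body > div > (div > p, h1)
let body = Node("body")
let outer = body.append(Node("div"))
let inner = outer.append(Node("div"))
inner.append(Node("p"))
outer.append(Node("h1"))

let lines = treeOrder(body).map { node in
    String(repeating: "---", count: depth(of: node, below: body)) + node.name
}
lines.forEach { print($0) }
```

Note that this walks up the tree once per node, so it costs time proportional to the node's depth on every step. Passing the depth down during a recursive traversal avoids that.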

Recursion

Another approach would be to accumulate all the nodes recursively. Here is one such variant:

struct Entry {
    let depth: Int
    let contents: String
}

func visit(_ node: HTMLNode, depth: Int) -> [Entry] {
    let contents = { () -> String in
        switch node.nodeType {
        case .element: return "<\((node as! HTMLElement).tagName)>"
        default: return node.textContent
        }
    }()

    let entry = Entry(depth: depth, contents: contents)
    let children = node.childNodes.flatMap { visit($0 as! HTMLNode, depth: depth + 1) }
    return [entry] + children
}

let str = "<div><div><p>Test!</p></div><p>Test!</p><h1>HTMLKit</h1><p>Hello there!</p></div>"
let doc = HTMLDocument(string: str)
let body = doc.body!
let entries = body.childNodes.flatMap { visit($0 as! HTMLNode, depth: 0) }

entries.forEach { entry in
    let prefix = String(repeating: "---", count: entry.depth)
    print("+\(prefix)\(entry.contents)")
}

// The output:
// +<div>
// +---<div>
// +------<p>
// +---------Test!
// +---<p>
// +------Test!
// +---<h1>
// +------HTMLKit
// +---<p>
// +------Hello there!
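
To get closer to the table layout asked for above, one possible rule is: if an element has exactly one child and that child is a text node, show the text inline; otherwise give every child its own row. The following is only a sketch of that rule on a hypothetical `Node` enum (not HTMLKit types); `Row` and `rows(for:depth:)` are names made up for this example. Whether this rule is the right one depends on how you want mixed content to look.

```swift
// Hypothetical stand-in for a parsed DOM; these are NOT HTMLKit types.
enum Node {
    case element(tag: String, children: [Node])
    case text(String)
}

struct Row: Equatable {
    let depth: Int
    let contents: String
}

/// Flattens the tree into rows, inlining a lone text child into its element.
func rows(for node: Node, depth: Int = 0) -> [Row] {
    switch node {
    case .text(let value):
        return [Row(depth: depth, contents: value)]
    case .element(let tag, let children):
        // Exactly one child and it is text: render it inline, e.g. <p>Test!</p>.
        if children.count == 1, case .text(let value) = children[0] {
            return [Row(depth: depth, contents: "<\(tag)>\(value)</\(tag)>")]
        }
        // Otherwise the element gets its own row and each child a deeper one.
        let own = Row(depth: depth, contents: "<\(tag)></\(tag)>")
        return [own] + children.flatMap { rows(for: $0, depth: depth + 1) }
    }
}

// Models <div><div><p>Test!</p></div>Hello<h1>HTMLKit</h1></div>
let dom = Node.element(tag: "div", children: [
    .element(tag: "div", children: [
        .element(tag: "p", children: [.text("Test!")]),
    ]),
    .text("Hello"),
    .element(tag: "h1", children: [.text("HTMLKit")]),
])

let output = rows(for: dom).map {
    "+" + String(repeating: "---", count: $0.depth) + $0.contents
}
output.forEach { print($0) }
// +<div></div>
// +---<div></div>
// +------<p>Test!</p>
// +---Hello
// +---<h1>HTMLKit</h1>
```

The `depth` of each row could then drive the indentation level of the corresponding table cell.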

Alternatives

In general, DOM traversal in HTMLKit can be done via:

Let me know if you need any further help 😉

iabudiab commented 7 years ago

@Sjoerdjanssenen Hey there! How is it going with this issue? Do you need any further help or can I close it?

iabudiab commented 7 years ago

@Sjoerdjanssenen I'll close this for now. Feel free to reopen.