taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.12k stars 112 forks source link

[Doc] Clarify wrapper behaviour of parse() #153

Open DanKaplanSES opened 3 years ago

DanKaplanSES commented 3 years ago

Here is a failing test case:

    it("reproduces error", () => {
      const html = `
      <html>
        <body>
          <div>
            <p>b</p>
            <p>c</p>
          <div>
        </body>
      </html>`

      const root = parse(html)
      root.querySelector('div').childNodes.unshift(parse('<p>a</p>'));

      expect(root.querySelector("p:nth-of-type(2)").rawText).toEqual("b");
/*
    Expected: "b"
    Received: "c"

      28 |       root.querySelector('div').childNodes.unshift(parse('<p>a</p>'));
      29 |
    > 30 |       expect(root.querySelector("p:nth-of-type(2)").rawText).toEqual("b");
         |                                                              ^
      31 |     });
*/
    });

My workaround was to use childNodes[0].set_content(newContent + oldContent).

The reason I wrote this code in the first place is because I couldn't figure out from the documentation how to prepend content. I looked at the tests and noticed some of them were using childNodes[0] and figured, if you can index that array, you should be able to call other functions on it and have it work.

DanKaplanSES commented 3 years ago

I found another bug using my workaround: it seems to nest another div around everything. I think there are a lot of bugs to find using similar patterns to the above.

taoqf commented 3 years ago
const root = parse(html);
root.querySelector('div').childNodes.unshift(parse('<p>a</p>').firstChild);
DanKaplanSES commented 3 years ago

I don't understand why I need to do that firstChild. To me, it suggests that everything named root in the documentation is not really the root.

Regardless, finding a part of HTML and modifying it (or its neighbors) seems like a common use case. Can the readme be updated to show examples of this use case?

I don't know if you would prefer another issue for this, or for me to change the title of this one. Let me know.

taoqf commented 3 years ago

The root node is not really a node, it is a container of all the root nodes.

nonara commented 3 years ago

The node in question is a wrapper node. It is a commonly used pattern for html parsers, however, I think it makes sense to add something to the readme which explains the behaviour of parse.

If you want to change the issue title, I'll reopen if @taoqf is amenable to my doing a PR that clarifies that bit in the readme.