taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License

Improperly closed tags break output #152

Closed: AlenToma closed this issue 2 years ago

AlenToma commented 3 years ago

I am unable to parse the following HTML for some reason:

<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj
</div>
</div>

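My current code is:

  import { parse } from "node-html-parser";

  // Takes the HTML as a parameter instead of relying on an outer variable.
  const validate = (html) => {
       // Assuming the intent is to strip the literal placeholders [[class]] and [[id]]:
       // the brackets must be escaped, otherwise /[[class]]/ is a character class followed by "]".
       html = html.replace(/<!DOCTYPE html>/g, "").replace(/\[\[class\]\]/g, "").replace(/\[\[id\]\]/g, "");
       const container = parse("<div>" + html + "</div>");
       const content = container.querySelector("#chr-content");
       console.log(content != null ? "found" : "not found");
  }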
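
A side note on the placeholder-stripping step: in a JavaScript regular expression, unescaped square brackets form a character class, so literal text such as [[class]] has to be matched with escaped brackets. The following is a minimal sketch of the difference; the sample string is made up for illustration and is not taken from the reporter's data.

```javascript
// Hypothetical input, made up for illustration.
const sample = '<p class="[[class]]" id="[[id]]">hi</p>';

// Unescaped: "[[class]" is a character class (one of '[', 'c', 'l', 'a', 's')
// followed by a literal ']', so only a two-character sequence like "s]" matches.
const unescaped = sample.replace(/[[class]]/g, '');

// Escaped: each pattern matches the literal placeholder text.
const escaped = sample
  .replace(/\[\[class\]\]/g, '')
  .replace(/\[\[id\]\]/g, '');

console.log(unescaped);
console.log(escaped);
```

This is unrelated to the missing #chr-content element, but it means the unescaped patterns may silently leave placeholders in place or delete unrelated characters.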

This is on Node.js with the latest version 4.1.4

Here is a Snack example I created that reproduces the problem, if you would like to test it.

snack

Right now I am using react-native-html-parser together with this library to fix incorrect HTML that is missing end tags.

It seems that node-html-parser simply ignores and rewrites the HTML, removing <div id="chr-content"> for some reason.


Contributor's Note

Although this library was built with the known limitation of requiring proper HTML, we are looking at revising the logic in a way which will not impact performance but will be able to more reasonably handle issues of unmatched open and close tags.

This issue will be left open until that has been addressed.

— @nonara

taoqf commented 3 years ago

This lib is not supposed to deal with incorrect HTML. I am sorry about that. If you can fix this, I am happy to merge your PR.

AlenToma commented 2 years ago

I have checked the code and noticed something in it that causes the output to break.

Check this line:

// Single error  <div> <h3> </div> handle: Just removes <h3>
oneBefore.removeChild(last);

This will remove the child and its content. Why don't we just close it?
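
The two strategies being weighed here, dropping an unclosed element versus closing it implicitly, can be illustrated with a toy stack-based parser. This is only a sketch under simplifying assumptions, not node-html-parser's implementation: the names toyParse, serialize, and keepUnclosed are hypothetical, a mismatched closing tag is simply ignored, and a dropped element has its children hoisted into its parent.

```javascript
// Toy model only: NOT node-html-parser's code. All names are hypothetical.
function toyParse(html, { keepUnclosed = false } = {}) {
  const root = { tag: null, attrs: '', children: [] };
  const stack = [root];
  // Matches an opening tag, a closing tag, or a run of text.
  const tokenRe = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>|([^<]+)/g;
  let m;
  while ((m = tokenRe.exec(html)) !== null) {
    const [, slash, tag, attrs, text] = m;
    const top = stack[stack.length - 1];
    if (text !== undefined) {
      top.children.push(text);
    } else if (!slash) {
      const el = { tag, attrs, children: [] };
      top.children.push(el);
      stack.push(el);
    } else if (top.tag === tag) {
      stack.pop();
    }
    // A closing tag that does not match the innermost open element is ignored.
  }
  // Everything still on the stack above the root was never closed.
  while (stack.length > 1) {
    const last = stack.pop();
    const parent = stack[stack.length - 1];
    if (!keepUnclosed) {
      // Drop the element itself and hoist its children into the parent.
      const i = parent.children.indexOf(last);
      parent.children.splice(i, 1, ...last.children);
    }
    // With keepUnclosed, the element stays and is closed on serialization.
  }
  return root;
}

function serialize(node) {
  if (typeof node === 'string') return node;
  const inner = node.children.map(serialize).join('');
  return node.tag === null
    ? inner
    : `<${node.tag}${node.attrs}>${inner}</${node.tag}>`;
}

const broken = '<div><div id="chr-content"><span>text</div></div>';
const kept = serialize(toyParse(broken, { keepUnclosed: true }));
const dropped = serialize(toyParse(broken));
console.log(kept);
console.log(dropped);
```

In this toy model the missing </span> leaves every enclosing element on the stack, so the dropping strategy discards the element with id chr-content as well, while the keeping strategy preserves it and closes each element at the end of the input.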

AlenToma commented 2 years ago

Hi again. I inspected the code above and ran some tests.

I do not understand why you remove the last element, since you have already found the unclosed tags.

Anyway, here is a possible solution that worked.

I added a parseNoneClosedTags option

and changed the code to the following:

export function parse(data: string, options = { lowerCaseTagName: false, comment: false } as Partial<Options>) {
    const stack = base_parse(data, options);
    const [root] = stack;
    while (stack.length > 1) {
        // Handle each error element.
        const last = stack.pop();
        const oneBefore = arr_back(stack);
        if (last.parentNode && last.parentNode.parentNode) {
            if (last.parentNode === oneBefore && last.tagName === oneBefore.tagName) {
                // Pair error case <h3> <h3> handle : Fixes to <h3> </h3>
                // This is wrong, because it puts the <h3> content outside its correct position, which should be inside the current HTML element. See issue 152 for more info.
                if (options.parseNoneClosedTags !== true) {
                    oneBefore.removeChild(last);
                    last.childNodes.forEach((child) => {
                        oneBefore.parentNode.appendChild(child);
                    });
                    stack.pop();
                } 

            } else {

                // Single error  <div> <h3> </div> handle: Just removes <h3>
                // Why remove it? It is already an HTMLElement, and the missing </h3> is already added in this case. See issue 152 for more info.
                if (options.parseNoneClosedTags !== true) {
                    oneBefore.removeChild(last);
                    last.childNodes.forEach((child) => {
                        oneBefore.appendChild(child);
                    });
                }
            }
        } else {
            // If it's final element just skip.
        }
    }

    return root;
}

And here is the test for this issue, which passes.

const { parse } = require('@test/test-target');

describe('issue 152', function () {
    it('should parse attributes right', function () {
        const html = `<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj
</div>
</div>`;
        const expected = `<div>
<div id="chr-content">
<span>
  lkjasdkjasdkljakldj

</span></div></div>`;

        const root = parse(html, { parseNoneClosedTags: true });
        root.toString().should.eql(expected);
        // const div = root.firstChild;
        // div.getAttribute('#input').should.eql('');
        // div.getAttribute('(keyup)').should.eql('applyFilter($event)');
        // div.getAttribute('placeholder').should.eql('Ex. IMEI');
        // root.innerHTML.should.eql(html);
    });
});

Could you please have a look and let me know whether this could work? If it does, please check it in and publish it on npm so we can use it.

taoqf commented 2 years ago

Merged your code in v5.2.2.

VityaSchel commented 1 year ago

Please tell me if there is a library that can parse malformed HTML.