taoqf / node-html-parser

A very fast HTML parser, generating a simplified DOM, with basic element query support.
MIT License
1.11k stars, 107 forks

Bug when parsing <![CDATA[]]> tag which contains <> (angle brackets) #263

Open navrkald opened 7 months ago

navrkald commented 7 months ago

How to reproduce the issue:

import { parse } from "node-html-parser";

console.log(
  parse(
    `<ac:structured-macro
          ac:name="code"
          ac:schema-version="1"
          ac:macro-id="some id">
            <ac:parameter ac:name="language">bash</ac:parameter>
            <ac:plain-text-body>
              <![CDATA[
              export AWS_ACCESS_KEY_ID=<your Access key ID> export AWS_SECRET_ACCESS_KEY=<your Secret access key>
              ]]>
            </ac:plain-text-body>
        </ac:structured-macro>
        <p><br/></p>`
  ).toString()
);

The output of this program is:

     <ac:structured-macro           ac:name="code"
              ac:schema-version="1"
              ac:macro-id="some id">
                <ac:parameter ac:name="language">bash</ac:parameter>

                  <![CDATA[
                  export AWS_ACCESS_KEY_ID=<your Access key ID> export AWS_SECRET_ACCESS_KEY=</your>
                  ]]>

            <p><br></p></ac:structured-macro>

The problem is twofold. The parser corrupts the content of the CDATA section (note the spurious </your>), and it also gets confused and corrupts the rest of the HTML. It completely swallows the <ac:plain-text-body> tag, and the closing </ac:structured-macro> tag, which should come immediately after </ac:plain-text-body>, is moved to the end of the HTML.

If I remove the angle brackets <> from the content of the CDATA section, the HTML is parsed and printed correctly.
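Since the parse succeeds once the angle brackets are out of the way, one possible workaround until this is fixed is to swap each CDATA section for a placeholder before parsing and put it back after serializing. The sketch below is only an illustration: shieldCdata and the __CDATA_n__ token format are hypothetical names, not part of node-html-parser, and it assumes the input does not already contain text that looks like such a token.

```typescript
// Hypothetical workaround helper; not part of node-html-parser.
// Matches a whole CDATA section, including its delimiters.
const CDATA_PATTERN = /<!\[CDATA\[[\s\S]*?\]\]>/g;

interface ShieldedHtml {
  // Input with every CDATA section replaced by a placeholder token.
  html: string;
  // Puts the original CDATA sections back into a serialized string.
  restore: (serialized: string) => string;
}

function shieldCdata(source: string): ShieldedHtml {
  const sections: string[] = [];

  // Assumption: "__CDATA_<n>__" does not otherwise occur in the input.
  const html = source.replace(CDATA_PATTERN, (section: string) => {
    sections.push(section);
    return `__CDATA_${sections.length - 1}__`;
  });

  const restore = (serialized: string): string =>
    serialized.replace(
      /__CDATA_(\d+)__/g,
      // Unknown indices are left untouched rather than dropped.
      (match: string, index: string) => sections[Number(index)] ?? match
    );

  return { html, restore };
}

const sample = "<a><![CDATA[ KEY=<your key> ]]></a><p></p>";
const shielded = shieldCdata(sample);
// shielded.html contains no angle brackets from the CDATA body, so it could be
// passed through the parser and restored afterwards, for example:
//   const output = shielded.restore(parse(shielded.html).toString());
console.log(shielded.html);
```

The placeholder travels through the parser as ordinary text, so the CDATA body is never tokenized. This only sidesteps the bug; it does not change how the parser itself treats CDATA.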

Expected results:

The parser should not try to interpret angle brackets inside a <![CDATA[]]> section in any way, and should parse the HTML correctly.
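That expectation can be stated as a simple check: every CDATA section in the input should appear verbatim in the serialized output. The following is a minimal sketch of such a check; missingCdataSections is a hypothetical helper name, and the two sample strings are shortened stand-ins for the real input and the corrupted output shown above.

```typescript
// Hypothetical helper: returns the CDATA sections of `input` that do not
// appear verbatim in `output`. An empty result means all of them survived.
function missingCdataSections(input: string, output: string): string[] {
  const sections = input.match(/<!\[CDATA\[[\s\S]*?\]\]>/g) ?? [];
  return sections.filter((section) => !output.includes(section));
}

const original = "<b><![CDATA[KEY=<your key>]]></b>";
const corrupted = "<b><![CDATA[KEY=</your>]]></b>";

console.log(missingCdataSections(original, original).length);  // 0
console.log(missingCdataSections(original, corrupted).length); // 1
```

In a regression test, the second argument would be the result of parse(input).toString().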

Note:

This is just a small part of a large HTML page, all of which gets corrupted because of this bug.