Bug: outputs "corrected" HTML, even if input was sloppy

mgdm / htmlq

Like jq, but for HTML.

MIT License

7.09k stars 111 forks

HTMLQ "purifies" incorrect HTML, even when that isn't desirable.

Example input:

<h3 class=subhead>Some Heading</h3>

When selecting the .subhead class as desired output, the heading is returned as:

<h3 class="subhead">Some Heading</h3>

That's fine if you want to render the result in a browser, but if you're using it as a search-and-replace pattern for awk, sed, or fsed, as I am, the pattern fails to match because of the quotes htmlq added.
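To make the failure concrete, here is a minimal Python sketch of a literal search and replace keyed on the re-serialized heading. The surrounding `<body>` wrapper and the replacement text are made up for illustration; only the two heading strings come from this report.

```python
# Input as it appears in the source file (unquoted attribute value).
original = '<h3 class=subhead>Some Heading</h3>'
# Output reported in this issue for the .subhead selector (quotes added).
reserialized = '<h3 class="subhead">Some Heading</h3>'

# Hypothetical surrounding document, for illustration only.
document = '<body>\n' + original + '\n</body>'

# A literal replace keyed on the re-serialized string finds nothing to replace.
replaced = document.replace(reserialized, '<h2>New Heading</h2>')
print(replaced == document)  # True: the document is unchanged
```

The same thing happens with any tool that matches the pattern character for character.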

In short, htmlq is re-constructing the HTML to be more spec-correct, and in doing so it breaks character-for-character matches between its output and the otherwise unchanged input.
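Until something like this is addressed in htmlq itself, one possible workaround is to match the output loosely instead of literally. The sketch below is an assumption on my part, not anything htmlq provides: the helper name `quote_tolerant_pattern` is hypothetical, and it only compensates for added double quotes, not for any other normalization the serializer might perform.

```python
import re


def quote_tolerant_pattern(serialized: str) -> re.Pattern:
    """Build a regex from serialized HTML in which every double quote is optional.

    Hypothetical helper: it only handles the added-quotes case described in
    this issue, nothing else the serializer may have changed.
    """
    # Split on the quotes, escape each literal piece, and rejoin the pieces
    # with an optional-quote token.
    parts = serialized.split('"')
    return re.compile('"?'.join(re.escape(part) for part in parts))


original = '<h3 class=subhead>Some Heading</h3>'
pattern = quote_tolerant_pattern('<h3 class="subhead">Some Heading</h3>')

# The loosened pattern matches the unquoted original, so the replace succeeds.
print(pattern.sub('<h2>New Heading</h2>', original))  # <h2>New Heading</h2>
```

This is a stopgap at best; it does not preserve the original bytes in htmlq's output, which is what the report is really asking for.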

N.B. While it would still be a problem, I wouldn't care about this so much if #36 were implemented.

Maybe htmlq needs a --purify or --no-purify option?

Bug: outputs "corrected" HTML, even if input was sloppy #53