mgdm / htmlq

Like jq, but for HTML.
MIT License

Bug: outputs "corrected" HTML, even if input was sloppy #53

Open XLTechie opened 2 years ago

XLTechie commented 2 years ago

HTMLQ "purifies" incorrect HTML, even when that isn't desirable.

Example input:

<h3 class=subhead>Some Heading</h3>

When selecting the .subhead class as desired output, the heading is returned as:

<h3 class="subhead">Some Heading</h3>

That's fine if you want to render the result in a browser, but if you're using it as a search-and-replace pattern for awk, sed, or fsed, as I am, the pattern fails to match because of the quotes that htmlq added.

In short, HTMLQ reconstructs the HTML to be more spec-correct, and in doing so it breaks character-for-character matches between its output and the otherwise unchanged parts of the input.
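To illustrate the effect described above, here is a minimal Python sketch. It is not htmlq's code, and it only assumes that the quoting is lost because the tool parses the markup and then serializes it again. It uses the standard library's `html.parser`; the names `StartTagRecorder` and `compare_start_tags` are made up for this example. For each start tag it records the verbatim source text next to a version rebuilt from the parsed attributes.

```python
from html import escape
from html.parser import HTMLParser


class StartTagRecorder(HTMLParser):
    """Collects (verbatim source text, re-serialized text) for each start tag."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        # attrs holds parsed (name, value) pairs; the original quoting is gone.
        parts = []
        for name, value in attrs:
            if value is None:
                # Attribute without a value, e.g. <input disabled>
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        rebuilt = f"<{tag}{''.join(parts)}>"
        # get_starttag_text() returns the start tag exactly as written in the input.
        self.records.append((self.get_starttag_text(), rebuilt))


def compare_start_tags(source):
    parser = StartTagRecorder()
    parser.feed(source)
    parser.close()
    return parser.records


records = compare_start_tags("<h3 class=subhead>Some Heading</h3>")
for verbatim, rebuilt in records:
    print(verbatim, "->", rebuilt)
```

The rebuilt tag is what a serializer would naturally emit, while the verbatim text is what a sed or awk pattern would have to match. A tool that wants to preserve the input would need to keep the latter around.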

N.B. While it would still be a problem, I wouldn't care about this so much if #36 were implemented.

Maybe htmlq needs a --purify or --no-purify option?

BobBorges commented 4 months ago

I also get extra tags in my output:

<table><tr>

becomes

<table><tbody><tr>

I'm using this tool to inspect source documents (think sloppy HTML, tens or hundreds of thousands of characters long, with no whitespace, line breaks, or indentation) in order to figure out the structure and extract the contents in a reasonable way. More than once I've pulled my hair out wondering why I can't find the tbody element, only to discover that it isn't in the source.
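One possible way to check which tags are literally present in a source document is sketched below in Python. It relies on the standard library's `html.parser`, which reports tags as they appear in the text and does not add implied elements such as `tbody` the way an HTML5 tree builder does. The names `LiteralTagCounter` and `literal_tag_counts` are hypothetical, and this is a side check, not a feature of htmlq.

```python
from collections import Counter
from html.parser import HTMLParser


class LiteralTagCounter(HTMLParser):
    """Counts start tags exactly as they occur in the source text."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per start tag found in the input; nothing is inferred.
        self.counts[tag] += 1


def literal_tag_counts(source):
    counter = LiteralTagCounter()
    counter.feed(source)
    counter.close()
    return counter.counts


counts = literal_tag_counts("<table><tr><td>cell</td></tr></table>")
print(counts["table"], counts["tr"], counts["tbody"])
```

If a tag shows up in a tool's output but has a count of zero here, it was added during parsing rather than written in the source.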

+1 for a flag option