j9t / obsohtml

Detection of obsolete and proprietary HTML elements and attributes
https://www.npmjs.com/package/@j9t/obsohtml
Creative Commons Attribution Share Alike 4.0 International
60 stars 0 forks source link

Obsolete attributes regex doesn't account for valueless attribute use #3

Closed mattbrundage closed 2 months ago

mattbrundage commented 2 months ago

Valueless attribute use is common with boolean attributes. Among your list of obsolete attributes, "noshade" and "nowrap" need special handling to account for scenarios such as <th nowrap>, but while also avoiding false positives such as <th title="The nowrap attribute is obsolete">.

Using regex on HTML is a minefield.

j9t commented 2 months ago

Yes! This must be caught and dealt with appropriately. Will work on an update, thanks!

FabianBeiner commented 2 months ago
const attributeRegex = new RegExp(`<[^>]*\\b${attribute}\\b(?=\\s*(=|\\s*[/]*>))`, 'i');

maybe?

j9t commented 2 months ago

Thanks @FabianBeiner for the proactive feedback! Was just about to jump at this after preparing a (basic) test, but that regex seems to work!

Prepared PR #4 for this. Feel free to review and leave feedback—as the whole project is an AI test case, I’m working with and testing a number of tools to support, too.

j9t commented 2 months ago

Wanted to make more time available for review, but accidentally merged the PR—still open to feedback if you have more thoughts! Thanks for the report, @mattbrundage, and the update, @FabianBeiner!

mattbrundage commented 2 months ago

@FabianBeiner your solution is an incremental improvement, but still matches sequences such as <th class=nowrap>, which happens to be a common convention in my own projects.

FabianBeiner commented 2 months ago

@FabianBeiner your solution is an incremental improvement, but still matches sequences such as <th class=nowrap>, which happens to be a common convention in my own projects.

Oh, the infamous unquoted attribute value syntax, of course I forgot about that. 🙈 Which brings us back to your original words:

Using regex on HTML is a minefield.

However, here is an update:

const attributeRegex = new RegExp(`<[^>]*\\s${attribute}\\b(\\s*=\\s*(?:"[^"]*"|'[^']*'|[^"'\\s>]+))?\\s*(?=/?>)`, 'i');

With attributes, we most likely will see a space before them, and I tried to consider that people might also use ' instead of " or nothing at all.

This should not work on

<th nowrap>
<th nowrap=nowrap>
<th nowrap="nowrap">
<th nowrap='nowrap'>
<th class="nowrap" nowrap>
<th class="nowrap" nowrap=nowrap>
<th class="nowrap" nowrap="nowrap">
<th class="nowrap" nowrap='nowrap'>

but not on

<th class="something nowrap">
<th class=nowrap>
<th title=nowrap>
<th title="The nowrap attribute is obsolete">

🤞🏻

mattbrundage commented 2 months ago

@FabianBeiner LGTM

image

j9t commented 2 months ago

(Reopening, will review! Thanks for the updates…!)

j9t commented 2 months ago

Prepared #9 for this, including a better test case (these may still be poor, but this one should finally catch what you described, @mattbrundage). The new regex seems to work well here, @FabianBeiner!

(Will let this sit for a moment and not merge right away.)

j9t commented 2 months ago

Just pushed 1.6.2, which contains improvements.

Can imagine this needs more refining, but rely on issue reports right now to take care of it. Plan to review and extend the tests to cover more cases there.