Closed mattbrundage closed 2 months ago
Yes! This must be caught and dealt with appropriately. Will work on an update, thanks!
const attributeRegex = new RegExp(`<[^>]*\\b${attribute}\\b(?=\\s*(=|\\s*[/]*>))`, 'i');
maybe?
Thanks @FabianBeiner for the proactive feedback! Was just about to jump at this after preparing a (basic) test, but that regex seems to work!
Prepared PR #4 for this. Feel free to review and leave feedback. As the whole project is an AI test case, I’m also working with and testing a number of tools to support it.
Wanted to make more time available for review, but accidentally merged the PR—still open to feedback if you have more thoughts! Thanks for the report, @mattbrundage, and the update, @FabianBeiner!
@FabianBeiner your solution is an incremental improvement, but still matches sequences such as <th class=nowrap>, which happens to be a common convention in my own projects.
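To make the false positive concrete, here is a minimal sketch that runs the first proposed pattern against three inputs from this thread. The wrapper name `firstAttemptRegex` is hypothetical and only exists for this illustration; the pattern itself is copied from the comment above.

```javascript
// Hypothetical wrapper around the first proposed pattern (name is an assumption).
function firstAttemptRegex(attribute) {
  return new RegExp(`<[^>]*\\b${attribute}\\b(?=\\s*(=|\\s*[/]*>))`, 'i');
}

const firstAttempt = firstAttemptRegex('nowrap');

// Intended match: valueless boolean attribute.
console.log(firstAttempt.test('<th nowrap>')); // true

// Correctly ignored: the word only appears inside a quoted value.
console.log(firstAttempt.test('<th title="The nowrap attribute is obsolete">')); // false

// False positive: \b also matches between "=" and "nowrap",
// so an unquoted class value is reported as the attribute.
console.log(firstAttempt.test('<th class=nowrap>')); // true
```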
Oh, the infamous unquoted attribute value syntax, of course I forgot about that. 🙈 Which brings us back to your original words:
Using regex on HTML is a minefield.
However, here is an update:
const attributeRegex = new RegExp(`<[^>]*\\s${attribute}\\b(\\s*=\\s*(?:"[^"]*"|'[^']*'|[^"'\\s>]+))?\\s*(?=/?>)`, 'i');
With attributes, we most likely will see a space before them, and I tried to consider that people might also use ' instead of " or nothing at all.
This should match
<th nowrap>
<th nowrap=nowrap>
<th nowrap="nowrap">
<th nowrap='nowrap'>
<th class="nowrap" nowrap>
<th class="nowrap" nowrap=nowrap>
<th class="nowrap" nowrap="nowrap">
<th class="nowrap" nowrap='nowrap'>
but not on
<th class="something nowrap">
<th class=nowrap>
<th title=nowrap>
<th title="The nowrap attribute is obsolete">
🤞🏻
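The two lists above can be checked mechanically. The following is only a sketch, not the project's actual test: the function name `attributeRegex` (as a function) and the arrays `shouldMatch` / `shouldNotMatch` are made up for illustration, while the pattern is the updated one from this comment.

```javascript
// Hypothetical wrapper around the updated pattern (names are assumptions).
function attributeRegex(attribute) {
  return new RegExp(
    `<[^>]*\\s${attribute}\\b(\\s*=\\s*(?:"[^"]*"|'[^']*'|[^"'\\s>]+))?\\s*(?=/?>)`,
    'i'
  );
}

const shouldMatch = [
  '<th nowrap>',
  '<th nowrap=nowrap>',
  '<th nowrap="nowrap">',
  "<th nowrap='nowrap'>",
  '<th class="nowrap" nowrap>',
  '<th class="nowrap" nowrap=nowrap>',
  '<th class="nowrap" nowrap="nowrap">',
  "<th class=\"nowrap\" nowrap='nowrap'>",
];

const shouldNotMatch = [
  '<th class="something nowrap">',
  '<th class=nowrap>',
  '<th title=nowrap>',
  '<th title="The nowrap attribute is obsolete">',
];

const re = attributeRegex('nowrap');

// Collect any input whose result differs from the expectation.
const failures = [
  ...shouldMatch.filter((html) => !re.test(html)),
  ...shouldNotMatch.filter((html) => re.test(html)),
];

console.log(failures.length === 0 ? 'all cases pass' : failures);

// The trailing lookahead only accepts "/>" or ">" right after the attribute,
// so an attribute that is not last in the tag is not detected.
const notLastDetected = re.test('<th nowrap class="x">');
console.log(notLastDetected); // false
```

All twelve cases from the lists behave as described. One caveat worth keeping in mind: because of the final `(?=/?>)`, the attribute is only found when it is the last one in the tag, so `<th nowrap class="x">` goes unreported.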
@FabianBeiner LGTM
(Reopening, will review! Thanks for the updates…!)
Prepared #9 for this, including a better test case (these may still be poor, but this one should finally catch what you described, @mattbrundage). The new regex seems to work well here, @FabianBeiner!
(Will let this sit for a moment and not merge right away.)
Valueless attribute use is common with boolean attributes. Among your list of obsolete attributes, "noshade" and "nowrap" need special handling to account for scenarios such as <th nowrap>, while also avoiding false positives such as <th title="The nowrap attribute is obsolete">.
Using regex on HTML is a minefield.