gjtorikian / html-proofer

Test your rendered HTML files to make sure they're accurate.
MIT License
1.57k stars 196 forks source link

The regex in Attribute::URL.clean_url is very slow for JavaScript bookmarklet URLs #816

Closed alexwlchan closed 3 months ago

alexwlchan commented 5 months ago

I recently upgraded from HTML-Proofer 3 to 5, and noticed that it was taking an exorbitant length of time to run on a site that had previously been fine.

I eventually traced the issue to a page with a bookmarklet on it; a long javascript:… URL with various URL-encoded entities – trying to check that URL was taking a very long time (multiple minutes).

Steps to reproduce

Here's a short program that tries to lint an HTML file with the problematic link:

require 'html-proofer'

File.open('bookmarklet.html', 'w') do |f|
  f.puts '<a href="javascript:var%20ids%20=%20[%22omnibox%22,%20%22cards%22,%20%22welcome%22];var%20classes%20=%20[%22widget-viewcard%22,%20%22widget-zoom%22,%20%22watermark%22];var%20hidden%20=%20(window.getComputedStyle(document.getElementById(ids[0]))).getPropertyValue(%22display%22);if%20(hidden%20!==%20%22none%22)%20{var%20disp%20=%20%22none%22;}%20else%20{var%20disp%20=%20%22%22;}for%20(var%20i%20=%200;%20i%20&lt;%20ids.length;%20i++)%20{document.getElementById(ids[i]).style.display%20=%20disp;}for%20(var%20i%20=%200;%20i%20&lt;%20classes.length;%20i++)%20{var%20div%20=%20document.getElementsByClassName(classes[i]);for%20(var%20j%20=%200;%20j%20&lt;%20div.length;%20j++)%20{div[j].style.display%20=%20disp;}}">Toggle Google Maps</a>'
end

HTMLProofer.check_file(
  'bookmarklet.html', {
    checks: ['Links'],
    log_level: :debug
  }
).run

The output I see is:

$ ruby repro.rb
Running 1 check (Links) in bookmarklet.html on *.html files ...

Running Links in bookmarklet.html

but it never actually completes.

I ran this with the latest version of HTML-Proofer and Ruby:

$ bundle exec htmlproofer --version
5.0.8

$ ruby --version
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [arm64-darwin23]

The output I see is:

Debugging analysis

I added a bunch of print statements to determine where it was going and where it was getting stuck. This stuff may be obvious to people who know the library, but this is a new codebase to me:

Here's where that log message comes from: https://github.com/gjtorikian/html-proofer/blob/1c8b2c12474d3e60401dd6200c2c72833af2ac31/lib/html_proofer/runner.rb#L130-L134

In this case check is a link check. Follow that down a level, I found it stalled on this line: https://github.com/gjtorikian/html-proofer/blob/1c8b2c12474d3e60401dd6200c2c72833af2ac31/lib/html_proofer/check/links.rb#L8

Here node is the <a> tag, I think?

If you follow the chain of function calls, create_element(node) calls Element.new(@runner, node, base_url: base_url), and following that down I found the issue is somewhere in Attribute::Url – it gets stuck on this line: https://github.com/gjtorikian/html-proofer/blob/1c8b2c12474d3e60401dd6200c2c72833af2ac31/lib/html_proofer/element.rb#L19

And if you go and look at how Attribute::Url works, you see that the initialiser calls clean_url!, and that turns out to be the root of the slowness: https://github.com/gjtorikian/html-proofer/blob/1c8b2c12474d3e60401dd6200c2c72833af2ac31/lib/html_proofer/attribute/url.rb#L223-L228

We can verify this by running my URL against the regex as a standalone program, and observing that it takes a very long time:

url = "javascript:var%20ids%20=%20[%22omnibox%22,%20%22cards%22,%20%22welcome%22];var%20classes%20=%20[%22widget-viewcard%22,%20%22widget-zoom%22,%20%22watermark%22];var%20hidden%20=%20(window.getComputedStyle(document.getElementById(ids[0]))).getPropertyValue(%22display%22);if%20(hidden%20!==%20%22none%22)%20{var%20disp%20=%20%22none%22;}%20else%20{var%20disp%20=%20%22%22;}for%20(var%20i%20=%200;%20i%20&lt;%20ids.length;%20i++)%20{document.getElementById(ids[i]).style.display%20=%20disp;}for%20(var%20i%20=%200;%20i%20&lt;%20classes.length;%20i++)%20{var%20div%20=%20document.getElementsByClassName(classes[i]);for%20(var%20j%20=%200;%20j%20&lt;%20div.length;%20j++)%20{div[j].style.display%20=%20disp;}}"

puts url =~ /^([!#{Regexp.last_match(0)}-;=?-\[\]_a-z~]|%[0-9a-fA-F]{2})+$/

This immediately starts consuming 100% of my Mac's CPU, and has been since I started writing this issue. It has yet to complete.

Suggested fix

This regex first appeared in #394. I'm not sure what it's meant to be doing.

I'm also not sure what clean_url! is for; I wonder if it should be doing anything with javascript: URLs? Is there anything HTML-Proofer can do with them downstream? If not, maybe it should just skip any URL starting javascript: here?

Notes

This problem occurs even if I add the data-proofer-ignore attribute to my <a> tag – that was mildly surprising to me.

gjtorikian commented 4 months ago

Hi, my apologies for not seeing this sooner. That regexp was, in fact, doing nothing, and was probably vestigial. Removing it in #820 still lets the test suite pass, so I'll release this fix in the next patch release this week.

This problem occurs even if I add the data-proofer-ignore attribute to my <a> tag – that was mildly surprising to me.

Hm, I tried to recreate this and couldn't. If you have another test case after the fix above is merged, I'll take a look again.

alexwlchan commented 3 months ago

I tried to recreate this and couldn't. If you have another test case after the fix above is merged, I'll take a look again.

Sorry for not including working code. Here's a modified version of the original repro, which would take a long time to run in 5.0.8 even though the <a> tag has the data-proofer-ignore attribute:

require 'html-proofer'

File.open('bookmarklet.html', 'w') do |f|
  f.puts '<a href="javascript:var%20ids%20=%20[%22omnibox%22,%20%22cards%22,%20%22welcome%22];var%20classes%20=%20[%22widget-viewcard%22,%20%22widget-zoom%22,%20%22watermark%22];var%20hidden%20=%20(window.getComputedStyle(document.getElementById(ids[0]))).getPropertyValue(%22display%22);if%20(hidden%20!==%20%22none%22)%20{var%20disp%20=%20%22none%22;}%20else%20{var%20disp%20=%20%22%22;}for%20(var%20i%20=%200;%20i%20&lt;%20ids.length;%20i++)%20{document.getElementById(ids[i]).style.display%20=%20disp;}for%20(var%20i%20=%200;%20i%20&lt;%20classes.length;%20i++)%20{var%20div%20=%20document.getElementsByClassName(classes[i]);for%20(var%20j%20=%200;%20j%20&lt;%20div.length;%20j++)%20{div[j].style.display%20=%20disp;}}" data-proofer-ignore>Toggle Google Maps</a>'
end

HTMLProofer.check_file(
  'bookmarklet.html', {
    checks: ['Links'],
    log_level: :debug
  }
).run

But in 5.0.9 both this and the original repro complete immediately, so I consider this done. Thanks!