squizlabs / HTML_CodeSniffer

HTML_CodeSniffer is a client-side JavaScript application that checks an HTML document or source code and detects violations of a defined coding standard. It ships with standards covering the three conformance levels of the W3C's Web Content Accessibility Guidelines (WCAG) 2.0 and the U.S. Section 508 legislation.
https://squizlabs.github.io/HTML_CodeSniffer/
BSD 3-Clause "New" or "Revised" License

Strange conflicts with JS timers #198

Open micahsaul opened 7 years ago

micahsaul commented 7 years ago

Discovered while working on https://github.com/18F/domain-scan and https://github.com/18F/pulse:

When running https://github.com/pa11y/pa11y against https://portal.hud.gov/hudportal/HUD, the scan locks up. I raised an issue there (https://github.com/pa11y/pa11y/issues/274), and in further investigation they noticed two things: 1) the lock appears to happen when the page has a particular JS method in the body onload, which involves a timer in the global scope; and 2) HTML_CodeSniffer locks up on the same page, and they rely on HTML_CodeSniffer for their scans.

Let me know if you need any more information!

cc: @hollsk @rowanmanning

ironikart commented 7 years ago

I was able to replicate this with just the source HTML of that page, so we can probably rule out conflicts with other (external) JS running on the page as the direct cause. What I did find on line 993 of the source code was a (possible) stray open div tag that, when removed, allowed HTMLCS to run to completion. I've tried isolating it, but it has some odd interaction with `<br>` tags starting at line 1700. Remove either the div or the break tags and HTMLCS runs fine.

The page has a number of HTML validation errors, and some combination of them seems to throw HTMLCS into an infinite loop.

As a short-term workaround for automation, it may be best to validate the HTML first and bail on failure, or to convert the content to valid HTML before passing it to HTMLCS. Neither option really sits well with me: HTMLCS happily parses many other invalid sites, and scanning fixed-up content may obscure some issues because the source has been manipulated; it's also very difficult to achieve with a headless browser.
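To illustrate the "validate first and bail" idea, a scanner could run a cheap well-formedness check before handing the document to HTMLCS. This is only a sketch: the tag-balance check below is a naive heuristic I'm assuming for illustration, not a real validator, but it would catch gross imbalances like the stray open `div` described above.

```javascript
// Naive well-formedness heuristic: compare counts of opening vs. closing
// tags for a few structural elements. A real pipeline would use a proper
// validator; this only flags gross imbalances such as a stray open <div>.
function looksBalanced(html, tags = ['div', 'span', 'table']) {
  return tags.every((tag) => {
    const opens = (html.match(new RegExp(`<${tag}(\\s|>)`, 'gi')) || []).length;
    const closes = (html.match(new RegExp(`</${tag}>`, 'gi')) || []).length;
    return opens === closes;
  });
}

// Usage: bail before scanning rather than risk a hang.
// if (!looksBalanced(pageSource)) { throw new Error('unbalanced markup, skipping scan'); }
```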

Longer term, I will look at isolating this loop and implementing a timeout.
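A caller-side watchdog is one way such a timeout could look. This is a sketch, not HTMLCS's implementation: it wraps any callback-style API (such as `HTMLCS.process(standard, content, callback)`) in a promise that rejects if the callback never fires. Note that a `setTimeout` in the same thread cannot interrupt a loop that blocks the event loop entirely, so in a headless-browser setup the watchdog really belongs in the controlling process.

```javascript
// Caller-side timeout for a callback-style API. If `run` never invokes
// its callback within `ms` milliseconds, the promise rejects instead of
// hanging forever. This cannot preempt a tight loop that blocks the JS
// thread; for that, the timeout must live in the process driving the page.
function withTimeout(run, ms) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms
    );
    run(() => {
      clearTimeout(timer);
      resolve();
    });
  });
}

// Example wiring (HTMLCS must be loaded in the page for this to work):
// withTimeout((done) => HTMLCS.process('WCAG2AA', document, done), 5000)
//   .then(() => console.log(HTMLCS.getMessages()))
//   .catch((err) => console.error(err.message));
```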

micahsaul commented 7 years ago

Oh, interesting. Thanks, I hadn't actually delved too deeply into it myself. Let me know if there's anything else I can help with.

A side note, which I also made in the pa11y repo: sadly, I have no control over the content of the site in question, but I will see if I can find a contact there to let them know they have a bunch of validation errors.