Open sthelen opened 3 years ago
@sthelen, thank you for the issue, and the sample to reproduce the problem...
Seems everything is ok, until tidy gets to R=7 C=1856
...
Running a special DEBUG version of tidy, which writes a log; below is the log output at that point, and several loops thereafter...
R=7 C=1851: Returning lex-token node <dd> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning endtag node <table> stream
Entering ParseBlock 1... 42 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 43 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 44 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 45 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 46 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 47 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
...
Notice that the number before the dd continues to mount... 45, 46, 47, ... and I killed the process nearly half a million iterations later -
...
Entering ParseBlock 1... 441594 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 441595 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
R=7 C=1856: Returning lex-token node <table> lexer
Entering ParseBlock 1... 441596 dd
R=7 C=1856: Returning lex-token node <table> lexer
At the same time, the debugger showed ever-increasing memory usage - it had reached over 100 MB by the time I stopped it... and of course, had I left it running, it would eventually have run out of memory... and maybe been stopped by the system...
So I restarted it... at nearly 2 million loops, it had consumed over half a GB of memory -
Entering ParseBlock 1... 1968300 dd
R=7 C=1856: Returning lex-token node <table> lexer
Exit ParseBlock 2 0...
And I assume it would fail at 2, 4, or more GB... whatever the actual process maximum is...
But it is not a memory problem... it is an infinite loop problem... that leads to memory problems... if you see what I mean...
Need to find exactly the sequence that causes this forever looping... and then figure out why, and how to break it...
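One defensive idea, purely as a sketch of "how to break it" rather than anything already in libTidy, would be to notice when the parser keeps seeing the same token at the same row/column without consuming input, and bail out instead of looping forever; the names below (LoopGuard, GuardOk, maxRepeats) are hypothetical:

/* Hypothetical sketch: not actual libTidy code. Track how many times the
 * parser re-visits the same input position; if it never advances, assume
 * an infinite loop and stop instead of allocating nodes forever. */
typedef struct {
    int lastRow;
    int lastCol;
    int repeats;
} LoopGuard;

static int GuardOk(LoopGuard *g, int row, int col, int maxRepeats)
{
    if (row == g->lastRow && col == g->lastCol)
    {
        if (++g->repeats > maxRepeats)
            return 0;        /* no forward progress: abort the parse */
    }
    else
    {
        g->lastRow = row;    /* input advanced: reset the counter */
        g->lastCol = col;
        g->repeats = 0;
    }
    return 1;                /* still making progress */
}

That would only turn the hang into an error, of course; the real fix still needs to address why the token is never consumed.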
At row 7, column 1856, it is processing <table><dd></dd></table>, just before the </table>, but feeding that snippet to tidy does not trigger the infinite loop... so I need to look deeper... will keep at it, as time permits...
Any help, feedback, much appreciated... thanks...
With a little Python helper and some further manual trial and error, I managed to find a true minimal example that still appears to reproduce the bug (I did kill tidy after letting it run for 10 seconds):
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<thead><dl><table><dd></dd></table></dl></thead>
</body>
</html>
Well, actually <thead><dl><table><dd></dd></table></dl></thead> is enough. No need for the full HTML boilerplate.
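For anyone who wants to trigger this programmatically rather than from the command line, here is a small sketch against the public libtidy API (assuming a standard libtidy install, linked with -ltidy); with the bug present, the tidyParseString call is expected to never return:

#include <stdio.h>
#include <tidy.h>
#include <tidybuffio.h>

int main(void)
{
    /* The minimal fragment from this thread. */
    const char *frag = "<thead><dl><table><dd></dd></table></dl></thead>";

    TidyDoc tdoc = tidyCreate();
    TidyBuffer errbuf = {0};

    /* Collect diagnostics in a buffer instead of printing to stderr. */
    tidySetErrorBuffer(tdoc, &errbuf);

    /* With the bug present, this call spins in ParseBlock while
       memory use keeps growing, and never returns. */
    int rc = tidyParseString(tdoc, frag);
    printf("tidyParseString returned %d\n", rc);

    tidyBufFree(&errbuf);
    tidyRelease(tdoc);
    return 0;
}

Against a fixed build, this should instead return promptly with warnings about the misplaced tags.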
@sthelen thank you for finding a minimal sample that causes this infinite loop - this really helps with the debugging...
FWIW, keeping the HTML boilerplate is not a bad thing - it makes it easy to present to the W3C validator, to see what it says - but either is fine... thanks...
In tracing through the code in the debugger, I was struck by the comment, and code fix, found there, referring to an infinite loop reported back in 2005 -
/* Do not match across BODY to avoid infinite loop
between ParseBody and this parser,
See http://tidy.sf.net/bug/1098012. */
At that point, if libTidy were to discard this end tag, </dd>, as tidy does with all end tags, then no infinite loop is set up... just, how to detect and do that, without damaging other things, causing a regression, etc, etc?
Usually, marching back up the stored parent tags, tidy should find the opening tag, discard this close tag, and exit the loop... But in this case the opening <dd> has been moved out of the current parent list, so it is not found...
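To make that intended recovery concrete: walk back up the stored parents looking for a matching start tag, and if none is found (as here, because the <dd> has been re-parented out of the current chain), simply discard the stray end tag. The sketch below is illustrative only - Node, MatchEndTag, and TAG_BODY_ID are stand-ins, not the real libTidy structures - and it also folds in the 2005 "do not match across BODY" guard quoted above:

/* Illustrative stand-ins only -- not the actual libTidy parser code. */
typedef struct Node {
    struct Node *parent;
    int          tagId;
} Node;

enum { TAG_BODY_ID = 1 };    /* hypothetical id for <body> */

/* Returns 1 if a matching start tag exists in the open-element chain,
   0 if the end tag should simply be discarded so parsing moves forward. */
static int MatchEndTag(Node *current, int endTagId)
{
    const Node *n;
    for (n = current; n != NULL; n = n->parent)
    {
        if (n->tagId == endTagId)
            return 1;        /* found the opening tag: close normally */
        if (n->tagId == TAG_BODY_ID)
            break;           /* per the 2005 fix: never match across BODY */
    }
    /* No opening tag found (e.g. the <dd> was moved out of this chain):
       discard the stray end tag instead of re-queueing it forever. */
    return 0;
}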
Hmmmm, ran out of ideas, so will have to come back to this later...
Meantime, I look forward to further feedback, ideas, patches, etc... we certainly need to kill this infinite loop... thanks...
I have stumbled upon a web page online that appears to overwhelm tidy somehow.
Observed behavior: Tidy keeps processing the page until it exhausts all available memory. This appears to happen in a tight loop, as the process saturates one core while acquiring more and more memory. The OS has to step in and kill the process.
Expected behavior: tidy exits quickly with an error message.
Further info: I managed to reduce the page to basically a single line with some boilerplate around it, which still reproduces the behavior. Admittedly, that single line is a large and very strange collection of nested html tags wrapped in a <div style="display:none"> block. There is still potential to reduce that single line further. I did some manual attempts at finding a true minimal example, but gave up after a while: reduced.html.zip. I did remove some blocks of tags at the front and end of that line, while still reproducing the observed memory issue. Interestingly, when removing what is now at the front (<b><td></td></b>), tidy bails out quickly, telling me that the document is too broken to fix. But when I remove all other tags and keep just this little portion, that does not reproduce the issue. So there are some complex dependencies at play here.