khajavi / pandoc

Automatically exported from code.google.com/p/pandoc
GNU General Public License v2.0

htmlreader takes really long to parse modest file #255

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. wget http://craphound.com/down/Cory_Doctorow_-_Down_and_Out_in_the_Magic_Kingdom.htm

2. mv Cory_Doctorow_-_Down_and_Out_in_the_Magic_Kingdom.htm down.html

3. pandoc +RTS -p -RTS --parse-raw -t native down.html

What is the expected output? What do you see instead?
Another file [1] of similar size and, I presume, similar complexity takes about
a second or two to convert.
[1] http://craphound.com/est/Cory_Doctorow_-_Eastern_Standard_Tribe.html

But on down.html pandoc used up to an hour of CPU time without producing any
output.

At first I thought the HTML reader was running into some kind of infinite tail
recursion, but when I tried to produce a minimal example I found that if I
split the file into several chunks, pandoc can convert each chunk. It still
runs for a few minutes on each chunk but eventually delivers a result.
This makes me think that the HTML reader does not actually hang but does
something of exponential (or worse) complexity.
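
The chunking experiment described above can be repeated in a more systematic way. The following is only a rough sketch in Python, not part of pandoc: the helper names `split_into_chunks` and `time_chunks` are made up for illustration, and `convert` stands for whatever callable runs the conversion. Note that splitting at fixed character offsets can cut through tags, so the chunks are not the same input the reader sees in the full file.

```python
import time


def split_into_chunks(text, n_chunks):
    """Split text into at most n_chunks pieces of nearly equal length."""
    # Ceiling division; max(1, ...) avoids a zero step for empty input.
    size = max(1, -(-len(text) // n_chunks))
    return [text[i:i + size] for i in range(0, len(text), size)]


def time_chunks(text, n_chunks, convert):
    """Run convert on each chunk and return (chunk_length, seconds) pairs.

    `convert` is a hypothetical callable that takes a string. In practice it
    could wrap a call to the pandoc executable.
    """
    results = []
    for chunk in split_into_chunks(text, n_chunks):
        start = time.perf_counter()
        convert(chunk)
        results.append((len(chunk), time.perf_counter() - start))
    return results
```

If the total time over all chunks is much smaller than the time for the whole file, and it keeps shrinking as the chunks get smaller, that points to super-linear behaviour rather than a true hang. To time pandoc itself, `convert` could be something like `lambda s: subprocess.run(["pandoc", "-f", "html", "-t", "native"], input=s, text=True, capture_output=True)`.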

Unfortunately I haven't been able to locate the cause of the problem in the
file.

What version of the product are you using? On what operating system?
I tried three versions: the Debian lenny package, the latest release, and git
as of a few days ago. All three versions show the same behaviour. The OS is
Debian lenny on i386 and amd64.

Please provide any additional information below.
I tried profiling pandoc while converting a good file [1] and the
offending file [2] (aborted after a few minutes by Ctrl-C).

[1] http://www.unet.univie.ac.at/~a0300802/files/est.prof
[2] http://www.unet.univie.ac.at/~a0300802/files/down.prof

If I read the profiling output right, then some functions have several hundred
million entries, for a file only a few hundred kB in size.

Please let me know if there is anything else I can do to help find the problem.

Harald

Original issue reported on code.google.com by hge...@users.sourceforge.net on 1 Sep 2010 at 4:42

GoogleCodeExporter commented 9 years ago
Thanks for the bug report. Yes, it sounds like an exponential blowup.

I experimented around a bit and found the following:  If you remove the 
DOCTYPE, HTML, and BODY tags (open and close) from the original document, then 
it converts quickly.  Not sure why that would make a difference, but it's a 
good clue.

Original comment by fiddloso...@gmail.com on 1 Sep 2010 at 6:14

GoogleCodeExporter commented 9 years ago
Another piece of the puzzle: if I remove the <ol> elements, together with all
the content between their tags, then conversion is much faster.

Original comment by hge...@users.sourceforge.net on 5 Sep 2010 at 10:42

GoogleCodeExporter commented 9 years ago
If you convert the file to xhtml with tidy, then pandoc converts it in about a 
second:
tidy -utf8 -asxhtml doctorow.html | pandoc -f html -t markdown

I've added heuristics to pandoc so that it can handle non-closed tags and other 
malformed xhtml (which might be well-formed html of course).  This case is 
apparently defeating my heuristics, and I'd still like to figure out how to 
improve them. But for practical purposes, you might make a point of converting 
files like this with tidy before running them through pandoc.
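
For anyone who wants to script this workaround, here is a minimal sketch in Python. It only wraps the two commands shown above and assumes that `tidy` and `pandoc` are on the PATH. The function names `build_pipeline` and `convert_via_tidy` are invented for this example.

```python
import subprocess


def build_pipeline(path, output_format="markdown"):
    """Return the argv lists for the tidy and pandoc steps."""
    # Same flags as the shell pipeline quoted in this comment.
    tidy_cmd = ["tidy", "-utf8", "-asxhtml", path]
    pandoc_cmd = ["pandoc", "-f", "html", "-t", output_format]
    return tidy_cmd, pandoc_cmd


def convert_via_tidy(path, output_format="markdown"):
    """Clean the file with tidy, then feed the result to pandoc."""
    tidy_cmd, pandoc_cmd = build_pipeline(path, output_format)
    tidy = subprocess.run(tidy_cmd, capture_output=True)
    # tidy exits with 1 when it only has warnings, which is normal for
    # tag-soup input, so only an exit status of 2 or more is treated as fatal.
    if tidy.returncode >= 2:
        raise RuntimeError(tidy.stderr.decode("utf-8", errors="replace"))
    pandoc = subprocess.run(
        pandoc_cmd, input=tidy.stdout, capture_output=True, check=True
    )
    return pandoc.stdout.decode("utf-8")
```

Calling `convert_via_tidy("down.html")` would then return the converted text as a string.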

Original comment by fiddloso...@gmail.com on 11 Sep 2010 at 3:05

GoogleCodeExporter commented 9 years ago
Thanks, that does help. I guess the problem here is the combination of nested
ol tags and non-closed li tags. It seems TagSoup handles that case fine, but
I'll have to recheck my results.

Original comment by hge...@users.sourceforge.net on 11 Sep 2010 at 8:31

GoogleCodeExporter commented 9 years ago
It's not that simple, because if you cut and paste the whole OL section into
another file, pandoc can handle it. So there's some odd effect of the context.
Tough to debug this kind of thing.

Original comment by fiddloso...@gmail.com on 12 Sep 2010 at 1:23

GoogleCodeExporter commented 9 years ago
I've completely rewritten the HTML reader, using TagSoup as a lexer.  Now 
pandoc can read the problematic file linked above without trouble.  So I'm 
closing this bug.
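
To illustrate the general idea of putting a lenient lexer in front of the parser, here is a small Python sketch. It is not pandoc's code and does not use TagSoup: it uses Python's standard `html.parser` as a stand-in lexer, and the names `TagLexer`, `lex` and `close_list_items` are made up for this example. The point is only that once the input is a flat token list, non-closed li tags can be repaired in a single pass without backtracking.

```python
from html.parser import HTMLParser

LISTS = {"ol", "ul"}


class TagLexer(HTMLParser):
    """Turn HTML, well-formed or not, into a flat list of tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.tokens.append(("text", data))


def lex(html):
    lexer = TagLexer()
    lexer.feed(html)
    lexer.close()  # flush any buffered trailing text
    return lexer.tokens


def close_list_items(tokens):
    """Insert the missing ("close", "li") tokens in one linear pass."""
    out, stack = [], []
    for kind, value in tokens:
        if kind == "open" and value == "li":
            if stack and stack[-1] == "li":  # previous item was never closed
                out.append(("close", "li"))
                stack.pop()
            stack.append("li")
        elif kind == "open" and value in LISTS:
            stack.append(value)
        elif kind == "close" and value == "li":
            if stack and stack[-1] == "li":
                stack.pop()
        elif kind == "close" and value in LISTS:
            if stack and stack[-1] == "li":  # last item of the list left open
                out.append(("close", "li"))
                stack.pop()
            if stack and stack[-1] == value:
                stack.pop()
        out.append((kind, value))
    return out
```

For example, `close_list_items(lex("<ol><li>a<li>b</ol>"))` yields a token list in which each li is closed before the next one opens. The sketch only handles list items; a real reader has to deal with many more element types.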

Original comment by fiddloso...@gmail.com on 15 Jan 2011 at 3:33