jgm / pandoc

Universal markup converter
https://pandoc.org

htmlreader takes really long to parse modest file #112

Closed jgm closed 13 years ago

jgm commented 13 years ago

What steps will reproduce the problem?

  1. wget http://craphound.com/down/Cory_Doctorow_-_Down_and_Out_in_the_Magic_Kingdom.htm
  2. mv Cory_Doctorow_-_Down_and_Out_in_the_Magic_Kingdom.htm down.html
  3. pandoc +RTS -p -RTS --parse-raw -t native down.html

What is the expected output? What do you see instead? Another file [1] of similar size and, I presume, similar complexity takes about a second or two to convert. [1] http://craphound.com/est/Cory_Doctorow_-_Eastern_Standard_Tribe.html

But on down.html pandoc used up to an hour of CPU time without producing any output.

At first I thought the html reader was running into some kind of infinite tail recursion, but when I tried to produce a minimal example I found that if I split the file into several chunks, pandoc can convert each chunk. It still runs for a few minutes on each chunk but finally delivers something. This makes me think that the html reader does not actually hang but does something of exponential (or worse) complexity.
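The chunking experiment described above can be sketched roughly as follows. This is only an illustration: `split_into_chunks` and `time_chunks` are hypothetical helper names, the split is done naively at line boundaries (which may cut an HTML element in half), and `convert` stands in for whatever actually invokes the converter.

```python
import time


def split_into_chunks(text, n_chunks):
    """Split text into at most n_chunks pieces at line boundaries (n_chunks >= 1)."""
    lines = text.splitlines(keepends=True)
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return ["".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def time_chunks(chunks, convert):
    """Run convert on each chunk; return (index, chunk length, seconds) tuples."""
    timings = []
    for index, chunk in enumerate(chunks):
        start = time.perf_counter()
        convert(chunk)
        timings.append((index, len(chunk), time.perf_counter() - start))
    return timings
```

If the running time per chunk grows much faster than the chunk length, that points to superlinear behaviour rather than a true hang.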

Unfortunately I haven't been able to locate the cause of the problem in the file.

What version of the product are you using? On what operating system? I tried three versions: the Debian lenny package, the latest release, and git as of a few days ago. All three versions show the same behaviour. The OS is Debian lenny on i386 and amd64.

Please provide any additional information below. I tried profiling pandoc while converting a good file [1] and the offending file [2] (aborted after a few minutes with Ctrl-C).

[1] http://www.unet.univie.ac.at/~a0300802/files/est.prof [2] http://www.unet.univie.ac.at/~a0300802/files/down.prof

If I read the profiling output correctly, some functions have several hundred million entries, for a file only a few hundred kB in size.
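To show how entry counts of that magnitude can arise from a small input, here is a toy backtracking parser with a call counter. It is not pandoc's HTML reader; the grammar, the `<x>` placeholder tag, and the function `count_calls` are all assumptions made for illustration. The parser first tries to read an element as properly closed and, when the closing tag is missing, throws that work away and re-parses the children as an unclosed element.

```python
OPEN, CLOSE = "<x>", "</x>"


def count_calls(tokens):
    """Count calls to element() made by a naive backtracking parser."""
    calls = 0

    def element(pos):
        nonlocal calls
        calls += 1
        if pos >= len(tokens) or tokens[pos] != OPEN:
            return None  # no element starts here
        # Alternative 1: opening tag, children, closing tag.
        end = children(pos + 1)
        if end < len(tokens) and tokens[end] == CLOSE:
            return end + 1
        # Alternative 2: unclosed element. The children are parsed again
        # from scratch, which doubles the work at every nesting level.
        return children(pos + 1)

    def children(pos):
        while True:
            nxt = element(pos)
            if nxt is None:
                return pos
            pos = nxt

    element(0)
    return calls
```

With ten nested unclosed tags, `count_calls([OPEN] * 10)` returns 3069, while the properly closed input `[OPEN] * 10 + [CLOSE] * 10` needs only 20 calls. Each extra level of unclosed nesting roughly doubles the count, so a few dozen levels would already be enough to reach hundreds of millions of entries in this toy model.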

Please let me know if there is anything else I can do to help find the problem.

Harald

Google Code Info: Issue #: 255 Author: hge...@users.sourceforge.net Created On: 2010-09-01T16:42:11.000Z Closed On: 2011-01-15T03:33:29.000Z

jgm commented 13 years ago

Thanks for the bug report. Yes, it sounds like an exponential blowup.

I experimented around a bit and found the following: If you remove the DOCTYPE, HTML, and BODY tags (open and close) from the original document, then it converts quickly. Not sure why that would make a difference, but it's a good clue.

Google Code Info: Author: fiddloso...@gmail.com Created On: 2010-09-01T18:14:06.000Z

jgm commented 13 years ago

Another piece of the puzzle: if I remove all the content between

tags together with the tags, then conversion is much faster.

Google Code Info: Author: hge...@users.sourceforge.net Created On: 2010-09-05T22:42:49.000Z

jgm commented 13 years ago

If you convert the file to xhtml with tidy, then pandoc converts it in about a second:

    tidy -utf8 -asxhtml doctorow.html | pandoc -f html -t markdown

I've added heuristics to pandoc so that it can handle non-closed tags and other malformed xhtml (which might be well-formed html of course). This case is apparently defeating my heuristics, and I'd still like to figure out how to improve them. But for practical purposes, you might make a point of converting files like this with tidy before running them through pandoc.
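For scripted use, the tidy-then-pandoc workaround could be wrapped along these lines. This is a sketch under assumptions: the function names are hypothetical, both `tidy` and `pandoc` must be on the PATH, and the flags are simply the ones from the command line quoted in this comment.

```python
import subprocess

# Same pandoc flags as in the pipeline quoted above.
PANDOC_COMMAND = ["pandoc", "-f", "html", "-t", "markdown"]


def tidy_command(path):
    """Build the tidy invocation that rewrites the file as XHTML on stdout."""
    return ["tidy", "-utf8", "-asxhtml", path]


def convert_via_tidy(path):
    """Clean the HTML with tidy, then feed the result to pandoc on stdin."""
    tidy = subprocess.run(tidy_command(path), capture_output=True)
    # tidy exits with 1 when it only emitted warnings, so treat only
    # higher exit codes as failures.
    if tidy.returncode > 1:
        raise RuntimeError(tidy.stderr.decode("utf-8", errors="replace"))
    pandoc = subprocess.run(
        PANDOC_COMMAND, input=tidy.stdout, capture_output=True, check=True
    )
    return pandoc.stdout.decode("utf-8")
```

Calling `convert_via_tidy("doctorow.html")` would then return the Markdown text as a string.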

Google Code Info: Author: fiddloso...@gmail.com Created On: 2010-09-11T03:05:26.000Z

jgm commented 13 years ago

Thanks, that does help. I guess the problem here is the combination of nested ol tags and non-closed li tags. It seems tagsoup handles that case fine, but I'll have to recheck my results.

Google Code Info: Author: hge...@users.sourceforge.net Created On: 2010-09-11T20:31:26.000Z

jgm commented 13 years ago

It's not that simple, because if you cut and paste the whole OL section into another file, pandoc can handle it. So there's some odd effect of the context. Tough to debug this kind of thing.

Google Code Info: Author: fiddloso...@gmail.com Created On: 2010-09-12T01:23:05.000Z

jgm commented 13 years ago

I've completely rewritten the HTML reader, using TagSoup as a lexer. Now pandoc can read the problematic file linked above without trouble. So I'm closing this bug.
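The general idea of lexing first and parsing afterwards can be illustrated with a small sketch. It uses Python's standard `html.parser` as a stand-in for TagSoup and is not the code of the rewritten reader; `TagLexer`, `lex`, and `list_items` are hypothetical names, and nested lists are deliberately flattened to keep the example short.

```python
from html.parser import HTMLParser


class TagLexer(HTMLParser):
    """Turn possibly malformed HTML into a flat list of tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip()))


def lex(html):
    lexer = TagLexer()
    lexer.feed(html)
    lexer.close()
    return lexer.tokens


def list_items(tokens):
    """Collect list item text in a single pass over the token stream.

    A new <li> or a closing list tag implicitly ends the current item,
    so unclosed <li> tags need no backtracking.
    """
    items, current = [], None
    for kind, value in tokens:
        if kind == "open" and value == "li":
            if current is not None:
                items.append(" ".join(current))
            current = []
        elif kind == "close" and value in ("li", "ol", "ul"):
            if current is not None:
                items.append(" ".join(current))
                current = None
        elif kind == "text" and current is not None:
            current.append(value)
    if current is not None:
        items.append(" ".join(current))
    return items
```

For example, `list_items(lex("<ol><li>one<li>two</ol>"))` returns `["one", "two"]`, and every token is visited exactly once regardless of how many tags are left open.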

Google Code Info: Author: fiddloso...@gmail.com Created On: 2011-01-15T03:33:29.000Z