GoogleCodeExporter closed this issue 8 years ago
Thanks for the bug report. Yes, it sounds like an exponential blowup.
I experimented a bit and found the following: if you remove the DOCTYPE,
HTML, and BODY tags (both opening and closing) from the original document,
it converts quickly. I'm not sure why that would make a difference, but it's
a good clue.
Original comment by fiddloso...@gmail.com
on 1 Sep 2010 at 6:14
A further piece of the puzzle: if I remove the <ol> tags together with all the
content between them, conversion is much faster.
Original comment by hge...@users.sourceforge.net
on 5 Sep 2010 at 10:42
If you convert the file to XHTML with tidy, then pandoc converts it in about a
second:

    tidy -utf8 -asxhtml doctorow.html | pandoc -f html -t markdown

I've added heuristics to pandoc so that it can handle non-closed tags and other
malformed XHTML (which may, of course, be well-formed HTML). This case
apparently defeats my heuristics, and I'd still like to figure out how to
improve them. For practical purposes, though, you might make a point of running
files like this through tidy before passing them to pandoc.
Original comment by fiddloso...@gmail.com
on 11 Sep 2010 at 3:05
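
For anyone who wants to script the tidy workaround described above, here is a minimal Python sketch. It assumes both tidy and pandoc are on the PATH; the function names (tidy_command, pandoc_command, html_to_markdown) are made up for illustration and are not part of either tool. The flags are the ones from the command quoted in the comment.

```python
import shutil
import subprocess


def tidy_command():
    # Same flags as the tidy invocation quoted in the thread.
    return ["tidy", "-utf8", "-asxhtml"]


def pandoc_command():
    # Same flags as the pandoc invocation quoted in the thread.
    return ["pandoc", "-f", "html", "-t", "markdown"]


def html_to_markdown(html_path):
    """Run tidy on html_path, then feed tidy's output to pandoc on stdin."""
    for tool in ("tidy", "pandoc"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")

    tidy = subprocess.run(tidy_command() + [html_path], capture_output=True)
    # tidy exits with 1 when it only issued warnings, which is expected
    # for malformed input; an exit status of 2 means real errors.
    if tidy.returncode > 1:
        raise RuntimeError(tidy.stderr.decode("utf-8", errors="replace"))

    pandoc = subprocess.run(
        pandoc_command(), input=tidy.stdout, capture_output=True, check=True
    )
    return pandoc.stdout.decode("utf-8")
```

Calling html_to_markdown("doctorow.html") should then behave like the shell pipeline above, returning the Markdown as a string.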
Thanks, that does help. I guess the problem here is the combination of nested
<ol> tags and non-closed <li> tags. It seems TagSoup handles that case fine,
but I'll have to recheck my results.
Original comment by hge...@users.sourceforge.net
on 11 Sep 2010 at 8:31
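
To test the "nested ol plus non-closed li" hypothesis in isolation, one could generate synthetic input of that shape and time pandoc on it at increasing depths. The helper below is a hypothetical sketch (the name nested_unclosed_lists and its parameters are invented here), and it may well not reproduce the slowdown, since the original document evidently involves more than this pattern alone.

```python
def nested_unclosed_lists(depth, items_per_level=2):
    """Build HTML with `depth` nested <ol> elements whose <li> tags are never closed.

    Each nested list sits inside the last (still open) <li> of its parent.
    """
    if depth <= 0:
        return ""
    parts = ["<ol>"]
    for i in range(items_per_level):
        parts.append(f"<li>item {depth}.{i + 1}")
    # Recurse before closing, so the inner list lands inside the open <li>.
    parts.append(nested_unclosed_lists(depth - 1, items_per_level))
    parts.append("</ol>")
    return "\n".join(p for p in parts if p)


if __name__ == "__main__":
    print(nested_unclosed_lists(3))
```

Writing the output for depths 2, 4, 8, and so on to files and timing the conversion of each would show whether the runtime grows faster than the input does.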
It's not that simple, because if you cut and paste the whole OL section into
another file, pandoc can handle it. So there's some odd effect of the context.
Tough to debug this kind of thing.
Original comment by fiddloso...@gmail.com
on 12 Sep 2010 at 1:23
I've completely rewritten the HTML reader, using TagSoup as a lexer. Now
pandoc can read the problematic file linked above without trouble. So I'm
closing this bug.
Original comment by fiddloso...@gmail.com
on 15 Jan 2011 at 3:33
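
As a rough illustration of the "lexer first" idea behind the rewrite, the sketch below flattens HTML into tokens and then closes omitted li elements in a single linear pass, so there is nothing to backtrack over. This is not pandoc's code: Python's html.parser stands in for TagSoup, and the names TagLexer, lex, and close_implied_li are hypothetical.

```python
from html.parser import HTMLParser


class TagLexer(HTMLParser):
    """Flatten HTML into a token list without checking that tags balance."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("open", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("close", tag))

    def handle_data(self, data):
        if data.strip():
            self.tokens.append(("text", data.strip()))


def lex(html):
    lexer = TagLexer()
    lexer.feed(html)
    lexer.close()
    return lexer.tokens


def close_implied_li(tokens):
    """Insert the </li> tokens the author omitted, in one pass over the tokens."""
    out = []
    stack = []  # currently open <ol>/<ul>/<li> elements, innermost last
    for kind, name in tokens:
        if kind == "open" and name == "li":
            # A new item implicitly ends the previous item of the same list.
            if stack and stack[-1] == "li":
                out.append(("close", "li"))
                stack.pop()
            stack.append("li")
        elif kind == "open" and name in ("ol", "ul"):
            stack.append(name)
        elif kind == "close" and name in ("ol", "ul"):
            # Ending a list also ends an item still open inside it.
            while stack and stack[-1] == "li":
                out.append(("close", "li"))
                stack.pop()
            if stack and stack[-1] == name:
                stack.pop()
        elif kind == "close" and name == "li":
            if stack and stack[-1] == "li":
                stack.pop()
            else:
                continue  # stray </li> with no open item: drop it
        out.append((kind, name))
    return out


if __name__ == "__main__":
    html = "<ol><li>a<li>b<ol><li>c</ol><li>d</ol>"
    for token in close_implied_li(lex(html)):
        print(token)
```

Because the lexer never fails on unbalanced input, the second pass can repair the token stream with a simple stack instead of trying alternative parses.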
Original issue reported on code.google.com by
hge...@users.sourceforge.net
on 1 Sep 2010 at 4:42