tallforasmurf / PPQT

A post-processing tool for PGDP written in Python, PyQt4, and Qt
GNU General Public License v3.0

html convert hangs #141

Closed bibimbop closed 11 years ago

bibimbop commented 11 years ago

When I try to convert to html, I get this trace, and the process never finishes:

Traceback (most recent call last):
  File "ppqt/pqFlow.py", line 391, in htmlDocument
    self.theRealHTML(topBlock,endBlock)
  File "ppqt/pqFlow.py", line 1072, in theRealHTML
    mzs = QString(markupZ[markupCode])
KeyError: u'>'

I have shortened the book to this: https://dl.dropboxusercontent.com/u/94763902/misc/vercing%C3%A9torix.html

I think the converter doesn't like the spaces before or after the section numbers ('I'). If I insert a blank line before, I get a different trace.

tallforasmurf commented 11 years ago

Thank you for including a nice clean failing case! The immediate issue is that you have a spacing error: the chapter title here needs to be closed by 2 blank lines.


CHAPITRE II

LES DIEUX ARVERNES

/*
Natio est... admodum dedita religionibus.
<span class='smcap'>César</span>, <i>Guerre des Gaules</i>, VI, 16, § 1.
*/

/Q F:0 L:0
I. Auvergne et Campanie. — II. Dieux des bois, des sources
et des lacs. — III. Dieux des montagnes. — IV. Les grands
dieux et leurs résidences. — V. Teutatès au Puy de Dôme.
Q/

Either two blank lines between CHAPITRE II and LES DIEUX (if the latter is a sub-head) or after ARVERNES (if this is a two-line chapter head). When that is done it works. Without that, it takes everything to the next 2-line break as chapter, but that also includes some other markup, which causes the error.

This is all to do with the recently-added multi-line chapter title code. I will need to improve the logic to diagnose this: if any other markup appears inside a multiline chapter, there must be an error, give a message and back out.
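A minimal sketch of the kind of check described above, not the actual pqFlow.py code. The function name `check_chapter_head` and the `MARKUP_OPEN` pattern are hypothetical; the real set of markup codes PPQT recognizes may differ.

```python
import re

# Hypothetical pattern for block-markup openers such as "/*" or "/Q F:0 L:0".
# This is an assumption for illustration, not PPQT's actual markup table.
MARKUP_OPEN = re.compile(r"^/[A-Za-z*#$]")

def check_chapter_head(lines, start):
    """Scan a chapter head whose first text line is lines[start].

    Returns (stop, error). stop is the index where scanning stopped;
    error is None, or a message naming the offending line.
    """
    blanks = 0
    i = start
    while i < len(lines):
        text = lines[i].strip()
        if not text:
            blanks += 1
            if blanks == 2:              # two blank lines close the head
                return i + 1, None
        else:
            blanks = 0
            if MARKUP_OPEN.match(text):  # other markup inside the head: back out
                msg = "line %d: markup %r inside chapter head" % (i + 1, text)
                return i, msg
        i += 1
    return i, None
```

With the failing case from this issue, the scan stops at the `/*` line and reports it instead of passing the markup on to the HTML converter.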

bibimbop commented 11 years ago

The formatting is correct. The 2 blank lines are present, and in the right place. See my post and replies on dp forum, F2 fanatics team.

tallforasmurf commented 11 years ago

OK, this sucks. You send me back to the FG where I find this,

Put 4 blank lines before the "CHAPTER XXX". Include these blank lines even if the chapter starts on a new page; there are no 'pages' in an e-book, so the blank lines are needed. Then separate with a blank line each additional part of the chapter heading, such as a chapter description, opening quote, etc., and finally leave two blank lines before the start of the text of the chapter.

Judas fucking priest, when did they do that? Based on that, your example with the quote and the chapter summary is properly spaced. But how the hell does anybody expect that to be translated to HTML? Or parsed in any sensible way for any other purpose?

The only thing I can depend on is that 4-newlines starts a chapter head and 2-newlines ends it. I have no idea what PPQT should try to do with non-title items (quotes etc) that appear in that span. You can't nest a blockquote or a div inside an h2!

I'm thinking I could change the parser so that as soon as it sees some other markup begin (like your /* quote), it just closes the chapter head at that point. Pretend it sees 2 newlines. But wait, no, that would work for your examples, where the 2-newlines happens before a legitimate subhead. But if there were simply a quote, then 2-newlines and then normal text, the parser would see the text as a subhead and wrap that paragraph in h3.

As far as I can see now, PG has written a document spec that cannot be correctly parsed. Maybe I'll see another solution later (feel free to make suggestions) but for now, it requires 2-newlines after the actual head text and before any other markup, in other words, 2-newlines where </h2> should go.
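To make the rule concrete, here is a rough sketch of "4 blank lines open a chapter head, 2 blank lines close it". It is an illustration only; `chapter_head_spans` is a hypothetical helper, not a function in PPQT, and treating "4 or more" blank lines as an opener is an assumption.

```python
def chapter_head_spans(lines):
    """Return a list of (first, last) index pairs, inclusive, one per head.

    A head starts at the first nonempty line after 4 or more empty lines
    (assumption: extra empty lines are tolerated) and runs to the last
    nonempty line before the next 2 consecutive empty lines.
    """
    spans = []
    blanks = 0
    first = last = None
    for i, line in enumerate(lines):
        if line.strip():
            if first is None and blanks >= 4:
                first = i                # head opens here
            if first is not None:
                last = i                 # head extends to this line
            blanks = 0
        else:
            blanks += 1
            if first is not None and blanks == 2:
                spans.append((first, last))
                first = last = None      # head closed
    if first is not None:                # head still open at end of input
        spans.append((first, last))
    return spans
```

Run on text spaced per the guideline quoted above, the span for CHAPITRE II extends through the closing `Q/` line, which is how the quote and summary markup end up inside the chapter head. With 2 blank lines after LES DIEUX ARVERNES, the span ends at that line.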

bibimbop commented 11 years ago

I agree it's not easy. But ppqt should not hang, even if the result is so-so. Here's what guiguts generates:

<h2><a name="CHAPITRE_II" id="CHAPITRE_II">CHAPITRE II</a></h2>

<p>LES DIEUX ARVERNES</p>

<p>
Natio est... admodum dedita religionibus.<br />
<span class='smcap'>César</span>, <i>Guerre des Gaules</i>, VI, 16, § 1.<br />
</p>

<blockquote>

<p>I. Auvergne et Campanie. &#8212; II. Dieux des bois, des sources
et des lacs. &#8212; III. Dieux des montagnes. &#8212; IV. Les grands
dieux et leurs résidences. &#8212; V. Teutatès au Puy de Dôme.</p></blockquote>

<p>I</p>

<p>Contact avec la nature, c'était rapport avec les dieux.

tallforasmurf commented 11 years ago

In responding to #143 I took a look for the first time at GG's HTML conversion. When GG sees 4 empty lines followed by a nonempty line, it enters a loop in which it accumulates further nonempty lines until it sees a single empty line. The one-or-more consecutive nonempty lines are accumulated and saved as the body of the auto-generated TOC entry (which PPQT doesn't attempt). The <h2> markup goes above the first, and the </h2> goes after the last.

This is GG's solution: the actual chapter-title is the sequence of 1 or more nonempty lines following 4 empty lines, and a single empty line is the closing delimiter.
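The rule can be re-expressed as a short Python sketch. This is not GG's code, only a reading of the behavior described above; `gg_style_heads` is a hypothetical name, and accepting 4 or more empty lines as the opener is an assumption.

```python
def gg_style_heads(lines):
    """Wrap each chapter title in <h2>...</h2>.

    The title is the run of nonempty lines that follows 4 (or more,
    by assumption) empty lines; the first empty line closes it.
    """
    out = []
    blanks = 0
    in_title = False
    for line in lines:
        text = line.strip()
        if in_title:
            if text:
                out.append(text)         # still accumulating title lines
            else:
                out.append("</h2>")      # a single empty line closes the title
                out.append("")
                in_title = False
                blanks = 1
            continue
        if text:
            if blanks >= 4:
                out.append("<h2>")       # title opens above its first line
                in_title = True
            out.append(text)
            blanks = 0
        else:
            blanks += 1
            out.append("")
    if in_title:                         # input ended inside a title
        out.append("</h2>")
    return out
```

On the example from this issue, only CHAPITRE II lands inside the `<h2>`, and LES DIEUX ARVERNES is left outside it, which is consistent with the guiguts output quoted earlier.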

Also, it appears from a casual reading that GG never inserts an <h3> markup for a subhead. A comment in its code explicitly says "#open subheading with <p>", and that appears to be what it does on a nonempty line after two empty lines.