tallforasmurf / PPQT

A post-processing tool for PGDP written in Python, PyQt4, and Qt
GNU General Public License v3.0
4 stars 2 forks source link

Use better para reflow algorithm #59

Closed tallforasmurf closed 12 years ago

tallforasmurf commented 12 years ago

Frau Sma points to the par package (www.nicemice.net/par) which has absolutely impenetrable doc and pretty C code that is not very helpfully commented -- anyway, it uses "dynamic programming" to achieve a nicer paragraph reflow than a simple greedy algorithm such as we are using. Problem with par is it has so many features (it can reflow boxed commentary, for example) and neither the doc nor the comments on its code describe its algorithm or methods.

In general the idea is to treat a paragraph as a whole (rather than token by token) and to minimize the sum of the squares of the difference in line lengths. If this could be reduced to e.g. pseudo-code it should be doable in pqFlow.

tallforasmurf commented 12 years ago

Knuth provides the general algorithm, for overview see http://en.wikipedia.org/wiki/Word_wrap . There's a perl implementation in CPAN. There is nothing in pypi that I can find (surprising). There is a python implementation at sourceforge (http://oedipus.sourceforge.net/texlib/).

However, Knuth's algorithm is for TeX and provides for the full range of typesetting abilities: "glue" (spaces) that can shrink or stretch; soft-hyphens so that candidate breakpoints can be within words; negative and positive "penalties" applied to some punctuation so as to encourage line breaks between clauses. Our problem in ASCII reflow is much much simpler. Fixed space sizes, breaks only at spaces.

The solution for a j-word paragraph using "dynamic programming" is O(j^2) but linear time algorithms have been published: links in the wikipedia article to purchase-only ($39) texts. Conceivably could look up at Stanford if wanted to. The wiki article on dynamic programming could be of help with study.

tallforasmurf commented 12 years ago

The original Knuth and Plass paper is here: http://onlinelibrary.wiley.com/doi/10.1002/spe.4380111102/abstract and the PDF can be downloaded. I'm working at decoding his pseudocode.

tallforasmurf commented 12 years ago

Fixed on commit # b06247 (and a subsequent change to pick off a bug).