sciurius / perl-Text-Layout

Pango style markup formatting for PDF::API2, Markdown, Cairo and more

extents #10

Closed PhilterPaper closed 2 years ago

PhilterPaper commented 2 years ago

Separately from the word-wrap issue, I drew a rectangle around the sample output, using the x,y (top left), width, and height numbers returned by get_extents(). The x, y, and height appear to fit quite well, but the width extends all the way to the right margin (595 pt width). It appears to not be trimming the right side of the extent, but carrying it out to the full width (the sample text is intended to be centered in the desired width). Is that working as intended, or is something wrong? It doesn't seem to be very useful if it's carried to the full width, including lots of blank space.

Is get_size() always the same width and height as in get_extents()? I didn't have to scale any of the returned values before using them in PDF calls, so apparently they are in points. Are the "pixel" versions supposed to be scaled differently? If not, are they redundant in a PDF application?

PhilterPaper commented 2 years ago

Another thought on extents: sometimes it might be useful to know the text baseline position(s) and not just the overall height of the text. For word-wrapped text (paragraphs), there would be multiple baseline offsets, so it would probably need its own method to return an array. I see there is a get_baseline() method, but it only talks about the "first line" -- do you iterate through the lines to get the offsets?

sciurius commented 2 years ago

This is per Pango specs. When you consider a multi-line block of text then only the baseline of the first line makes sense.

sciurius commented 2 years ago

> The x, y, and height appear to fit quite well, but the width extends all the way to the right margin (595 pt width)

When I run e.g. tests/pdfapi1.pl I get boxes that precisely fit the text. See subroutine showbb in PDFAPI2.pm.

PhilterPaper commented 2 years ago

With some experimentation, it appears that hash element width is not really a width so much as it's the x value of the right edge (when not left-justified). In other words, if I want the true width of the text, I have to subtract x. However, height does appear to be the true height (at least when y is 0). It's non-intuitive, but I can work with it.

    my ($at_x, $at_y) = (0, 600);   # place text at PDF coordinates (0, 600)
    # get_extents() returned (x, y, w, h) = (155.3, 0, 439.7, 36)
    # with a field width of 595 and the text centered
    my ($x, $y, $w, $h) = (155.3, 0, 439.7, 36);
    # draw the bounding box from LL to UR; (x, y) is the UL corner
    rect($x + $at_x, $y + $at_y - $h, $w - $x, $h);
    # or from UL to LR ...
    rectxy($x + $at_x, $y + $at_y, $w + $at_x, $y + $at_y - $h);

    show($at_x, $at_y, $context);

Is the height the lowest descender to highest ascender in the font(s) (for these font size(s)) rather than the actual text being used? It appears to be the case, as when I change "quick" to "vuick" the height is unchanged, even though there are no descenders.

I need to experiment with y being non-zero.

In pdfapi1.pl, I had to update the font stuff for Windows, as ITC Garamond is not available by default, and most TTF fonts are kept in \Windows\Fonts\:

    $fd->add_fontdirs( "/Windows/fonts/" ); # there is no $ENV{'HOME'} and dir structure different
    $fd->register_font( "arial.ttf",    "Garamond"               );
    $fd->register_font( "arialbd.ttf",  "Garamond", "Bold"       );
    $fd->register_font( "ariali.ttf",   "Garamond", "Italic"     );
    $fd->register_font( "arialbi.ttf",  "Garamond", "BoldItalic" );

Arial is a serif font.

sciurius commented 2 years ago

> It's non-intuitive, but I can work with it.

There is an internal bbox function that, indeed, returns x, y, x+w, a.

I'll check the Pango specification to see whether the extents functions need to deliver an offset or a width.

> Is the height the lowest descender to highest ascender in the font(s) (for these font size(s)) rather than the actual text being used?

Yes. This is a limitation of PDF::API2 that only provides a font bounding box, and not per glyph.

> Arial is a serif font.

I'd say it is sans serif, but that doesn't really matter here.

sciurius commented 2 years ago

> it appears that hash element width is not really a width so much as it's the x value of the right edge

I've checked and this is wrong. It should be the width. I'll fix it.
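
Once the returned width is a true width, turning the extents into a PDF rectangle is only a coordinate flip. A minimal sketch, assuming a hash with keys x, y, width and height measured from the top left of the layout with y growing downward (y was only ever 0 in the experiments above, so the direction is an assumption); `extents_to_rect` is a hypothetical helper, not part of Text::Layout:

```perl
use strict;
use warnings;

# Hypothetical helper: map layout extents (origin top-left, y down, an
# assumption) to the (x, y, w, h) arguments of a PDF rect call
# (origin lower-left, y up).
sub extents_to_rect {
    my ( $ext, $at_x, $at_y ) = @_;
    my $llx = $at_x + $ext->{x};
    my $lly = $at_y - $ext->{y} - $ext->{height};
    return ( $llx, $lly, $ext->{width}, $ext->{height} );
}

# Sample numbers only: the earlier right edge 439.7 minus x 155.3 gives
# a true width of 284.4. The text is placed at PDF coordinates (0, 600).
my %ext  = ( x => 155.3, y => 0, width => 284.4, height => 36 );
my @rect = extents_to_rect( \%ext, 0, 600 );
printf "rect: %.1f %.1f %.1f %.1f\n", @rect;
```

The four returned values can be passed straight to a rect-style drawing call that takes a lower-left corner plus width and height.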

PhilterPaper commented 2 years ago

> > Arial is a serif font.
>
> I'd say it is sans serif, but that doesn't really matter here.

You're right. I thought I recalled it showing serifs when I ran it, but I just went back and looked at it again and it's showing a sans-serif face (presumably Arial, as I specified arial.ttf etc. for the type fonts). If you want a Windows serif font there, times.ttf etc., or georgia.ttf would do the job.

sciurius commented 2 years ago

I've fixed issue #10 and made some other minor (I hope) modifications.

Attached are two Perl programs, pdfapi1.pl and pango1.pl, that are for the most part identical and produce (near) identical output. (For visual inspection only; I need more tests before I check in the changes.)

Attachments: pango1.pl.txt, pdfapi1.pl.txt, pango1.pdf, pdfapi1.pdf

One of these days I'm going to start on the multi-line handling.

PhilterPaper commented 2 years ago

  1. Does using PDF::Builder produce the same results as PDF::API2? In theory they should, but it would be nice to catch any discrepancies early.
  2. I notice that the sizes of glyphs and x/y offsets are a little bit larger in pdfapi1.pdf than in pango1.pdf. There may be a very small scaling effect that needs to be accounted for.
  3. Is your showbb() method now using the corrected width value internally? You don't appear to be using it in the Perl examples.

Let me know if it would be useful to you to run any of your code on Windows. I don't have the ITC Garamond font available, so we would have to settle on something available on both Windows and Linux.

sciurius commented 2 years ago

  1. Yes, PDF::Builder produces identical results.
  2. It turns out that PANGO_UNITS is not 1000, but 1024. With the latter value I get near identical results (see below).
  3. Yes. get_extents now returns x, y, width and height as it should.
    PDF::API2:
    EXT: 0.00 0.00 436.10 66.54
    EXT: 158.90 0.00 436.10 66.54
    EXT: 79.45 0.00 436.10 66.54
    PDF::Builder:
    EXT: 0.00 0.00 436.10 66.54
    EXT: 158.90 0.00 436.10 66.54
    EXT: 79.45 0.00 436.10 66.54
    Pango:
    EXT: 0.00 0.00 437.00 67.00
    EXT: 158.00 0.00 437.00 67.00
    EXT: 79.00 0.00 437.00 67.00
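
The scale correction in item 2 comes down to a single constant. A minimal sketch of the conversion, assuming one device unit corresponds to one PDF point; `pango_to_pt` and `pt_to_pango` are hypothetical helper names, and only the value 1024 comes from the Pango documentation:

```perl
use strict;
use warnings;

use constant PANGO_SCALE => 1024;    # Pango units per device unit

# Convert between Pango units and points (assumes 1 device unit = 1 pt).
sub pango_to_pt { return $_[0] / PANGO_SCALE }

# Round to the nearest Pango unit; intended for non-negative values.
sub pt_to_pango { return int( $_[0] * PANGO_SCALE + 0.5 ) }

my $units = pt_to_pango(436.10);
printf "%d Pango units = %.2f pt\n", $units, pango_to_pt($units);

# Dividing by 1000 instead of 1024 inflates every value by 2.4%.
printf "with the wrong scale: %.2f pt\n", $units / 1000;
```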

PhilterPaper commented 2 years ago

One other value to check would be 1016, which is 40 units per mm. I ran into this conversion in a graphics system many moons ago where the resolution was 1016 per inch, which in turn came from 40 per mm. Just something to check.

sciurius commented 2 years ago

In this case, I consider this authoritative: https://docs.gtk.org/Pango/const.SCALE.html

I'm not sure how I could have missed this in the first place ;).

PhilterPaper commented 2 years ago

Maybe it's just an exercise in numerology, but 67.00/66.54 = 1024/1017 (a hair larger than 1016). Are the extents you gave 3 days ago supposed to represent the same thing, e.g., points? 67.00/66.54 = 72.50/72.00, which (72.5 per inch) isn't close enough to Didot points (72.27, IIRC, or perhaps those are TeX Points) or any other standard Point. I'm just trying to figure out if Pango is using something other than the standard Big Point (72 per inch, used by PDF) for a size.

sciurius commented 2 years ago

For the time being I believe it's rounding errors. Cairo uses a scaling of 1024. Why? Because it can do integer arithmetic while retaining sufficient accuracy. Perl does real arithmetic, and this can cause small differences, especially with cascading calculations. I have reconstructed most of the font calculations that Pango performs, and I almost have a reliable extents function to be added to PDF::API2/Builder.
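
The rounding explanation can at least be tested against the numbers posted earlier in this thread. The sketch below rounds each PDF::API2 rectangle outward to whole units (floor the left and top edges, ceil the right and bottom edges). Whether Pango actually rounds this way here is an assumption, not something confirmed in the discussion, and `round_outward` is a hypothetical helper:

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

# Round a rectangle outward to whole units: floor the left/top edge,
# ceil the right/bottom edge. The epsilon absorbs floating-point noise.
sub round_outward {
    my ( $x, $y, $w, $h ) = @_;
    my $eps = 1e-9;
    my $x0  = floor( $x + $eps );
    my $y0  = floor( $y + $eps );
    my $x1  = ceil( $x + $w - $eps );
    my $y1  = ceil( $y + $h - $eps );
    return ( $x0, $y0, $x1 - $x0, $y1 - $y0 );
}

# The PDF::API2 extents from the listing earlier in the thread.
my @extents = (
    [ 0.00,   0, 436.10, 66.54 ],
    [ 158.90, 0, 436.10, 66.54 ],
    [ 79.45,  0, 436.10, 66.54 ],
);
printf "EXT: %.2f %.2f %.2f %.2f\n", round_outward(@$_) for @extents;
```

If the printed values match the Pango lines of that listing, whole-unit rounding is enough to explain the difference and no exotic point size is needed.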

sciurius commented 2 years ago

FYI, I added a get_extents method to HarfBuzz::Shaper.

=head2 $info = $hb->get_extents

Get the extents of the (shaped) buffer.

Upon completion an array of hashes is returned with one element for
each glyph.

The hash contains the following items:

    x_bearing; Distance from the x-origin to the left extremum of the glyph.
    y_bearing; Distance from the top extremum of the glyph to the y-origin.
    width;     Distance from the left extremum of the glyph to the right extremum.
    height;    Distance from the top extremum of the glyph to the bottom extremum.
    g:         glyph index in font (CId)

=cut
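
Per-glyph extents like these can be folded into a single ink box once the advances are known. A minimal sketch with made-up numbers, assuming y grows upward from the baseline and that advances come from the shaper's ax values; `ink_box` is a hypothetical helper, not part of HarfBuzz::Shaper:

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: combine per-glyph advances and extents into one
# ink box. Assumes y grows upward from the baseline and uses abs(height),
# since HarfBuzz typically reports glyph heights as negative numbers.
sub ink_box {
    my ( $advances, $extents ) = @_;
    my $pen = 0;
    my ( @left, @right, @top, @bottom );
    for my $i ( 0 .. $#$extents ) {
        my $e = $extents->[$i];
        push @left,   $pen + $e->{x_bearing};
        push @right,  $pen + $e->{x_bearing} + $e->{width};
        push @top,    $e->{y_bearing};
        push @bottom, $e->{y_bearing} - abs( $e->{height} );
        $pen += $advances->[$i];
    }
    return {
        x0 => min(@left),   x1 => max(@right),
        y0 => min(@bottom), y1 => max(@top),
    };
}

# Made-up values for two glyphs, in font units.
my @ax  = ( 500, 600 );
my @ext = (
    { x_bearing => 40, y_bearing => 700, width => 420, height => -700, g => 11 },
    { x_bearing => 30, y_bearing => 480, width => 510, height => -690, g => 12 },
);
my $box = ink_box( \@ax, \@ext );
printf "ink box: x %d..%d, y %d..%d\n", @{$box}{qw(x0 x1 y0 y1)};
```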

PhilterPaper commented 2 years ago

Sounds interesting. Since HS doesn't put down ink, it would often not be useful to give the full extents of an HS chunk, as it could be split across two or more lines. HS doesn't get involved with line-splitting. You would have to wait until some other library or routine (e.g., Text::Layout, Text::KnuthPlass) has split an HS chunk into lines, before you can go through the glyphs and find the extents. One important thing to keep track of is the vertical extent above and below the baseline, so that you can keep track of different chunks wanting to move the text baseline up or down (you of course want to end up with a single baseline per line of text). For vertical writing systems, I don't know if there is an equivalent vertical baseline with horizontal extents. Lots of things to keep track of!

I still haven't figured out what to do about ligatures versus word-splitting (Text::Hyphen or equivalent, via Text::KnuthPlass). If you call HS first (replacing runs of individual characters with ligatures), word-splitting may not work (unable to handle ligatures, or will try to split in the wrong place or even through a ligature). Word-splitting first would be a problem, as the word length needs to be known, and ligatures usually reduce the word length, possibly shifting where a word is split. Have you any thoughts or best practices on this?

sciurius commented 2 years ago

Sounds a bit like the classical footnote dilemma (the footnote doesn't fit on the page that refers to it). I'm personally not very much into word splitting. Every day I read the newspapers I see words that are hyphenated wrongly, and it is getting worse (here) now that the Dutch language is absorbing words from foreign languages, most notably English. The TeX approach of discretionary hyphens has proven effective, but it is essentially a manual process. Intuitively I would suggest an approach where you add one glyph (after processing ligatures) at a time, keeping track of break points. I'm not familiar with breaking in the middle of ligatures.
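
One possible reading of the glyph-at-a-time suggestion is a greedy filler that remembers the last allowed break point. This is only a sketch under that interpretation; `fill_lines` and the glyph hash layout (`w` for advance width, `break` for "a break may follow") are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical greedy filler: add glyphs one at a time, remember the last
# allowed break point, and cut the line there when the width is exceeded.
# Returns a list of [first_index, last_index] pairs, one per line.
sub fill_lines {
    my ( $glyphs, $max_width ) = @_;
    my @lines;
    my ( $start, $width, $last_break, $width_at_break ) = ( 0, 0, -1, 0 );
    for my $i ( 0 .. $#$glyphs ) {
        $width += $glyphs->[$i]{w};
        if ( $width > $max_width && $last_break >= $start ) {
            push @lines, [ $start, $last_break ];
            $start = $last_break + 1;
            $width -= $width_at_break;    # keep only the carried-over glyphs
        }
        if ( $glyphs->[$i]{break} ) {
            $last_break     = $i;
            $width_at_break = $width;
        }
    }
    push @lines, [ $start, $#$glyphs ] if $start <= $#$glyphs;
    return @lines;
}

# Every glyph is 10 units wide here; a break is allowed after each space.
my @glyphs = map { +{ w => 10, break => ( $_ eq ' ' ) } } split //, 'ab cd ef';
printf "line: glyphs %d..%d\n", @$_ for fill_lines( \@glyphs, 50 );
```

A line with no break point at all simply runs over; handling that case is where hyphenation or discretionary breaks would come in.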

Using PDF::API2 (or Builder) I still have a hard time reliably getting at the glyph metrics. For CIDfonts (including TTF and OTF) I use

    my $glyphs = $self->fontobj->{loca}->read->{glyphs};
    ....
    foreach $glyph ( ... ) {
            my $e = $glyphs->[$glyph];

This should give me the glyph bounding box, but often it just gives an empty hash. I'm inclined to skip the ink extents for non-HarfBuzz cases.

sciurius commented 2 years ago

BTW, what do you think is most useful:

  1. method shaper gives the ax, ay, dx and dy and an additional method call get_extents fetches the x_bearing, y_bearing, width and height
  2. method shaper gives all ax, ay, dx, dy, x_bearing, y_bearing, width and height
  3. method shaper gives the ax, ay, dx and dy and, if a particular feature*) was supplied, also x_bearing, y_bearing, width and height

Approach 2 will add a small performance penalty to the shaper call, negligible compared with the penalty for an additional method call. OTOH, most of the time the glyph metrics will not be required.

*) How to decide what feature and make sure it never collides with existing or future features.

PhilterPaper commented 2 years ago

Since a footnote can (usually) be split and the excess spilled to the next column (under a separator), the problem is really only with callouts appearing on the last line (or last two lines, if footnote separator line is used), where there is no room for even the first line of the footnote. In that case, you simply have to eat a shorter column (perhaps increasing the leading a bit to extend the column) and move on to the next column.

Good luck with the Dutch language absorbing English words. Turnabout is fair play, as I dine on my spick and span yacht while eating cole slaw and munching on a cookie, all while listening to The Boss. If it makes you feel any better, the British and Americans often can't agree on where to hyphenate (split) a word, as they use somewhat different rules (in addition to different spellings and even slightly different grammar). By the way, I understand that Dutch (and maybe German) has some odd word-splitting rules about adding or changing letters at the split (English doesn't).

Note that the Knuth-Plass paragraph-shaping algorithm tries hard to avoid splitting words, but sometimes it's necessary. The Knuth-Liang word-splitting algorithm is supposed to be pretty good, but even it needs a short dictionary of exceptions to override the algorithm. In TeX, you can mark discretionary splits (or forbid splits), but that's a lot of labor and should only be needed sparingly. I'm not quite sure what you're talking about with "adding one glyph at a time and keeping track of break points". Maybe after my morning coffee has kicked in, I'll see your point.

Regarding split ligatures, a word might naturally split between two t's, but HS has decided to replace "tt" with a ligature (assuming the font has one). So, do you go ahead and split at the previous or next opportunity, leaving the "tt" ligature, or do you back out the ligature and allow a split between the t's (and no ligature)? Or something else?

Maybe you were suggesting marking ligature replacements (by HS), feeding the un-ligatured word to Knuth-Liang (if splitting is possibly needed), and if it wants to pick the hyphenation point between the t's, back out the ligature? The glue (blank) lengths may have to be tweaked a bit to get full justification, but it should be minor. In languages where you have to add or change letters at the split, it might be a problem.

Did you find a bug where glyph boxes aren't being returned by the font routines? Does this happen only on certain fonts, and does it happen with all glyphs or just some? I'll have to go back and look, but doesn't HS return each glyph's extents? Does that work any better, or am I thinking of the x-y placements?

I'm not sure what you're getting at with your second post. I'll have to think about it. Can you elaborate a bit on what you're trying to do?

sciurius commented 2 years ago

> odd word-splitting rules about adding or changing letters at the split

I think you mean cases where a diaeresis is used when a word is suffixed with en to make it plural. E.g. stoel + en → stoelen. If the word ends with e, a diaeresis is added: ree + en → reeën. Hyphenation goes in the opposite direction: reeën → ree + en. In TeX: ree\discretionary{-}{en}{ën}. A traditional hyphenation dilemma is valkuil, which can mean valk-uil (Ninox ios, a bird) or val-kuil (trapping pit). You need to understand the context to hyphenate.
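
The TeX discretionary above carries three pieces of text: what ends the line if the break is taken, what starts the next line, and what is typeset if no break occurs. A minimal sketch of that idea as a data structure; the hash keys `pre`, `post`, `nobreak` and the `render` helper are hypothetical names chosen for illustration:

```perl
use strict;
use warnings;
use utf8;

# A discretionary break modelled on TeX's \discretionary{pre}{post}{nobreak}:
# 'pre' ends the line, 'post' starts the next, 'nobreak' is used if unbroken.
my @word = ( 'ree', { pre => '-', post => 'en', nobreak => 'ën' } );

# Render a word unbroken, or broken at the discretionary with index $at.
sub render {
    my ( $items, $at ) = @_;
    my @lines = ('');
    my $n     = 0;
    for my $item (@$items) {
        if ( !ref $item ) { $lines[-1] .= $item; next }
        if ( defined $at && $n++ == $at ) {
            $lines[-1] .= $item->{pre};
            push @lines, $item->{post};
        }
        else { $lines[-1] .= $item->{nobreak} }
    }
    return @lines;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( ' | ', render( \@word ) ), "\n";       # unbroken form
print join( ' | ', render( \@word, 0 ) ), "\n";    # broken at the first discretionary
```

Because the unbroken and broken forms are stored separately, the diaeresis disappears automatically when the break is taken.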

> So, do you go ahead and split at the previous or next opportunity, leaving the "tt" ligature, or do you back out the ligature and allow a split between the t's (and no ligature)

Assume a hypothetical word 'naem' that will be typeset as 'næm'. I use it since the distinction between ae and æ is visible. First you ask for the width of the word 'naem' and you will get the width of 'næm'. If it fits, continue. If it does not fit, hyphenate into na - em.
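
The measure-first, hyphenate-only-if-needed procedure could look roughly like this. The advance widths are made up and `shaped_width` is only a stand-in for a real shaper call; `place_word` is likewise a hypothetical helper:

```perl
use strict;
use warnings;
use utf8;

# Made-up advance widths; a real program would ask the shaper instead.
my %width = ( n => 6, a => 5, e => 5, m => 9, 'æ' => 8, '-' => 3 );

# Stand-in for shaping: apply the ae ligature, then sum the advances.
sub shaped_width {
    my ($text) = @_;
    ( my $shaped = $text ) =~ s/ae/æ/g;
    my $w = 0;
    $w += $width{$_} for split //, $shaped;
    return $w;
}

# Keep the word whole if it fits; otherwise break at the given position,
# which also separates the two letters of the ligature again.
sub place_word {
    my ( $word, $room, $break_at ) = @_;
    return ($word) if shaped_width($word) <= $room;
    return ( substr( $word, 0, $break_at ) . '-', substr( $word, $break_at ) );
}

print join( ' | ', place_word( 'naem', 30, 2 ) ), "\n";    # enough room
print join( ' | ', place_word( 'naem', 20, 2 ) ), "\n";    # too little room
```

A fuller version would also check that the first fragment, including its hyphen, fits in the remaining room.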

I'm still struggling with the glyph boxes but I now seem to get the correct results with both HS and the CIDfont/TrueType stuff from PDF::API2/Builder. I'll keep you posted.

PhilterPaper commented 2 years ago

The word-splitting rule I was thinking about was something like "Drucker" being split "Druck-ker" or "Druk-ker" or something like that. I just vaguely remember something of that from an article I read long ago. You would know much better. It might even be obsolete orthographic rules.

English can have context-sensitive word-splitting, such as "record" splitting as "re-cord" (creating a permanent copy of something) or "rec-ord" (that resulting permanent copy). You need a deep understanding of the text to know which is being used. I think Knuth-Liang (Text::Hyphen and TeX::Hyphen) may sidestep the issue by forbidding splitting of that word. It also forbids hyphenation if the front fragment's pronunciation would be unknown until the rest of the word is seen (e.g., present or project).

Your "naem" example sounds correct. If it still needed hyphenation after ligature replacement, the ligature would have to be backed out to use the normal word-splitting. Now, if there were multiple ligatures in a word, that could get complicated. Maybe you could back out all ligatures, split the word as necessary, and reinsert all ligatures except the one (possibly) split?
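
The back-out-and-reinsert idea can be sketched at the string level. The ligature list is hypothetical (whether a font has a "tt" ligature is an assumption carried over from the discussion), and brackets merely mark where a ligature glyph would be substituted:

```perl
use strict;
use warnings;

# Hypothetical ligature table, longest sequences first.
my @ligatures = qw(ffi ff fi tt);

# Mark ligatures in a fragment by wrapping them in brackets. Each
# fragment is handled alone, so a ligature can never span the break.
sub mark_ligatures {
    my ($fragment) = @_;
    my $alt = join '|', map { quotemeta } @ligatures;
    ( my $marked = $fragment ) =~ s/($alt)/[$1]/g;
    return $marked;
}

# Split the plain (un-ligatured) word at $break_at, then re-apply
# ligatures to each half independently.
sub split_and_religate {
    my ( $word, $break_at ) = @_;
    my @halves = ( substr( $word, 0, $break_at ), substr( $word, $break_at ) );
    return map { mark_ligatures($_) } @halves;
}

print mark_ligatures('outfitted'), "\n";                           # unsplit word
print join( ' - ', split_and_religate( 'outfitted', 6 ) ), "\n";   # split between the t's
```

Working from the plain text and re-shaping each half sidesteps the bookkeeping of which ligature to back out, at the cost of shaping the word twice.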

If you have any intent for introducing hyphenation into Text::Layout, perhaps this part of the conversation should be moved into its own topic? It's straying pretty far from glyph extents. Keep in mind that for use with PDF::Builder, I think it would only be using the HTML/Pango markup to change font characteristics, and not to put ink down at this point (unless you also plan to do proper paragraph shaping with arbitrary column shapes).

sciurius commented 2 years ago

A typical case where the ink bbox is relevant is when you want to add backgrounds/borders.

[Screenshot: scrot20211222080745]

For the time being I'm not planning to add hyphenation, for many reasons. In the long(er) term I still want to try to integrate Pango. It should be possible to create a pseudo-surface that just collects the Pango instructions to place the glyphs, resulting in an array of glyph id + coordinates for the complete text of a (multi-line, filled, adjusted, hyphenated) Pango layout.

Raku is already doing a lot of work in this direction.

PhilterPaper commented 2 years ago

If Pango can do the stuff that Knuth-Plass can do, I may not bother with fixing and upgrading Text::KnuthPlass. Do you anticipate

  1. high quality line-splitting (hopefully using KP paragraph shaping)
  2. irregular columns (not just rectangular columns, but with cutouts, inserts, and floats)
  3. use of HarfBuzz::Shaper to support ligatures and complex scripts
  4. some means to balance columns
  5. eventual support of dropped capitals, word/line Small Caps, etc. (fairly trivial: DC means a variable length indent over several lines, SC should be just a font change)

For hyphenation (word-splitting), I anticipate having to roll my own Knuth-Liang module that improves upon Text::Hyphen and TeX::Hyphen by allowing switching among multiple dictionaries on the fly, and easy updating of the hyphenation pattern and exception tables from central repositories. If anyone is interested in taking this on as a project, let me know.

If you've never seen it, take a look at the product ($$) Prince for ideas and inspiration. They may be well ahead of us, but it's not free.

sciurius commented 2 years ago

I don't know if Gnome/Pango has any interest in adopting KP paragraph shaping. KP is mostly interesting/relevant/needed when you can hyphenate and/or have floats. As far as I know the goal of Pango is to fill rectangular areas with text content that may consist of several different styles, colours, languages and so on. HarfBuzz adds ligatures, language scripts and writing directions.

While experimenting with Text::Layout in preparation of multi-line paragraph support I feel I'm re-inventing code that has already been fleshed out and in operation for a while. Re-implementing a small subset (as Text::Layout currently provides) is nice for certain tasks but when it comes to more complex situations I'd rather like to use real Pango instead. That was one of the reasons I adopted the PangoLayout API as closely as possible.

To get the best of both worlds, we would need a PDF::API2/Builder surface for Pango/Cairo. But developing such a beast is probably quite hard. I have no idea, but it seems that no one is developing surfaces at all.

On the other hand, PDF::API2/Builder are lagging behind. No support for complex CID fonts, WOFF fonts, writing directions, attachments, forms, fontconfig, ... So you may ask if that is the way to go.

I estimate that I can migrate the vast majority of my PDF-producing tools (or PostScript; I still have a lot of them too) to Cairo/Pango pdfSurface with limited effort. For a few tools I require features that pdfSurface does not (yet?) provide. For these, the current Text::Layout is sufficient.

PhilterPaper commented 2 years ago

> On the other hand, PDF::API2/Builder are lagging behind. No support for complex CID fonts, WOFF fonts, writing directions, attachments, forms, fontconfig, ...

Still true for PDF::API2, but PDF::Builder does support HarfBuzz::Shaper for complex fonts and non-LTR writing directions. No idea whether WOFF is supported (would that need a new WOFFFont system like ttfont, psfont, corefont, etc.?). PDF::Builder handles attachments. Forms are still pretty weak. fontconfig... I'm hoping that Text::Layout can help with this.

PhilterPaper commented 2 years ago

As the original extents problem was fixed, and this discussion wandered off into hyphenation, I'll go ahead and close it. Incidentally, an improved version of Text::Hyphen or TeX::Hyphen would be quite desirable:

  1. easy update of the rules and exceptions files from a central source (CTAN)
  2. ability to switch on-the-fly among languages
  3. extend standard rules and exceptions with your own entries

If someone is looking for a project, this might be a good one. I'll get to it some day, if no one else does.