Closed yhahn closed 10 years ago
That's what I thought too initially. Upon further examination, this turns out not to be such a good idea, in particular for Asian languages: Mandarin has around 20k glyphs, and a typical tile uses ~50-200 of them. They seem to be mostly random; there is no concentration in a few Unicode ranges. This means we'd have to download most of the font anyway (I measured ~5 MB to download all required font ranges for Mandarin).
(╯°□°)╯︵ ┻━┻
I'd like to get a better feel for the requirements here. Next actions for me are to set up a demo that "simulates" download requirements by inspecting the loaded vector tiles as you pan around a map. Idea is to make glyph block sharding variable and see if there is a sweet spot in terms of shard size that optimizes no. of requests + amount of data to download.
```
{ size: 724727, count: 1383, avgsize: 524.0253073029646 }
```
Basic idea is to put a rough heuristic cost on each glyph in the deflated PBF. This includes not just the texture cost but also the overhead of metadata in the PBF for glyph position, etc. The sample covers a broad range of glyphs -- tiles are from dense areas of SF, Tokyo, Shanghai, Tel Aviv and Berlin. The `avgsize` here is probably on the large side for estimating glyph-only tile size, because fontserver is also providing info for each feature string in the tile.
Using 524 bytes as the average size per glyph, ballpark sizes for different glyph tile shard sizes:
no. of glyphs | size (KiB) |
---|---|
128 | 65.5 |
256 | 131.0 |
512 | 262.0 |
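For reference, a minimal sketch of the arithmetic behind that table (the constant and helper names are mine, not from the codebase):

```python
AVG_GLYPH_BYTES = 524  # measured average deflated bytes per glyph (from the sample above)

def shard_kib(glyphs_per_shard):
    # estimated size of one glyph tile shard, in KiB (1024 bytes)
    return glyphs_per_shard * AVG_GLYPH_BYTES / 1024

sizes = {n: round(shard_kib(n), 1) for n in (128, 256, 512)}
# sizes == {128: 65.5, 256: 131.0, 512: 262.0}
```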
I'll be using these numbers to play around with simulated glyph tile download scenarios while panning around different parts of the world.
cc @kkaefer let me know if I'm way off here.
@kkaefer based on a totally rough count (20x20 glyphs, i.e. ~215 bytes per glyph) from https://f.cloud.github.com/assets/52399/1181283/e12a5092-2205-11e3-9ddf-e4fc20b22b39.png, it seems like the cost above is pretty large. This would mean to me that either the PNG compression is beating deflate by a lot, the overhead of the glyph position encoding is more than I would guess, or both.
What is the reason this information is encoded with each feature string, btw? It seems like it should be possible to look this up on the fly (but maybe I don't understand all the mechanics here):
https://github.com/mapbox/fontserver/blob/master/test/expected/shape.json#L703-L708
I think you're in a good spot with estimates. In my experience, tiles were about 30-40% larger with glyph images than without. PNG compression is zlib, but it employs additional tricks like filtering to get better run lengths and repetition.
The reason the glyph positions are included for every string rather than using the glyph advance is that for complex scripts that perform glyph substitutions, there is no 1:1 mapping from a character in the unicode string to a glyph.
While for Latin, Cyrillic and most East Asian scripts there usually /is/ a 1:1 mapping, scripts like Arabic break it: ا and ل in a Unicode sequence produce ﻻ when shaped, which is just one glyph (cf. http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=CmplxRndExamples for more background).
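The lam+alef case can even be checked against Python's Unicode tables: the isolated lam-alef ligature (one glyph) compatibility-decomposes back into the two underlying characters, showing there is no 1:1 character-to-glyph mapping here.

```python
import unicodedata

lam, alef = "\u0644", "\u0627"  # ل and ا
ligature = "\uFEFB"             # ﻻ ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

# NFKC maps the single presentation-form ligature back to the two-character sequence
assert unicodedata.normalize("NFKC", ligature) == lam + alef
assert len(ligature) == 1 and len(lam + alef) == 2
```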
After we get the shaping results, we could run the automatic glyph-advance-based shaper and check whether it produces the same result. If it does, we could drop the shaping information and use the same algorithm on the client.
Oh, another issue that shaping handles is bidirectional text: in RTL languages that use Arabic numerals (like Hebrew and Arabic) you have text runs that go right-to-left, then the numbers are written left-to-right, then the text continues right-to-left. Mapnik has code that handles this; Pango handles this in the shaping as well.
Initial impressions/numbers from playing around with a hacked llmr to estimate glyph tile request count + download size are not good. More than anything else, viewing anywhere outside the US very rapidly leads to a huge request count overhead on top of the tiles being downloaded (e.g. think 20+ additional GETs for an initial map load.) Per @kkaefer's original notes, the way glyphs are used just doesn't lend itself well to good sharding/clustering.
More notes:
Look at just straight-up reducing size next:
- bitmaps are the majority of our cost at ~80-90% of the overhead. We can leave messing with tighter metadata/encoding to later/never.
- the distance fields can be compressed better (think up to 20-30%) without loss of visual quality by reducing the granularity of the 0-255 value (e.g. steps of 2, 4, 8). At around 16 distinct values you start to notice artifacts.
No huge savings to be had here.
Lower priority on a base font glyph resource approach
Reversing this thought. Assuming we provide base glyphs for `0020-00FF`, the number of distinct glyphs per tile:
tile | distinct glyphs | distinct glyphs excluding 0-255 |
---|---|---|
berlin.vector.pbf | 108 | 4 |
sfo.vector.pbf | 86 | 3 |
shanghai.vector.pbf | 389 | 316 |
telaviv.vector.pbf | 124 | 52 |
tokyo.vector.pbf | 722 | 647 |
The character sets that will benefit most from this are `0020-04FF`, I think (up through Cyrillic). The glyph coverage problem is just the nature of the beast, especially in the CJK languages, and delivering glyphs on a one-off basis with tiles seems most efficient there.
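The "excluding 0-255" column in the table above can be approximated with a sketch like this, assuming the tile's label strings are available and treating each code point as one glyph (which, per the shaping discussion above, is only an approximation for complex scripts):

```python
def distinct_glyphs(strings, base_end=0x00FF):
    # hypothetical helper: count distinct code points across a tile's labels,
    # and how many fall outside the proposed base glyph set
    seen = {ord(ch) for s in strings for ch in s}
    return len(seen), sum(1 for cp in seen if cp > base_end)

total, beyond_base = distinct_glyphs(["Berlin", "Straße", "東京"])
# total == 12, beyond_base == 2 (only the CJK characters exceed 0x00FF)
```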
Conclusion: basically I did a round trip to what @kkaefer already recommended : )
Next actions:
Start with just `0-ff` as our base glyph set. We can come up with additional base glyph extensions through `0-4ff` if this is successful. It looks like @kkaefer has some commented-out code in fontserver for this where I can get started.
After spending a while getting myself oriented to the codebase and Node C++ addons in general, I managed to get an initial sketch of a base glyph pattern hacked into place. This is nowhere near being clean or good code, just a place to start from. What should next steps be in terms of testing versus optimization?
Tracking in:
/cc @yhahn @kkaefer
Initial observations:
Copying even a small 0-ff glyph library into each web worker looks to be adding significant overhead, increasing the main thread to worker callback delay from single digit milliseconds to often over 100ms and in some cases over 300ms.
False alarm -- I apparently broke something else implementing the `baseglyph` branch; tested a more stripped-down version and was back to single-digit millisecond times.
The stripped down version wasn't actually writing glyphs to the copied object. Doh.
Next steps:
@mikemorris We don't need to copy the glyph images to the workers, just the position information. That should be pretty fast.
@kkaefer Alright, the main hangup was that `glyphs` are expected to be on the `faces` object at https://github.com/mapbox/llmr/blob/master/js/text/placement.js#L228, and the stripping at https://github.com/mapbox/fontserver/blob/baseglyph/src/tile.cpp#L681-684 completely removes them from the protobuf. Gonna try attaching them after the worker returns.
Tested packing `rects` into typed arrays and sending them to the worker as transferable objects. This ended up being slower because of the increased overhead of unpacking the typed arrays into a usable structure.
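For context, the round trip being measured looks roughly like this (a Python stand-in for the JS typed-array code; the `(x, y, w, h)` rect layout is my assumption):

```python
from array import array

# hypothetical glyph atlas rects as (x, y, w, h) tuples
rects = [(0, 0, 12, 16), (12, 0, 10, 16)]

# pack into a flat Uint16-style buffer (the part that transfers cheaply to a worker)
packed = array("H", [v for r in rects for v in r])

# unpacking back into usable structures on the other side is the overhead observed above
unpacked = [tuple(packed[i:i + 4]) for i in range(0, len(packed), 4)]
# unpacked == rects
```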
Okay, so I was using an ASCII hack to build the 0-FF base glyph tile, and now I'm running into all sorts of confusion trying to expand the base glyph set to a larger Unicode range -- any suggestions?
/cc @springmeyer @artemp @kkaefer
https://github.com/mapbox/fontserver/blob/baseglyph/src/tile.cpp#L729-L732 looks very fishy to me. What is it supposed to do?
@kkaefer It's a super hacky way of iterating over a range of characters, that really only works for 0-128 ASCII. What I'm trying to do is iterate over a Unicode character range (say 0000-06FF for Latin, Cyrillic and Arabic) to build a base glyph set for each font in the stack. Haven't quite figured out how to get a reference to the actual PangoFont object, as it looks like the regular tiles are building the font list dynamically.
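One way to iterate a Unicode range more safely, sketched in Python as a stand-in (fontserver would do this in C++ against the font's charmap): skip code points Unicode doesn't assign at all, since those can never map to a real glyph.

```python
import unicodedata

def assigned_codepoints(start, end):
    # yield only code points Unicode actually assigns ("Cn" = unassigned)
    for cp in range(start, end + 1):
        if unicodedata.category(chr(cp)) != "Cn":
            yield cp

base = list(assigned_codepoints(0x0000, 0x00FF))
# all of Latin-1 is assigned, so len(base) == 256; larger ranges have holes,
# e.g. U+0378/U+0379 in the Greek block are unassigned
```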
Not quite sure what's causing this issue, but it looks like FreeType is returning an error when attempting to load many glyphs in the Greek+ Unicode ranges in Open Sans. The puzzling part is that the `glyph_index` has been validated by `g_unichar_validate`, and Open Sans is being picked by `pango_fontset_get_font` as the font that contains the best glyph (https://github.com/mapbox/fontserver/blob/8db40c458341b785b0001a66d86b632802025413/src/tile.cpp#L736).
Unicode Range | Size | Notes |
---|---|---|
0000-03A9 | 225 KB | Max range without FreeType error 6 `Invalid_Glyph_Index` |
FreeType Error Codes: http://www.freetype.org/freetype1/docs/api/freetype1.txt
`Invalid_Glyph_Index` was a result of passing `char_code` where `glyph_index` was needed.
Unicode Range | Size | Glyphs |
---|---|---|
0000-04FF | 301 KB | Latin, Greek, Cyrillic |
0000-06FF | 376 KB | Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic |
Why are all these extra fonts included in the PangoFontset?
```
Open Sans 24
Arial Unicode MS 24
Arial Unicode MS Bold 24
DINPro 24
GE Inspira 24
GE Inspira Small Caps 24
Helvetica LT Std 24
PT Sans Caption 24
Proxima Nova 24
League Gothic 24
TeX Gyre Heros 24
Source Sans Pro 24
Pompiere 24
Avenir 24
Freehand521 BT 24
Frutiger LT 45 Light, 24
Komika Parch 24
Visitor TT2 BRK 24
Avenir 24
Ubuntu Mono 24
Ubuntu Condensed, 24
Ubuntu 24
Crimson 24
Chennai Medium 24
Merriweather Light 24
Crimson Semi-Bold 24
Crimson Bold 24
Crimson Italic 24
```
I think the eventual goal should be to move fontserver to HarfBuzz/FreeType, the same stack as Mapnik and Firefox (Chromium/Blink uses HarfBuzz directly too, but still uses fontconfig as well). Pango simply isn't designed to offer enough control. All of these projects use FriBidi for bidirectional text.
The short-term goal, before removing Pango though, is to fix base glyph loading in `llmr` to ensure the base glyph set is loaded into the glyph atlas before tiles are rendered.
To be clear: The browsers may use Fribidi but Mapnik does not: we moved from fribidi to icu for bidi in 2008: http://mapnik.org/news/2008/02/20/mapnik_unicode/
fribidi was choking in multi-threaded rendering
Ah, because of the above you retained ICU bidi even after switching to HarfBuzz @springmeyer?
We first replaced fribidi with ICU. Because ICU only supports shaping for Arabic, we then added HarfBuzz for better shaping. For now we've kept ICU for a variety of things, notably text itemization, which uses the ICU bidi algorithm: https://github.com/mapnik/mapnik/blob/master/include/mapnik/text/itemizer.hpp
Welp, moved the `Font` class using Pango to `Pango_Font` and got an `FT_Font` class cobbled together (pulling in useful-looking pieces from Mapnik), only to discover that this class was being used in exactly one spot in `src/shaping.cpp`. It looks like `src/tile.cpp` is using `PangoFont` directly instead of the wrapper class; does it make sense to switch this to the new `FT_Font` class @kkaefer?
Work is in the `harfbuzz` branch.
As a follow-up, under what circumstances is that use case of `Font` actually triggered, @kkaefer? I added some logging and didn't notice that path ever running, even when panning around areas with shaped text.
@mikemorris There used to be a separate interface for just creating fonts and inspecting these fonts (like enumerating all glyphs in that font and getting their metrics) which I used for debugging purposes. We don't need that interface anymore.
Thanks @kkaefer. After all this wrangling I at least have a pretty solid understanding of FreeType now to tackle the rest.
Glyph ranges implemented, next steps for base + uncommon in https://github.com/mapbox/fontserver/issues/36
@kkaefer curious what you have in mind here. I at one point envisioned a "glyph tile" system where you'd download/cache whole chunks of Unicode ranges as you ran into them:
http://jrgraphix.net/research/unicode_blocks.php
This would separate the tiles fully from font glyphs, but not sure what other considerations should be involved here.