simoncozens / atokern

Neural network based font kerning
85 stars 7 forks source link

Spacing & kerning at once. #3

Open skosch opened 6 years ago

skosch commented 6 years ago

Hi Simon,

Amazing project – I tried something like this many years ago but the tooling around neural nets was so poor that I didn't get anywhere. In my opinion, you're going about this exactly the right way, so I'm really excited I found this. I want to bring up something that was already mentioned in the Typedrawers thread: the idea of doing letter-fitting in one go, i.e. spacing and kerning at the same time.

You can use a linear programming model to turn the resulting glyph distance pairs into a set of sidebearings plus a kerning table that's minimized by the sum of absolute or squared kerns (i.e. what I did in skosch/fittingroom). You also get around ethical complaints, because your training data is now based on total glyph distances, not raw sidebearing and kerning values, which amounts to learning from printed matter instead of pirated font file data.

If you're worried (as you mentioned in your TD post) about unusual pairs like /t/Q messing up your training data ... maybe you can go through a few big corpora and identify pair frequencies and then drop uncommon pairs from the training data (or put them into low-learning-rate batches), as those are the ones designers would probably ignore.

I'd love to learn what your plans are for this project. Looking forward to your comments.

simoncozens commented 6 years ago

Hi Sebastian! Thanks for the feedback!

I think spacing and kerning together is worth looking at, but it's not going to be my concern; someone else can do it. The problem is that in our current model of OpenType fonts, we need to write out the spacing and kerning tables separately anyway, and given that we have autospacers already I want to work on the piece of the problem that isn't yet solved. I did write some code a year or so back which uses the linear programming approach to "compress" kern tables by adjusting the spacing to maximize the number of zero pairs, but I think that's a slightly different problem. Anyway, I need to look t fittingroom again.

I saw Fitting Room a while back and it looks really interesting, although I hadn't realised quite how interesting. Using feature extraction to create kerning groups is something that would both simplify what I'm doing and make sense of the output - at the moment the output is a kern value for each glyph pair, and that's clearly going to blow your kern table limits. I don't currently have a plan for how to turn that into kern groups. I need to look into what you're doing a lot more.

There is some code in atokern for adjusting the sample weights by bigram frequency. I've just started using it again.

Recently I've started to get better and better results from atokern, for a variety of reasons. I first realised the data needs to be shuffled before training, or otherwise the system learns all about one font, then fails utterly when it gets handed the next font. Second, I realised you have to put a cap on the number of kern pairs per font or otherwise things similarly get unbalanced - one font with 200,000 pairs will have more influence than one with 4,000. I'm also currently trying out (but haven't committed to git yet) a new network architecture which takes into account the fact that the categories are ordinal. I've been training for a day and just hit 93% validation accuracy, which is really nice. And it's still going.

Of course, 93% accuracy means that if you kern A-Z against a-z you still get 47 pairs wrong. And I can give no guarantees about which ones they are. So you still need to verify all your pairs anyway.

Life is hard.

skosch commented 6 years ago

given that we have autospacers already

I mean, the output of your net already includes whatever sidebearings you started out with, so in a way you're very much already doing spacing + kerning together. Shifting your perspective (away from "learn from kern pairs to generate kern pairs" to "learn from total distances to generate total distances") is still worthwhile though because it naturally keeps you from running into problems like ...

one font with 200,000 pairs will have more influence than one with 4,000.

If I understand correctly, you're effectively training your net on all the "weird" pairs right now (the ones with kern pairs) but missing out on all the great training data that is non-kerned pairs – both of which are equally laden with information about what humans find visually pleasing. I bet that training your net on the 10000 most common bigrams in every font, kerned or not, will get you to better accuracy.

I mean, as you've already found out,

at the moment the output is a kern value for each glyph pair, and that's clearly going to blow your kern table limits

you will have to resort to some sort of mathematical optimization anyway – so I don't see the additional downside. Because any net will optimize for glyph distances anyway and not for effective OT tables.

Anyway, I need to look t fittingroom again.

I don't think you need to look too closely at that project – it was just an experiment, really. However, it may be valuable to think about

1) moving away from arrays of horizontal-side-to-outline distances, and towards some sort of holistic, wavelet-encoded representation of the overall shape of the glyph pair, in hopes of approximating more closely the response to text "colour" or "rhythm" in our visual cortex; 2) designing an optimization model that doesn't just maximize zero pairs, but maximizes (within some tolerance) classification into OT kern classes. That's a much, much harder combinatorial problem unless you take some shortcuts. If you're interested I'm happy to discuss this further.

Ultimately it'd be extremely cool if we gained enough insight into the net's behaviour to build an evidence-based theory of letterfitting that fits our understanding of human visual processing. For instance, think about how optical sizing relates to peak spatial frequency responses: in the limit, large-size text is best fitted by equalizing the Euclidean distance between outlines, while very small text is best fitted by equalizing stem distances – it may not be too far-fetched to hope for a unified theory.

skosch commented 6 years ago

(P.S. I hope I will find some time soon to play with your code myself. Talk is cheap, I know!)

simoncozens commented 6 years ago

If I understand correctly, you're effectively training your net on all the "weird" pairs right now (the ones with kern pairs) but missing out on all the great training data that is non-kerned pairs – both of which are equally laden with information about what humans find visually pleasing.

Yes and no. I'm currently learning from all the kern values of [A-Za-z0-9,.]/[A-Za-z0-9,.] whether they are defined or not, and then a few thousand "extra" defined pairs found by going through the kerndump list.

The problem with looking at non-kerned pairs is that you never know if the designer deliberately left גT unkerned because it already fits just fine, or if there's no kern value there because the designer never thought to kern that particular pair. That's why I only trust the zero values of "safe" pairs like [A-Za-z0-9,.]/[A-Za-z0-9,.]

But then, the number of unkerned pairs in the safe glyphs is normally already pretty high (typically around 80%) so you need to fill up with a few weird kerns or else the network just learns that a kern value of zero is pretty good for most circumstances.

you will have to resort to some sort of mathematical optimization anyway – so I don't see the additional downside. Because any net will optimize for glyph distances anyway and not for effective OT tables.

What would be an interesting (and possibly necessary) thing to do is get a neural network to first organise your glyph list into kern groups for you. I just don't know enough about clustering to do that. :(

skosch commented 6 years ago

The problem with looking at non-kerned pairs is that you never know if the designer deliberately left גT unkerned because it already fits just fine, or if there's no kern value there because the designer never thought to kern that particular pair. That's why I only trust the zero values of "safe" pairs like [A-Za-z0-9,.]/[A-Za-z0-9,.]

What kind of barbarians don't kern גT? I use that all the time :smile: I get your point though, I wasn't thinking of those extra-weird ones. Your approach seems sensible.

or else the network just learns that a kern value of zero is pretty good for most circumstances.

That's unfortunate. I suppose this also means that if you did turn it into a total-distance regression problem, the network would learn that roughly-average distances are pretty good most of the time?

Perhaps the input data is really just too far removed from what's salient to a designer's eye. I'm not entirely surprised, actually: I have spent an entirely unreasonable amount of time, during my late teens and early twenties, analyzing glyph distance data trying to crack this puzzle. Getting to ~90% accuracy was always easy, but passing the visual Turing test was always impossible. I hope I'm wrong, but it really may be necessary to look at this problem in frequency space, and to look at n-grams of 3 or more.

Life really is hard. That being said, all of that will be so much easier to experiment with now that we have atokern as a starting point.

What would be an interesting (and possibly necessary) thing to do is get a neural network to first organise your glyph list into kern groups for you. I just don't know enough about clustering to do that. :(

Well, scikit-learn does it for you for free if you're willing to rasterize the outlines :)

simoncozens commented 6 years ago

I hope I'm wrong, but it really may be necessary to look at this problem in frequency space, and to look at n-grams of 3 or more.

Well, in a sense I do look at n-grams, in that I always feed in the outlines of "Ho", so that the system has a bit of context to play with. This also encodes default spacing for straights and rounds, so it should work to pull spacing data out of this thing as well.

Neural networks often have a hard time with regression. I'm dealing with it by binning the kern pairs into categories of relative distance: "this pair is kerned by somewhere between 3%-5% of an em width", etc.

Well, scikit-learn does it for you for free if you're willing to rasterize the outlines :)

Hmm, worth a look, definitely.

I don't have a GPU or a particularly powerful machine, so my experiments tend to take weeks; that's my major source of frustration with all this.

skosch commented 6 years ago

it should work to pull spacing data out of this thing as well.

True, for sure! What I mean though is that the horizontally-scanned contour samples fail to capture a good chunk of the shape "feel" of the glyphs: counters, stem contrast differences, curves, serifs – all of which factor into the overall too-tight/too-lose decision, but which the network literally can't see.

Also, when you feed in the outlines of "Ho", does that mean that you feed them in alongside every kern pair, to make up for the overall tight-ness/loose-ness of that font?

Neural networks often have a hard time with regression. I'm dealing with it by binning the kern pairs into categories of relative distance: "this pair is kerned by somewhere between 3%-5% of an em width", etc.

I mean ... given such small bins, 93% is actually pretty impressive. You know what, it would be super interesting to visualize the errors though. Can you make a scatterplot of bin as decided by the designer vs bin as decided by atokern, with one point for every pair? That would tell us more about what the 93% accuracy really mean. Do the errors cluster around zero or are they totally random?

I don't have a GPU or a particularly powerful machine, so my experiments tend to take weeks; that's my major source of frustration with all this.

Me neither :( most clustering methods are really fast though, compared to training deep nets.

simoncozens commented 6 years ago

Also, when you feed in the outlines of "Ho", does that mean that you feed them in alongside every kern pair, to make up for the overall tight-ness/loose-ness of that font?

Yep, the input to the network is "rightofl", "leftofr", "rightofo", "leftofH". (where "l" and "r" here mean the actual kern pair, not the letters l and r!) Each of these inputs is chewed on separately by a convolutional layer, and then the results are concatenated together.

I mean ... given such small bins, 93% is actually pretty impressive.

I don't know if it's literally 3-5%. I'm using 25 bins. They're not of equal size; the bins are smaller more around the origin, e.g. one bin is -0 <-> -5 units, one is -30 <-> -35, one is -100 <-> -150

Can you make a scatterplot of bin as decided by the designer vs bin as decided by atokern, with one point for every pair?

Good idea! atokern-acc

Looks like I'm still missing a lot of pairs that should be kerned but aren't.

simoncozens commented 6 years ago

That was for one font in the validation set. Here's for the whole set (n=29575): atokern-acc

simoncozens commented 6 years ago

(In case you're wondering, the outlier was the pair "Q," in Optima Nova Bold Italic. Akira was right and atokern was wrong.)

skosch commented 6 years ago

Really interesting plots, thanks for sharing!

Each of these inputs is chewed on separately by a convolutional layer, and then the results are concatenated together.

Makes sense. However, you run the risk of the Ho layer learning to latch onto random irrelevant features. Perhaps it's safer to just feed in a scalar measure of font tightness directly, say, scaled no and oo stem-center distances which you measure by hand ahead of time. That would also save you a layer (and therefore time), no?

simoncozens commented 6 years ago

What would be an interesting (and possibly necessary) thing to do is get a neural network to first organise your glyph list into kern groups for you. I just don't know enough about clustering to do that. :(

Well, scikit-learn does it for you for free if you're willing to rasterize the outlines :)

Looks like I don't have to rasterize the outlines:

from sklearn.cluster import DBSCAN
loutlines, routlines, kernpairs, mwidth = loadfont(path, None)
glyphnames = []
outlines = []
for g in safe_glyphs:
  glyphnames.append(g)
  outlines.append(loutlines[g])

outlines = np.array(outlines)
db = DBSCAN(eps=200,min_samples=1).fit(outlines)

gives me:

Estimated number of clusters: 50
['C', 'G', 'O']
['F', 'D', 'N', 'K', 'R', 'B', 'P', 'E', 'L', 'H']
['t']
['e', 'c']
['p', 'n']
['q']
['u']
['l']
['Q']
['two']
['M']
['four']
['i']
['colon']
['y', 'v']
...

which seems like a pretty good start.

skosch commented 6 years ago

Nice!

By the way I just came across an a relevant article today – they used restricted Boltzmann machine networks pre-trained on natural images, which is a pretty cool idea. The results include a dendrogram of letter similarity: https://www.nature.com/articles/s41562-017-0186-2#Sec9 (paywall, but there's SciHub) ... it doesn't directly have practical application, but I think it's another piece of evidence that your approach may have merit (beyond the look-ma-no-hands kerning).