lvgl / lv_font_conv

Converts TTF/WOFF fonts to compact bitmap format
https://lvgl.io/tools/fontconverter
MIT License

Converter misses opportunity to detect identical glyphs, stores them as separate images #120

Open pavmick opened 3 weeks ago

pavmick commented 3 weeks ago

As the title says. I am converting ASCII and Cyrillic ranges. The letter A, for example, is present in both ranges and it is being stored twice. Interestingly, the stored images are slightly different. Same for other identical glyphs. It should not be too troublesome to detect duplicate glyphs and store one copy only.

kisvegabor commented 2 weeks ago

How can we know if the ASCII A is the same as the Cyrillic A? By comparing the rasterized images?

pavmick commented 2 weeks ago

I believe font files have facilities that allow different Unicode code points to reference the same glyph. For example, you can go to https://fontdrop.info/ , load arial.ttf, scroll down to Unicode 0410 (Cyrillic letter A), click on it, and observe "This composite glyph is a combination of: glyph 36". If you click on the letter A from the ASCII range (close to the top of the table), you'll see the same index, 36.
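To make the idea concrete, here is a minimal sketch in C of how shared glyphs could be detected once the code-point-to-glyph mapping has been read from the font. The `cmap_entry_t` type and both function names are hypothetical and are not part of lv_font_conv; the sketch assumes the mapping has already been flattened into an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened mapping entry: one Unicode code point and the
 * glyph index it resolves to. Not a real lv_font_conv or TTF structure. */
typedef struct {
    uint32_t code_point;
    uint16_t glyph_id;
} cmap_entry_t;

/* Return the index of the first entry before `i` that maps to the same
 * glyph as entry `i`, or -1 if entry `i` is the first user of its glyph. */
int find_first_with_same_glyph(const cmap_entry_t *map, size_t i)
{
    assert(map != NULL);
    for (size_t j = 0; j < i; j++) {
        if (map[j].glyph_id == map[i].glyph_id) {
            return (int)j;
        }
    }
    return -1;
}

/* Count entries whose glyph is already referenced by an earlier entry.
 * Each such entry is a candidate for reusing an existing bitmap. */
size_t count_shared_glyphs(const cmap_entry_t *map, size_t count)
{
    size_t shared = 0;
    for (size_t i = 0; i < count; i++) {
        if (find_first_with_same_glyph(map, i) >= 0) {
            shared++;
        }
    }
    return shared;
}
```

The quadratic scan keeps the sketch short; a real converter would more likely key a hash map on the glyph index.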

kisvegabor commented 2 weeks ago

How many glyphs can be affected by that? I estimate at most 1% (but probably closer to 0.1%). What do you think?

pavmick commented 2 weeks ago

Let's see. For the Russian alphabet, I would say 11 uppercase and 8 lowercase letters share glyphs with ASCII. That would be 15% of ASCII range.

kisvegabor commented 1 week ago

Okay, it's really significant in this case.

So the task is to make the duplicated glyphs point to the same bitmap, right? If so, I'm okay with this feature. However, I'm not a JS developer and can't work on the implementation.

Do you have time and interest to implement it?

cc @puzrin

puzrin commented 1 week ago

Guys, before discussing any changes, it's worth providing proof that the source font has multiple character codes mapped to the same image. If source images are different, that's the intent of the font authors, not a converter issue.

The TTF format has different tables for "images" and "char codes." AFAIK, if an image has multiple references from char codes, the converter should preserve them (but I'm not sure and don't remember the details).

puzrin commented 1 week ago

It is also worth using the binary format as the baseline. The "lvgl" one is less optimal and focused on a text representation of the source. The binary format is a close subset of TTF, with minor local changes to store raster/compressed data instead of vectors.

pavmick commented 1 week ago

So I looked closer at arial.ttf using the fontdrop.info online tool. I can confirm that Russian letters АВЕМНОРТХаенорсух share glyphs with regular ASCII letters. That's 17 glyphs. This set could vary slightly from font to font, but I don't expect major variations. I am mostly an embedded C developer with some knowledge of JS, but I'll see if I can dive into the code and suggest patches.

puzrin commented 1 week ago

> So I looked closer at arial.ttf using the fontdrop.info online tool. I can confirm that Russian letters АВЕМНОРТХаенорсух share glyphs with regular ASCII letters. That's 17 glyphs.

And did you use the same font in the converter when you found the duplicated images? And does the same problem occur in the binary format?

pavmick commented 1 week ago

> And did you use the same font in the converter when you found the duplicated images? And does the same problem occur in the binary format?

Just ran the converter on arial.ttf. Yes, the glyphs in question are duplicated. This time exact copies, to the last bit. I am not using the binary font format in my applications, so I can't confirm this behavior with it.
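Since the copies are bit-exact, one possible fallback is to deduplicate on the rasterized bytes themselves. The following is only a sketch under that assumption: `glyph_bitmap_t`, `bitmaps_identical`, and `dedup_bitmaps` are hypothetical names, not converter APIs, and the real converter is written in JS rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical rasterized glyph: packed pixel bytes plus its bounding box. */
typedef struct {
    const uint8_t *data;
    size_t size;
    uint8_t box_w;
    uint8_t box_h;
} glyph_bitmap_t;

/* Two bitmaps are identical only if box size, byte count, and every byte match. */
int bitmaps_identical(const glyph_bitmap_t *a, const glyph_bitmap_t *b)
{
    return a->box_w == b->box_w && a->box_h == b->box_h &&
           a->size == b->size &&
           (a->size == 0 || memcmp(a->data, b->data, a->size) == 0);
}

/* Fill canonical[i] with the index of the first glyph whose bitmap is
 * identical to glyph i (i itself when the bitmap is unique so far).
 * Returns the number of distinct bitmaps that actually need storing. */
size_t dedup_bitmaps(const glyph_bitmap_t *glyphs, size_t count, size_t *canonical)
{
    assert(canonical != NULL || count == 0);
    size_t unique = 0;
    for (size_t i = 0; i < count; i++) {
        canonical[i] = i;
        for (size_t j = 0; j < i; j++) {
            /* Only compare against glyphs that are themselves canonical. */
            if (canonical[j] == j && bitmaps_identical(&glyphs[i], &glyphs[j])) {
                canonical[i] = j;
                break;
            }
        }
        if (canonical[i] == i) {
            unique++;
        }
    }
    return unique;
}
```

Comparing bytes also catches duplicates that do not share a glyph index in the source font, but it would not merge the "slightly different" images mentioned at the start of the thread.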

puzrin commented 1 week ago

There is a chance we skipped deduplication to save time. But that's 100% not a restriction of the internal [binary] format (I don't remember about lvgl).

kisvegabor commented 1 week ago

In LVGL, a glyph can also reference any bitmap_index. See:

 {.bitmap_index = 1307, .adv_w = 128, .box_w = 8, .box_h = 8, .ofs_x = 0, .ofs_y = -1},
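As an illustration of what sharing would look like on the LVGL side, here is a minimal sketch with two descriptors pointing at the same bitmap offset. `glyph_dsc_t` is a simplified stand-in that only mirrors the fields shown above, not LVGL's actual descriptor type, and `example_dsc` and `glyph_bitmap_ptr` are hypothetical names. The second entry is assumed to belong to a Cyrillic letter that reuses the Latin bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a glyph descriptor, mirroring the fields above. */
typedef struct {
    uint32_t bitmap_index; /* offset of the glyph's pixels in the bitmap pool */
    uint16_t adv_w;
    uint8_t box_w;
    uint8_t box_h;
    int8_t ofs_x;
    int8_t ofs_y;
} glyph_dsc_t;

/* Illustrative entries: two different characters share one stored bitmap. */
const glyph_dsc_t example_dsc[] = {
    {.bitmap_index = 1307, .adv_w = 128, .box_w = 8, .box_h = 8, .ofs_x = 0, .ofs_y = -1},
    {.bitmap_index = 1307, .adv_w = 128, .box_w = 8, .box_h = 8, .ofs_x = 0, .ofs_y = -1},
};

/* Resolve a descriptor to the start of its pixel data in the bitmap pool. */
const uint8_t *glyph_bitmap_ptr(const uint8_t *bitmap_pool, const glyph_dsc_t *dsc)
{
    assert(bitmap_pool != NULL && dsc != NULL);
    return &bitmap_pool[dsc->bitmap_index];
}
```

With this layout the bitmap bytes are stored once, and only the small per-glyph descriptor is repeated.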