satbyy / go-noto-universal

Noto fonts go universal! Download pan-Unicode, merged Noto fonts according to time of usage (current, ancient) or geographical region (South Asia, SE Asia, Africa-MiddleEast, Europe-Americas).

Why does GoNotoCurrent not render Korean glyphs whereas GoNotoCJKCore does? #39

Open xplip opened 2 years ago

xplip commented 2 years ago

Thank you for providing this great library! I am currently trying to render text in various languages with the pygame library and it seems that when I am using GoNotoCurrent, I can render Japanese and Chinese glyphs just fine, but Korean glyphs are only rendered as empty boxes. When I am using GoNotoCJKCore, Korean is rendered properly as well, so I am wondering what the main difference between the two is. I can get around the issue by rendering my texts with the Pillow library and a libraqm layout engine which builds on harfbuzz, but this is horribly slow, so I'd prefer to keep using pygame and get it to work with GoNotoCurrent. Do you have an idea why rendering Korean might not work in my setup?

satbyy commented 2 years ago

Hi Phillip, thanks for the bug report.

The reason is that GoNotoCurrent does not include the "Hangul Syllables" Unicode block (U+AC00 to U+D7AF), whereas GoNotoCJKCore does. This block contains 11,172 codepoints and at least as many glyphs. However, GoNotoCurrent is already at ~61,000 glyphs in the font file, and the maximum is 65,535 (a limit imposed by the font format spec). Hence there is not enough "glyph space" to include all of the Hangul syllables, so there is not much that can be done.

One option is to find a smaller subset (say ~2500 glyphs) of the 11K codepoints and include them in GoNotoCurrent, still honouring the 64K limit. Obviously this leaves out a large chunk of the Korean repertoire, so it is of little practical utility.
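The glyph budget described here can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the approximate figure of 61,000 glyphs quoted in this comment and one glyph per added codepoint; the constant and function names are illustrative and not part of the project's scripts.

```python
# Glyph IDs are 16-bit, so a single font can hold at most 65,535 glyphs.
MAX_GLYPHS = 0xFFFF

# Precomposed Hangul syllables are assigned at U+AC00..U+D7A3 inclusive.
HANGUL_SYLLABLES = 0xD7A3 - 0xAC00 + 1  # 11,172 codepoints


def fits(current_glyphs: int, extra_glyphs: int) -> bool:
    """Return True if adding extra_glyphs stays within the glyph limit."""
    return current_glyphs + extra_glyphs <= MAX_GLYPHS


current = 61_000  # approximate glyph count of GoNotoCurrent, per the comment
print(fits(current, HANGUL_SYLLABLES))  # the full Hangul Syllables block
print(fits(current, 2_500))             # a hypothetical smaller subset
```

The check only covers the glyph count. Other tables in the font have their own size limits, so passing it does not guarantee that the merged font can be built.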

dscorbett commented 2 years ago

Many precomposed syllables are not actually used in Korean. You could use KS X 1001’s list of 2,350 common Hangul syllables.

satbyy commented 2 years ago

@dscorbett I gave it a try on my local machine (KSX1001 subset), but now we hit the cmap format 4 table limit of 65535. Such subsetting causes fragmentation of "Hangul Syllables" block (U+AC00 to U+D7AF) -- the subset ttf's cmap 4 table is about 13000 length whereas GoNotoCurrent is already at 64706, so the total 77666 > 65535
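To see why a fragmented subset inflates the cmap format 4 subtable, it helps to count contiguous runs of codepoints. The sketch below is a rough estimate only: it assumes one segment per run, each mapped through idDelta with no glyphIdArray, plus the mandatory terminating 0xFFFF segment. A real compiler such as fontTools may lay out segments differently, and the function names are made up for illustration.

```python
def count_runs(codepoints) -> int:
    """Count maximal runs of consecutive codepoints."""
    runs = 0
    previous = None
    for cp in sorted(set(codepoints)):
        if previous is None or cp != previous + 1:
            runs += 1
        previous = cp
    return runs


def estimate_format4_length(codepoints) -> int:
    """Estimate the subtable size in bytes.

    Format 4 has 16 fixed bytes (header fields plus reservedPad) and four
    16-bit arrays with one entry per segment, i.e. 8 bytes per segment.
    """
    seg_count = count_runs(codepoints) + 1  # +1 for the final 0xFFFF segment
    return 16 + 8 * seg_count


whole_block = range(0xAC00, 0xD7A4)        # contiguous: a single run
every_other = range(0xAC00, 0xD7A4, 2)     # artificial worst-case subset

print(count_runs(whole_block), estimate_format4_length(whole_block))
print(count_runs(every_other), estimate_format4_length(every_other))
```

The subtable's length field is 16 bits wide, which is where the 65535 ceiling comes from: a contiguous block costs almost nothing, while a scattered subset pays roughly 8 bytes per gap.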

satbyy commented 2 years ago

Or maybe I'm looking at this the wrong way. Attached below is the rendering of Korean wikipedia homepage, using GoNotoCurrent.ttf. It seems that the initial + final components are not combined/stacked correctly. Am I dropping some tables unknowingly?

[Image: korean-wiki]

dscorbett commented 2 years ago

I forgot about 'cmap' fragmentation. I guess that idea won’t work.

The syllables are exploded because the lookups that join them together are not applied when the language system is Korean. I’m not sure why.
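For readers unfamiliar with what "exploded" means here: every precomposed Hangul syllable corresponds arithmetically to a sequence of two or three conjoining jamo, as defined by the Unicode Standard. When a font lacks the precomposed glyph, a shaping engine may fall back to the jamo sequence, which then depends on the font's lookups to stack correctly. The sketch below only illustrates that mapping; it does not diagnose why the lookups were skipped, and the function name is illustrative.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # 19 * 21 * 28 = 11,172 syllables


def decompose(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = index % T_COUNT  # 0 means the syllable has no final consonant
    jamo = chr(lead) + chr(vowel)
    if trail:
        jamo += chr(T_BASE + trail)
    return jamo


jamo = decompose("한")
print([f"U+{ord(ch):04X}" for ch in jamo])
# The result matches Unicode canonical decomposition (NFD).
print(jamo == unicodedata.normalize("NFD", "한"))
```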

xplip commented 2 years ago

Thanks a lot for the explanations and taking a stab at it already! I wasn’t aware Korean relied so heavily on the precomposed syllables. If the glyph limit is reached then I suppose there is not so much that can be done.

I think for my personal use case, having Korean in the font is more important than the Math, Music, and Symbol Fonts, though. I quickly tried to rebuild the GoNotoCurrent font without those four (NotoSansSymbols-Regular.ttf, NotoSansSymbols2-Regular.ttf, NotoSansMath-Regular.ttf, NotoMusic-Regular.ttf) in the categories.sh and with this file https://raw.githubusercontent.com/sozysozbot/korean_hanja_sound/master/KSX1001.txt passed to pyftsubset via the --unicodes-file flag in create_korean_hangul_subset().

Out came a font file that seems to render my Korean sample texts fine. The command otfinfo -g GoNotoCurrent.ttf | wc -l returns 65251, so it looks like it didn't go over the glyph limit. I'm not really confident any of this was the correct approach, though, so I would appreciate it a lot if you could double-check this :)
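As a cross-check on the otfinfo count, the glyph total can also be read straight from the font's maxp table. The following is a minimal standard-library sketch, assuming a single-font TTF/OTF file (not a TTC collection); the function name is illustrative, and the file path in the usage note is simply the output name mentioned in this comment.

```python
import struct


def num_glyphs(data: bytes) -> int:
    """Read numGlyphs from the 'maxp' table of a single sfnt font."""
    # sfnt header: uint32 version, uint16 numTables, then three uint16 fields.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    for i in range(num_tables):
        # Each table record is 16 bytes: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", data, 12 + 16 * i
        )
        if tag == b"maxp":
            # maxp starts with a 4-byte version, followed by uint16 numGlyphs.
            return struct.unpack_from(">H", data, offset + 4)[0]
    raise ValueError("no 'maxp' table found")


# Tiny synthetic example: one table record pointing at a 6-byte maxp table.
fake_font = (
    struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    + struct.pack(">4sIII", b"maxp", 0, 28, 6)
    + struct.pack(">IH", 0x00005000, 65251)
)
print(num_glyphs(fake_font))
```

On a real file this would be called as `num_glyphs(open("GoNotoCurrent.ttf", "rb").read())`; any value up to 65535 is within the limit.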

satbyy commented 2 years ago

@xplip Yes, that is a good approach and that's all there is to it. Enjoy your new font!

rxsto commented 1 year ago

Hey there, sorry for bringing this topic up again, I originally thought I could just follow the steps proposed by @xplip and generate a GoNotoCurrent file with increased support for Korean Hangul syllables, but when trying to run the temporal_fonts.sh after both editing categories.sh (to remove the symbols, math and music fonts) and injecting the KSX1001.txt via the unicodes file flag in helper.sh at line 254, the process just randomly crashes.

The stacktrace is as follows:

Traceback (most recent call last):
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/bin/pyftmerge", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 372, in wrapper
    return func(*args, **kwds)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/merge/__init__.py", line 201, in main
    font.save(outfile)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 185, in save
    writer_reordersTables = self._save(tmp)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 225, in _save
    self._writeTable(tag, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 654, in _writeTable
    self._writeTable(masterTable, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 654, in _writeTable
    self._writeTable(masterTable, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 658, in _writeTable
    tabledata = self.getTableData(tag)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 680, in getTableData
    return self.tables[tag].compile(self)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 132, in compile
    glyphData = glyph.compile(self, recalcBBoxes)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 673, in compile
    data = data + self.compileComponents(glyfTable)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 903, in compileComponents
    data = data + compo.compile(more, haveInstructions, glyfTable)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 1469, in compile
    return struct.pack(">HH", flags, glyphID) + data
struct.error: 'H' format requires 0 <= number <= 65535

From what I can tell this exception gets thrown whilst trying to merge the base font files into the big single font file.

Since I am unfortunately pretty new to this field I am quite clueless on what to do in order to fix this issue. The last logs before this exception happens are always different, so there's nothing that would help debugging it. The first issue I was thinking of was that maybe there might somehow be too many glyphs to fit into the font file. Confusingly enough this exception occurred even after removing more fonts from the categories.sh file.

I am running the temporal_fonts.sh file on WSL2 22.04, and I think it could potentially be related to that, since the crashes appear so inconsistently.

Any help or hint on how to get this working would be greatly appreciated! Thanks so much for the awesome work :)

rubiomiguel06 commented 1 year ago

I am facing the same issue as @rxsto. I am running the script on macOS Ventura, and after following the steps proposed by @xplip, I am getting the exact same stacktrace. Were you able to fix it, @rxsto?

rubiomiguel06 commented 1 year ago

For the record:

I have managed to fix the issue I was facing. Basically, there were more glyphs than what the spec allows (64K). Thus, the error struct.error: 'H' format requires 0 <= number <= 65535.
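The error can be reproduced in isolation. The traceback shows fontTools packing a component's glyph ID with the "H" format (an unsigned 16-bit integer), so any glyph ID above 65535 fails at that point. A minimal sketch with an illustrative helper name:

```python
import struct


def can_encode_glyph_id(glyph_id: int) -> bool:
    """Return True if glyph_id fits in an unsigned 16-bit field."""
    try:
        struct.pack(">H", glyph_id)
    except struct.error:
        return False
    return True


print(can_encode_glyph_id(65535))  # last valid glyph ID
print(can_encode_glyph_id(65536))  # one past the limit
```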

@xplip's explanation is good, but to make it clearer and easier, I would replace the following line:

... and with this file https://raw.githubusercontent.com/sozysozbot/korean_hanja_sound/master/KSX1001.txt passed to pyftsubset via the --unicodes-file flag in create_korean_hangul_subset().

with:

In helper.sh, inside the function create_korean_hangul_subset(), add the following codepoints:

codepoints+="U+AC00-D7A3," # Hangul syllables

That way all the Hangul syllables are added to the korean subset font and the glyph count limit is respected.

I hope I am not skipping any important glyphs for Korean. All my tests were successful, so I don't think so.

stephen0z commented 1 year ago

AFAIK, open source font projects, especially large fonts with many glyphs, usually ship their fonts as 2 files.

Take the "Hanazono fonts" as an example: https://osdn.net/projects/hanazono-font/

They release their font Hanazono in 2 files: HanaMinA.ttf HanaMinB.ttf

HanaMinA.ttf contains the more commonly used CJK glyphs, and HanaMinB.ttf contains the less commonly used ones.

Most systems nowadays (Windows, *nix, Android) can be set to use them as a pair.

2 files of up to 65,535 glyphs each should be enough for daily use.

satbyy commented 1 year ago

@stephen1864 Thanks, that is a good idea to create two "A" and "B" fonts, one with Korean glyphs and one without them. I could work on it in the coming days or weeks.

user6905 commented 1 year ago

I also ran into the issue of the missing Korean symbols. But as I'm not very experienced with font creation (what must be in and what not), I could not follow all the discussion here.

I think the workaround from xplip is the one I need (I can easily skip the Math, Music, and Symbol fonts, but I need Korean), but currently I have no idea how to create the font correctly. Would it be an option to provide that font too (or to provide it to me in some way)? This would help me very much.

On the other hand, the separation into GoNotoCurrent A and B fonts may help, if the A font is similar to GoNotoCurrent with Korean.

As I would like to use the font for embedding in a PDF, I think using it as a pair may not work. I need to use one TTF font.

stephen0z commented 1 year ago

@user6905 For embedding in a PDF, it is best to include only what is needed, otherwise the PDF will grow extremely large. If you don't need the Math, Music, and Symbol fonts but do need complete Korean, you may go directly to the Noto fonts, which are the source of this project, and choose a suitable one:

https://fonts.google.com/noto/fonts?noto.lang=ko_Kore&noto.continent=Asia&noto.script=Kore

user6905 commented 1 year ago

@Stephen: That does not help in my case. The font must be as universal as possible because of international use, but the content is limited to technical conversation. For this reason I can skip the Math, Music, and Symbol fonts. But I know Korean is used, so a single Noto font makes no sense. I am searching for a better replacement for UniFont, so GoNotoCurrent would be perfect (much better than UniFont) if it supported Korean. In general, size is not that bad in a PDF: the font can be subsetted, and because of some pictures the font is not the only reason why the PDF gets a bit bigger. So I really need a universal font like GoNotoCurrent with Korean symbols.

rubiomiguel06 commented 1 year ago

@user6905 here's the font I've created back when I participated in this thread. Feel free to use it and test it in your specific scenario. GoNotoCJKCore.zip

I don't remember the details of what IS and what IS NOT included. But you can check by yourself.

user6905 commented 1 year ago

Thank you Miguel. Meanwhile I installed Ubuntu and was able to use your fix.

Starting from GoNotoCurrent, I built my own font based on that.

@satbyy: May you consider to include that Font in your collection? I think that can be helpful for some others too.

@satbyy: BTW - Is GoNoto... a correct name for the fonts? According to OFL License I thought you must not use reserved names (RFNs). And Noto is a TM of Google.

satbyy commented 1 year ago

@user6905 and all, can you please download the font from the CI pipeline? Now there are two variants:

If you are satisfied, I will close this issue and make a new release.

user6905 commented 1 year ago

Generally the script works well on Ubuntu, and the created font includes the Korean signs. I can confirm that. Thanks a lot. I don't have a full test suite, but everything looks fine for several Asian languages.

I only wonder why GoNotoCurrent-Regular.ttf from your zip file has only 14.669.722 bytes. Mine has 15.485.612 bytes and 64623 glyphs. I have not tested your font from the zip so far.

evilaliv3 commented 3 months ago

Amazing, thank you @satbyy and @xplip

We are using this recipe, and specifically the Kurrent font, within the @globaleaks project together with the FPDF2 library.

This makes it possible for us to produce PDFs that can render text coming from any international user!