notofonts / notofonts.github.io

Distribution site for Noto fonts
https://notofonts.github.io/
Apache License 2.0

Megamerge with largest possible CJK #33

Closed davelab6 closed 8 months ago

davelab6 commented 8 months ago

In https://github.com/notofonts/noto-docs/issues/35 I requested new megamerged Noto fonts, which are now in this repo, so I am filing this follow-up request here :)

https://github.com/notofonts/notofonts.github.io/tree/main/megamerge says,

No monster scripts. (CJK, Duployan, SignWriting)

However, the user I know who requested this would like the following list of language codes covered: en, ar, zh, zh-TW, nl, en-GB, fr, de, it, ja, ko, pl, pt, ru, es, th, tr

This is a much smaller set of scripts than the 'mega' merge covers. Because of the CJK parts, though, the request amounts to this: start with the Noto Sans fonts needed for the other languages listed, then add CJK characters in some priority order until the font hits the OT1.9 glyph index limit (65,535 glyphs), and perhaps subset the result afterwards for a more reasonable file size.
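The "fill until the limit" idea can be sketched roughly as below. This is only an illustration: `fill_to_limit`, `GLYPH_LIMIT`, and the `reserved` parameter are hypothetical names, and the sketch assumes one glyph per character, which real fonts do not guarantee (ligatures, alternates, and `.notdef` all take glyph slots). How the CJK list is prioritized is left to the caller.

```python
GLYPH_LIMIT = 65535  # glyph IDs are 16-bit, so one font holds at most this many glyphs


def fill_to_limit(base, prioritized_cjk, reserved=0, limit=GLYPH_LIMIT):
    """Return the base codepoints plus as many prioritized CJK codepoints as fit.

    `reserved` sets aside slots for glyphs that have no codepoint of their own
    (an assumption of this sketch; the real number depends on the fonts).
    """
    chosen = list(dict.fromkeys(base))  # de-duplicate while keeping order
    seen = set(chosen)
    budget = limit - reserved
    for cp in prioritized_cjk:
        if len(chosen) >= budget:
            break  # budget exhausted: lower-priority characters are dropped
        if cp not in seen:
            seen.add(cp)
            chosen.append(cp)
    return chosen
```

Passing a generous `reserved` value leaves headroom for unencoded glyphs that the merge pulls in through layout rules.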

simoncozens commented 8 months ago

With the megamerge script producing NotoSansLiving/NotoSerifLiving, the following script will produce a merged CJK font covering just the languages mentioned above:

from fontTools.ttLib import TTFont
from fontTools.merge import Merger, Options
from fontTools.subset import Subsetter
from fontTools.subset import Options as SubsetOptions

included = []
included.extend(range(0x0,0x250)) # Basic Latin through Latin Extended-B
included.extend(range(0x400,0x500)) # Cyrillic
included.extend(range(0x600,0x700)) # Arabic (end of range is exclusive, so 0x700 includes U+06FF)
included.extend(range(0xe00,0xf00)) # Thai
included.extend(range(0x3000,0x3100)) # CJK Symbols and Punctuation, hiragana and katakana
included.extend(range(0x3130,0x3190)) # Hangul Compatibility Jamo
included.extend(range(0xac00,0xd7a4)) # Hangul Syllables
included.extend(range(0x1100,0x1200)) # Hangul Jamo (end of range is exclusive, so 0x1200 includes U+11FF)

# CJK Han ideographs with mappings to other encoding systems seem likely to be the most
# frequently used characters. The han.txt file read below was generated by:
#  wget https://www.unicode.org/Public/UCD/latest/ucd/Unihan.zip
#  unzip Unihan.zip
#  awk '{print $1}' Unihan/Unihan_OtherMappings.txt | sort -u > han.txt

with open('han.txt') as f:
    for line in f:
        # Skip comment and blank lines; keep only 'U+XXXX' codepoint entries
        if line.startswith('U+'):
            included.append(int(line[2:].strip(), 16))

assert len(included) < 65536, 'Too many characters to fit in a single font'
assert 0x9274 in included

def subset_a_font(font):
    subsetter = Subsetter(options=SubsetOptions(notdef_outline=True))
    subsetter.populate(unicodes=included)
    subsetter.subset(font)
    return font

for modulation in ["Sans", "Serif"]:
    lgc = subset_a_font(TTFont(f'Noto{modulation}Living-Regular.ttf'))
    lgc.save(f"Noto{modulation}Living-Regular-subset.ttf")
    # These CJK fonts generated by
    # $ fonttools varLib.instancer -o NotoSansTC-Regular.ttf NotoSansCJKtc-VF.ttf wght=400
    # $ fonttools varLib.instancer -o NotoSerifTC-Regular.ttf NotoSerifCJKtc-VF.ttf wght=400
    # with source files from https://github.com/notofonts/noto-cjk/tree/main/Sans/Variable/TTF
    # and https://github.com/notofonts/noto-cjk/tree/main/Serif/Variable/TTF respectively
    cjk = subset_a_font(TTFont(f'Noto{modulation}TC-Regular.ttf'))
    assert 0x9274 in cjk.getBestCmap()
    cjk.save(f"Noto{modulation}TC-Regular-subset.ttf")

    merger = Merger(options=Options(drop_tables=["vmtx", "vhea", "MATH"]))
    merged = merger.merge([f"Noto{modulation}Living-Regular-subset.ttf", f"Noto{modulation}TC-Regular-subset.ttf"])
    merged.save(f"Noto{modulation}LivingCJK-Regular.ttf")
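One way to sanity-check the merged font against the requested languages is to compare short sample strings with the font's character map. The sketch below is an assumption, not part of the script above: `missing_by_language` and `SAMPLES` are hypothetical names, the sample strings are arbitrary greetings, and a character being present in the cmap does not guarantee that it shapes correctly.

```python
def missing_by_language(covered, samples):
    """Map each language code to the sorted codepoints its sample text lacks."""
    covered = set(covered)
    report = {}
    for lang, text in samples.items():
        needed = {ord(ch) for ch in text if not ch.isspace()}
        missing = sorted(needed - covered)
        if missing:
            report[lang] = missing
    return report


# Arbitrary sample strings for a few of the requested languages (illustrative only)
SAMPLES = {
    "en": "Hello",
    "ru": "Привет",
    "ja": "こんにちは",
    "ko": "안녕하세요",
    "zh-TW": "你好",
}

# With fontTools installed and the merged font on disk, the check could be run as:
#   from fontTools.ttLib import TTFont
#   cmap = TTFont("NotoSansLivingCJK-Regular.ttf").getBestCmap()
#   print(missing_by_language(cmap, SAMPLES))
```

An empty result means every sample character has a cmap entry. Any language that does appear in the report points at a range missing from `included` or a character absent from the source fonts.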