justinpenner / TalkingLeaves

A GlyphsApp plugin to help you explore the world’s languages and writing systems
MIT License

“What happens if I remove ...?” #22

Open · justanotherfoundry opened this issue 3 months ago

justanotherfoundry commented 3 months ago

Thanks for this really helpful tool! I have used it quite a lot in the last few days to check and refine my character set.

Many years ago, I wrote a script that does something similar but checks against the Unicode CLDR, more specifically, the way Font Book on macOS determines the supported languages (which seems to be based on the CLDR).

I just uploaded the code: https://github.com/justanotherfoundry/font-production/tree/master/import%20CLDR and https://github.com/justanotherfoundry/freemix-glyphsapp/blob/master/Font%20Book%20Checker.py

As you can see, this is a very similar approach.
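
For reference, the per-language character sets live under ldml/characters/exemplarCharacters in the CLDR locale files. A rough sketch of reading one (the path is a placeholder, and the naive split ignores ranges and multi-character sequences, which a proper UnicodeSet parser has to handle):

import xml.etree.ElementTree as ET

# Placeholder path into a local CLDR checkout
tree = ET.parse('cldr/common/main/bg.xml')
characters = tree.getroot().find('characters')

# The main exemplar set is the exemplarCharacters element without a type attribute
main = next(el.text for el in characters.findall('exemplarCharacters')
            if 'type' not in el.attrib)

# Naive tokenization; ranges like "а-я" and {multi-char} sequences need real parsing
print(main.strip('[]').split())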

My script also determines characters that are not required in any of the supported languages. In other words, deleting these characters would not change the list of supported languages. This is a good way of finding useless characters in the character set as they are used only in languages that are not fully supported anyway. (I believe most of the fonts out there have a lot of these useless characters.)
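
In set terms, the idea is simply this (with made-up character sets standing in for the real CLDR/Hyperglot data):

# Made-up data: each language maps to the characters it requires
languages = {
    'Danish': set('abcdefghijklmnopqrstuvwxyzæøå'),
    'Dutch': set('abcdefghijklmnopqrstuvwxyzĳ'),
}
font_chars = set('abcdefghijklmnopqrstuvwxyzæøåð')

supported = {lang for lang, chars in languages.items() if chars <= font_chars}
needed = set().union(*(languages[lang] for lang in supported))

# Characters whose removal cannot shrink the list of supported languages
print(sorted(font_chars - needed))   # ['ð']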

Before simply relying on the CLDR and removing all these characters, I'd like to check what TalkingLeaves and Hyperglot say about them (I don't yet have much of an opinion on which of the two is more correct). So, what happens if I remove this or that character? Will the list of supported languages as per TalkingLeaves change? At the moment, checking this is very tedious: essentially, I need to go through the list of incomplete languages and try to spot the languages where all the missing characters are among the ones I just deleted.

Would it be possible to have TalkingLeaves output the list as text in the Macro Panel? Then I could simply start TalkingLeaves, copy the output, delete some characters, start TalkingLeaves again, and do a text diff. If nothing changes then the deleted characters were indeed useless as per CLDR as well as Hyperglot.
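
For example, with the two outputs saved to text files (the file names are just placeholders), the diff itself could be done right in the Macro Panel:

import difflib

# Supported-language lists saved before and after deleting the glyphs
before = open('langs_before.txt').read().splitlines()
after = open('langs_after.txt').read().splitlines()

for line in difflib.unified_diff(before, after, lineterm=''):
    print(line)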

As a side note, it would be nice to have a palette that gives information on the currently selected glyph (or glyphs): which languages it is required for according to Hyperglot and according to the CLDR, plus the number of speakers, and whether these languages are complete or not. Plus, maybe a Wikipedia link. Then I could make up my mind whether to keep or remove the glyphs, one by one. Maybe I will write something like that at some point.

justinpenner commented 3 months ago

I love that palette idea! It would be interesting to see what languages require the selected character(s). A lot of Unicode characters have a Wikipedia article so that could work, too.

I don't know much about the CLDR yet, so I definitely need to dig into that and see how it might be useful for TalkingLeaves. The update I pushed yesterday adds a data.py module which lays some of the groundwork for integrating more data sources like Shaperglot and CLDR. It'll be some work to figure out the best ways to merge the data together and deal with differences when, for example, Hyperglot and Shaperglot may have slightly different orthography definitions for the same language.
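
A sketch of the kind of comparison I mean, with hypothetical per-source dicts mapping a language to its required characters (not the actual data.py structures):

# Hypothetical inputs: {language: set of required characters} per data source
def compare_orthographies(hyperglot, shaperglot):
    report = {}
    for lang in hyperglot.keys() & shaperglot.keys():
        a, b = hyperglot[lang], shaperglot[lang]
        report[lang] = {
            'agreed': a & b,
            'hyperglot_only': a - b,
            'shaperglot_only': b - a,
        }
    return report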

I think once I've begun integrating multiple data sources beyond just Hyperglot, then it would be useful to work on making TalkingLeaves more usable as an API, for users who want to write scripts. For now, you can already write scripts that import it as a module, with the big caveat that your scripts might break when TalkingLeaves is updated.

Here's an example of how you could print a list of chars in the font that aren't used by any languages that your font has completed:

from TalkingLeaves.data import Data
from TalkingLeaves.utils import flatten

data = Data()

# Don't need this table, but it generates data.completeLangs
_ = data.langsAsTable('Latin', Glyphs.font, True, True)

completeNames = list(data.completeLangs.loc[:, 'name'])
completeLangs = data.langs[data.langs['name'].isin(completeNames)]
completeCharsets = list(completeLangs.loc[:, 'chars'])
completeChars = set(flatten(completeCharsets))
fontChars = set(g.string for g in Glyphs.font.glyphs)

# Unneeded chars
print(sorted(fontChars - completeChars))

From there you may want to filter out punctuation and symbols, since Hyperglot's orthography definitions only cover letters and marks.
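
Something like this should do it, using Python's unicodedata module:

import unicodedata

# Keep only Letter (L*) and Mark (M*) categories; punctuation, symbols,
# digits, and spaces are dropped from the report
unneeded = fontChars - completeChars
print(sorted(c for c in unneeded if unicodedata.category(c)[0] in 'LM'))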

jenskutilek commented 3 months ago

As a side note, it would be nice to have a palette that gives information on the currently selected glyph (or glyphs): which languages it is required for according to Hyperglot and according to the CLDR, plus the number of speakers, and whether these languages are complete or not. Plus, maybe a Wikipedia link. Then I could make up my mind whether to keep or remove the glyphs, one by one. Maybe I will write something like that at some point.

I wrote a plugin that shows such a palette for the current glyph, also based on Unicode CLDR: https://github.com/jenskutilek/UnicodeInfo-Glyphs

[screenshot of the Unicode Info palette]

Feel free to reuse/adapt parts if you need any :)

justanotherfoundry commented 3 months ago

@justinpenner Thanks! That looks really promising. However, I am getting a beach ball when I run this code. Also, the regular TalkingLeaves now crashes. I will submit this as a new issue.

justanotherfoundry commented 3 months ago

@jenskutilek Aha, I knew someone must have had this kind of idea. If I pursue it any further, I'll surely build on your plugin. Thanks!

justanotherfoundry commented 3 months ago

I played around with the “unneeded characters” script above. The list includes the Cyrillic Ѐ, which is, according to Hyperglot, required for Bulgarian, and Bulgarian is complete in my font. If I delete this glyph from the font, TalkingLeaves shows Bulgarian as incomplete with only this character missing. This means the list of “unneeded characters” reported by the script is not what I would expect it to be. I am looking for characters I can remove without reducing the list of complete languages.

justinpenner commented 3 months ago

Does it work if you change the first argument in this line to Cyrillic instead of Latin?

_ = data.langsAsTable('Latin', Glyphs.font, True, True)

If you already tried that, then I'm not sure what's going wrong. It works correctly for me if I create a font and add Bulgarian to it via TalkingLeaves. The script reports no unneeded Cyrillic chars for me:

[' ', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
justanotherfoundry commented 3 months ago

Does it work if you change the first argument in this line to Cyrillic instead of Latin?

It does! Seems like we’d have to loop over several scripts? Sorry, I don’t know enough about pandas so I cannot debug the code.

justinpenner commented 3 months ago

Seems like we’d have to loop over several scripts? Sorry, I don’t know enough about pandas so I cannot debug the code.

Exactly, you can loop over multiple scripts like this, and you don't need to add any new pandas code:

from TalkingLeaves.data import Data
from TalkingLeaves.utils import flatten

data = Data()

# Don't need this table, but it generates data.completeLangs
completeNames = []
for script in ['Latin', 'Cyrillic']:
    _ = data.langsAsTable(script, Glyphs.font, True, True)
    completeNames.extend(list(data.completeLangs.loc[:, 'name']))

completeLangs = data.langs[data.langs['name'].isin(completeNames)]
completeCharsets = list(completeLangs.loc[:, 'chars'])
completeChars = set(flatten(completeCharsets))
fontChars = set(g.string for g in Glyphs.font.glyphs)

# Unneeded chars
print(sorted(fontChars - completeChars))

By the way, I found pandas surprisingly easy to learn. It's very Pythonic, and I found it easier to understand after looking up a "cheat sheet" of common commands. I've only scratched the surface so far, but it didn't take long to learn the basics so I could plug it into TalkingLeaves.
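
For example, the .isin() filter used in the script above is just this pattern, shown here on a toy DataFrame:

import pandas as pd

# Toy stand-in for the language table
langs = pd.DataFrame({
    'name': ['Bulgarian', 'Danish', 'Dutch'],
    'chars': [['а', 'б'], ['æ', 'ø'], ['ĳ']],
})
complete = ['Bulgarian', 'Danish']

# Keep only the rows whose 'name' is in the list
print(langs[langs['name'].isin(complete)])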

justanotherfoundry commented 2 months ago

@jenskutilek I implemented what I described above, on the basis of your Unicode Info plug-in: https://github.com/justanotherfoundry/UnicodeInfo-Glyphs Did you get my e-mail?

jenskutilek commented 2 months ago

@justanotherfoundry it took me a while to answer, sorry for that! I hope you got my reply by now.