[gen-metadata] More aggressively declare subsets in METADATA files

davelab6 commented 2 years ago

fonts.google.com launched a new specimen page last week, that reuses the improvements developed for the fonts.google.com/noto section specimen pages last year.

However, this has highlighted a problem with Kumbh Sans and Nabla, where they had Google Fonts Latin Plus glyph set support, but their METADATA files didn't declare the math subset, and therefore the fonts API stripped those characters from the fonts served to the specimen page, which was trying to show the full glyph set. The Kumbh Sans designer reported this (https://github.com/google/fonts/issues/5067) with a screenshot.

In https://github.com/google/fonts/pull/5095 I add the math subset to the 2 families we've noticed this issue surfaced, and I believe the root cause is that the METADATA generator has a threshold for when it auto-adds subsets to a METADATA file, such that when a font has only 1% of the math subset's characters, it won't add it but when 100% of latin-core subset is in the font, it is added. I forget the details on the threshhold, I think it was set per subset.

It seems that we probably ought to be more aggressive about declaring subsets, like if even 1 character from a subset exists in any font in the family, it should be declared in METADATA, and then using comments to include the % the tool found to inform later hand curation of the METADATA file.

We probably then need to make a special version of the METADATA generator to lint all the existing library and then PR a bunch of updates to a bunch of families.

It would also be good for the linter to tell us what unicode characters are in any font in a family that are not in any declared subset encoding, and not in any encoding (https://github.com/googlefonts/glyphsets/tree/main/Lib/glyphsets/encodings)

rsheeter commented 2 years ago

Would be good to improve CJK detection as well. We have an ongoing issue where that tool adds many CJK subsets when only one is appropriate. The "I only have an hour" approach would be to simply prompt the user if this occurs and make them pick which one(s) they want. Given more time, identify characters that strongly suggest (high frequency, only in that script) support for specific CJKs and use them to autopick.

m4rc1e commented 2 years ago

We discussed this issue in our team meeting last Friday.

We came to the conclusion that decreasing the number of glyphs needed in order to activate a subset won't work because users have the ability to search for fonts by subsets. If we start enabling subsets just because the font contains a single glyph within a given subset, users will get annoyed because the font doesn't fully support the subset fully.

Language drop down enables users to search for fonts by subset

@nathan-williams Is there a reason why we're not using the font's cmap in order to construct the glyphs palette? if we simply used the cmap, we wouldn't get fallback glyphs.

nabla on sandbox showing fallback glyphs because they don't exist in the font being used to display the glyphs

cc @chrissimpkins @davelab6

davelab6 commented 2 years ago

However, the CMAP of the fonts in the API is a subset of the CMAP of the upstream fonts, because the upstream HAS the glyphs:

But because the subset that contains the character isn't declared, the character is removed by the API and so unavailable to the Catalog:

I personally believe the subsets that cover the glyphs in the font should be enabled, so that all glyphs can be accessed; the language dropdown should be fixed, so that filtering happens by actual script support, even if the subset is included - because the API deleting glyphs is a foundational issue with the API that should be fixed as a priority, and then the Catalog should accommodate a correct API.

But @vv-monsalve and @m4rc1e make a fair argument that fixing the Catalog glyph table preview issue by modifying the API subsets configured for families will break the Catalog language filter, and the Catalog language filter is a more important Catalog feature, so we should not modify the subsets as they currently work, and we should modify the Catalog glyph table preview functionality to fix its issue.

Taking a higher level view, I think there are 3 semantics which are being conflated in the GF system architecture:

unicode-range subset definitions, which are called encodings in the github.com/googlefonts/glyphsets repo
- These are long overdue for a detailed analysis and investigation, based on maturity of the glyphsets in that repo which are what type designers are expected to develop
catalog unicode standard segmentation/grouping for the glyph table preview, which is in the github.com/googlefonts/lang repo developed by @twardoch
- which maybe is okay as-is, or maybe should be modified to take into account not only [Unicode General Categories], (https://en.wikipedia.org/wiki/Unicode#General_Category_property) but also the unicode-range subset definitions
actual language support of a family
- I propose this should be addressed by advancing the shaperglot project and then integrating that into the GF back-end

We should bring these into harmony and use the appropriate data for the appropriate Catalog feature, but that's a larger (2023) effort.

Therefore I think the short term solution to the 'unavailable glyphs being seen' issue is not, as per this issue title, to _more aggressively declare subsets in METADATA files), but rather, the Catalog front-end code team (@nathan-williams :) should fix it in that code.

But to start on bringing those things into harmony, for (1) we should get better data on which characters exist in which font files that do not exist in any characters defined in the families' METADATA subset associations, and I'd like to ask @m4rc1e to gather that data in a Google Sheet

davelab6 commented 2 years ago

Ha, just after posting this, I caught up on google chatrooms, and Nate said he already wrote a patch that excludes unavailable glyphs, which will be applied the next time such a family is pushed :)

Also @chrissimpkins noted the math subset was added to Nabla (here) - but it turns out that "per mille sign" glyph is not from the math subset..... its actually in Adlam!

So that seems weird and makes a detailed analysis and investigation of the subsets more important, although not urgent.

rsheeter commented 2 years ago

Please don't aggressively add subsets that are barely supported, it'll confuse all sorts of things.

rsheeter commented 2 years ago

And +1 for the FE to fix. Long term solution is to move away from subsets as they exist today entirely.

nathan-williams commented 2 years ago

Is there a list of affected families? They will have to be version bumped once the fix is rolled out.

googlefonts / gftools

[gen-metadata] More aggressively declare subsets in METADATA files #602