rosettatype / hyperglot

Hyperglot: a database and tools for detecting language support in fonts
http://hyperglot.rosettatype.com
GNU General Public License v3.0

Merge with Shaperglot #152

Open yanone opened 8 months ago

yanone commented 8 months ago

David (rather, the @rosetta account) pointed out to me on Mastodon that a combination or merge of Shaperglot with Hyperglot has already been envisioned, without stating any further details. I want to elaborate on that.

We’ve discussed this internally at Google Fonts and are indeed open to and interested in the idea of contributing the shaping analysis part of Shaperglot to Hyperglot and discarding our own nascent language definitions database in the process.

We believe this would benefit both sides: Google Fonts would gain access to Hyperglot’s excellent database, and Hyperglot users would gain access to Google Fonts’ extensive knowledge of font QA, specifically with regard to shaping.

There are a couple of concerns, tho.

First up, the GPLv3 license of Hyperglot may prevent its application within Google or even other corporations. For instance, Fontbakery was relicensed to Apache so that Microsoft can adopt it. I’m going to forward this issue internally after posting it to invite other voices to join the conversation.

Secondly, we are wondering how contributions and collaboration would be handled in the future. This wouldn’t be a one-off contribution of code. Shaperglot is in active development and increasingly used in production of large font families at Google, so ideally, some of us would be accepted as collaborators on your repository.

Thirdly, there is the question of how to integrate Shaperglot in practical terms. I haven't delved into how Hyperglot works on the code level, but I see several approaches:

  1. Shaperglot’s code gets fully integrated into Hyperglot commands, essentially dissolving into Hyperglot.
  2. Shaperglot becomes part of the Hyperglot package but remains its own code, possibly accessible through a separate CLI command.
  3. If neither of the above two options is satisfactory, or an active collaboration on your repository is out of the question, the third option would be to keep Shaperglot as an entirely separate package under Google Fonts as it is now, but rewire it to use the Hyperglot database, included simply as a Python package dependency. This would still deliver the benefits for both Google Fonts and Hyperglot users, provided that Hyperglot users learn about the existence of a separate Shaperglot.
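
To make the third option concrete, here is a minimal sketch of how a separate checker could consume a language database as plain data. Everything in it is an assumption for illustration: `LANGUAGE_DATA` is a hand-written stand-in and not Hyperglot's actual API or schema, the `sequences` field is hypothetical, and `sequence_ok` is a hypothetical hook where a shaping check would plug in.

```python
from typing import Callable, Dict, List, Set

# Hand-written stand-in for data that a database package would provide.
# The character lists are deliberately tiny and are NOT real orthographies.
LANGUAGE_DATA: Dict[str, dict] = {
    "nld": {"base": "a e i j", "sequences": ["\u00cdJ\u0301"]},
    "tur": {"base": "a e i \u0131", "sequences": []},
}


def required_codepoints(entry: dict) -> Set[int]:
    """Collect the codepoints of a space-separated character list."""
    return {ord(ch) for item in entry["base"].split() for ch in item}


def supported_languages(
    database: Dict[str, dict],
    font_codepoints: Set[int],
    sequence_ok: Callable[[str], bool],
) -> List[str]:
    """Return language codes whose characters and sequences all pass."""
    supported = []
    for code in sorted(database):
        entry = database[code]
        # Step 1: plain codepoint coverage, taken from the database alone.
        if not required_codepoints(entry) <= font_codepoints:
            continue
        # Step 2: every listed sequence must also pass the shaping hook.
        if all(sequence_ok(seq) for seq in entry.get("sequences", [])):
            supported.append(code)
    return supported
```

The point of the split is that the database stays free of shaping logic: the checker only needs the data and a callable that says whether a sequence shapes correctly in a given font.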

One detail to point out is that, in addition to code changes, we would also ask for changes to the language database to add character sequences such as {ÍJ́} for Dutch (see here: https://github.com/googlefonts/lang/blob/aa5047ae80a1d216e5e0a3d1c68bde9d0fb3abf7/Lib/gflanguages/data/languages/nl_Latn.textproto#L17). These sequences are then checked for not containing .notdef and not having unattached marks, instead of merely checking that their codepoints exist in the font. The recently defined (at Google Fonts) Sub-Saharan African languages contain a lot of those sequences. (Note that we’re only talking about the encodings here and not about the sample texts and other definitions, which are Google-specific and will remain in that package.)
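
As a rough illustration of the kind of sequence check meant here, the sketch below works on already-shaped output. The `ShapedGlyph` structure is hypothetical (it is not a Shaperglot or HarfBuzz type), and treating a mark with zero positioning offsets as unattached is a simplifying assumption, not necessarily how Shaperglot decides this.

```python
import unicodedata
from dataclasses import dataclass
from typing import List


@dataclass
class ShapedGlyph:
    """One glyph of shaper output (hypothetical structure for this sketch)."""
    name: str          # glyph name; ".notdef" when the font has no glyph
    is_mark: bool      # True if the font classes this glyph as a mark
    x_offset: int = 0  # positioning offsets applied by the shaper
    y_offset: int = 0


def has_combining_marks(sequence: str) -> bool:
    """True if any character has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in sequence)


def check_sequence(glyphs: List[ShapedGlyph]) -> List[str]:
    """Report .notdef glyphs and marks that received no positioning."""
    problems = []
    for index, glyph in enumerate(glyphs):
        if glyph.name == ".notdef":
            problems.append(f"glyph {index}: .notdef")
        elif glyph.is_mark and glyph.x_offset == 0 and glyph.y_offset == 0:
            # Assumption: an attached mark is moved by mark positioning.
            problems.append(f"glyph {index}: mark '{glyph.name}' looks unattached")
    return problems
```

The Dutch sequence is I-acute, J and a combining acute (U+00CD, U+004A, U+0301), so `has_combining_marks("\u00cdJ\u0301")` is true: a font can contain all three codepoints and still place the acute incorrectly over the J. The zero-offset heuristic can misfire for fonts whose marks are drawn to sit correctly without any positioning, so a real check would need to be more careful.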

Additionally, Shaperglot has manually defined shaping definitions that would also have to make it into the new tool in some form, see here: https://github.com/googlefonts/shaperglot/pull/38/files#diff-b2c6c758235e37f1644286514ae83ca8bcb37b35646309b253b3c6c99d2129e3 (unmerged PR). Please note that the majority of currently defined shaping definition files in Shaperglot, which are generated by code (as noted in their first line), would disappear in favour of running those same checks live. This is one field of active development at Shaperglot and would not make it into Hyperglot in its current form. Only some manually defined definitions (like the ones for Dutch and Turkish shown in the linked PR, and possibly more in the future) need to be explicitly defined. These could become part of Hyperglot's existing yaml files, or be hosted in separate files.

MrBrezina commented 8 months ago

Hi Jan,

thank you for reaching out. This is exciting news! And of course, we are interested in collaboration as well.

My comment on Mastodon was based on an email I got from @davelab6 a while back. Checking shaping has been on our roadmap and shaperglot seems to fit right in. Since we did not see much development in the shaperglot repo, we thought we should try to tackle some low-hanging fruit, such as checking for joining behaviour in Arabic (of course, I agree with your initial comment on Mastodon that this should be a requirement) and checking for existing mark positioning. Basically, to see whether and how it would fit the detection workflow. @kontur will have merged these changes to the master branch by now, so you can inspect what we have done and whether you would have done it differently.
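
One way to picture a joining check, as a sketch only and not a description of what Hyperglot's CLI actually does: shape a dual-joining letter alone and then three times in a row, and require that the middle glyph differs from the isolated one. The `shape` callable is a hypothetical stand-in for a real shaper call that returns glyph names.

```python
from typing import Callable, List

# Hypothetical shaper interface: text in, list of glyph names out.
Shaper = Callable[[str], List[str]]


def has_medial_form(shape: Shaper, letter: str) -> bool:
    """Check that a dual-joining letter changes shape between neighbours.

    Only meaningful for dual-joining letters such as Arabic beh (U+0628).
    The same letter is repeated so that the middle glyph is at index 1
    regardless of whether the shaper returns logical or visual order.
    """
    isolated = shape(letter)
    joined = shape(letter * 3)
    if len(isolated) != 1 or len(joined) != 3:
        # Unexpected glyph count (e.g. a ligature); this sketch cannot
        # judge such output and reports it as failing.
        return False
    return joined[1] != isolated[0]
```

With a font that lacks joining forms the shaper returns the same glyph in both runs and the check fails; a font with proper `init`, `medi` and `fina` substitutions passes.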

These checks are only part of the CLI. Including them in the web app requires access to the font file, which is problematic in terms of users’ font licences.

I have also updated the simple roadmap in the Hyperglot README and I will keep on expanding on it in the issues.

To your concerns:

  1. I spoke with Dave on Friday and he said GPL was a non-issue, so I am none the wiser now. :) I think at some point there was a worry about the data being sourced from Wikipedia and Omniglot, but I am pretty sure this is a non-issue as well. These are sets of characters and orthographies. They have to be in the public domain.

As a side note, I would very much like to improve the sources where only Wikipedia and Omniglot are mentioned, by commissioning linguists to add more authoritative sources.

  2. If some of you want to become contributors, you are most welcome. Just let us know whom and we will make it happen.

  3. We do not have a firm opinion about this. It would probably depend on whether it will need to be used in different ways in Hyperglot and in Google Fonts QA. Hyperglot probably tends to use more soft requirements to minimize false rejections.

We would be happy to have a call about any of this. The next two weeks are a bit busy for me, but I will try to find the time.
