icaruseu / mom-ca

Monasterium.net (http://www.monasterium.net/mom) - repository and collaborative archive
https://github.com/icaruseu/mom-ca/wiki
GNU General Public License v3.0

Use Unicode characters for combining diacritics #1054

Open brawer opened 2 years ago

brawer commented 2 years ago

In many documents, such as in this example, the Unicode character U+0364 Combining Latin Small Letter E is emulated by a superscript <sup>e</sup>. The result looks different from the original documents: the small e should be placed above the base letter, not after it. The emulation also does not match what Unicode intends. The superscript workaround certainly made sense until March 2002, when Unicode version 3.2 introduced combining characters for medieval texts. But that was more than 20 years ago, and computer systems now have no problem displaying strings like “hoͤrt” or “zwoͤlften”.

Would it perhaps make sense to fix this data problem systematically, across all documents, in a global edit operation? Likewise for other combining diacritics, for example U+036E Combining Latin Small Letter V. (If a global edit is considered too risky, would it perhaps be possible to tell the data suppliers about the existence of combining diacritics in Unicode?)
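To make the proposed global edit concrete, here is a minimal Python sketch of the substitution. It assumes the markup appears literally as `<sup>e</sup>` directly after a base letter, as described above; the names `COMBINING`, `SUP_AFTER_LETTER` and `sup_to_combining` are hypothetical and not part of mom-ca. The mapping covers the thirteen combining Latin small letters U+0363 to U+036F.

```python
import re

# Combining Latin small letters added in Unicode 3.2 (U+0363..U+036F).
COMBINING = {
    "a": "\u0363", "e": "\u0364", "i": "\u0365", "o": "\u0366",
    "u": "\u0367", "c": "\u0368", "d": "\u0369", "h": "\u036A",
    "m": "\u036B", "r": "\u036C", "t": "\u036D", "v": "\u036E",
    "x": "\u036F",
}

# Assumption: a diacritic is a single supported letter inside <sup>…</sup>
# that directly follows a letter. [^\W\d_] matches any Unicode letter.
SUP_AFTER_LETTER = re.compile(r"([^\W\d_])<sup>([aeioucdhmrtvx])</sup>")


def sup_to_combining(text: str) -> str:
    """Replace letter + <sup>x</sup> with letter + combining character."""
    return SUP_AFTER_LETTER.sub(
        lambda m: m.group(1) + COMBINING[m.group(2)], text
    )
```

A pattern like this cannot tell a superscript diacritic from a lettered footnote marker that happens to follow a word, so in practice it would probably be run first as a dry run that only lists candidate matches for review.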

GVogeler commented 2 years ago

Yes, of course, the XML solution is often a legacy workaround. However, there are several things to consider:

  1. we don't know if <sup>e</sup> is identical to U+0364 in every case (probably it is in most cases, but it could be, for instance, an artefact from OCR conversion of footnote references to XML)
  2. the search index currently accepts a search for hoerrent. Entering ho&#x0364;rrent is harder for the user. On the other hand, collation can map ho&#x0364;rrent to horrent and hörrent, but not to hoerrent. Fuzzy search (e.g. hörent~0.8) might return similar results.
  3. The data responsibility sits with the archive or researcher providing the data and with the moderators checking update suggestions.
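The search concern in point 2 can be illustrated with a short Python sketch. It is not how the mom-ca index works; the function names are hypothetical. Stripping combining marks after NFD normalization behaves like the collation described above: it reduces both the U+0364 form and hörrent to horrent. Reaching hoerrent needs an extra, explicit folding step that spells the combining letter out.

```python
import unicodedata


def strip_marks(word: str) -> str:
    """Decompose (NFD) and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical folding table: U+0363..U+036F -> the plain letter they depict.
COMBINING_TO_LETTER = {
    0x0363 + i: letter for i, letter in enumerate("aeioucdhmrtvx")
}


def spell_out(word: str) -> str:
    """Turn combining Latin small letters into ordinary inline letters."""
    return unicodedata.normalize("NFD", word).translate(COMBINING_TO_LETTER)


def search_keys(word: str) -> set[str]:
    """Index keys under which a word could be found (illustrative only)."""
    return {strip_marks(word), strip_marks(spell_out(word))}
```

Indexing each word under both keys would let a search for horrent or hoerrent find a text encoded with U+0364, at the cost of a larger index.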

We would currently suggest getting in contact with the data provider and asking whether we, as the technical team, should make this change, since the Unicode encoding is of course much more up to date than the current solution. Any other ideas or comments?