stenskjaer / samewords

Automatically annotate potentially ambiguous words in critical text editions made with LaTeX and reledmac.
MIT License

Some Unicode blocks not supported #25

Closed floriandk closed 6 years ago

floriandk commented 6 years ago

While trying some real-life examples with branch issue-24, I noticed that some Unicode characters break the conversion: it prints "Starting conversion." and never finishes.

It is a bit curious which characters are affected. It seems to go by Unicode block rather than by frequency (e.g. typographical quotes and € don't work, but runes do) or by function.

\documentclass[a5paper]{scrartcl}

\usepackage{fontspec}
\setmainfont{Arial}

\usepackage[series={A},noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
Basic Latin: o 

Lat1: ô 

ExtA: ő

ExtB: ǫ

IPA: ɷ %all working

Spacing Mod:% ˚ %breaks

Comb Diacritics:%  oͦ %breaks

Greek: ώ %compiles

Cyrillic: ѻ %compiles

Cyr Supp: Ԛ %compiles

Runic: ᚮ  %compiles

IPA Ext: ᴏ ᴼ  %compiles

IPA Ext Suppl: ᶱ  %compiles

Comb Dia Suppl: % ᷕ %breaks 

Lat Ext Add: ṓ %compiles

Greek Ext: Ὧ %compiles

General Punct: %“⁐ %breaks!

Superscripts/Subscripts: ₒ %compiles

Currency: %€₰ %breaks

[...]

Lat Ext C: ⱬ %compiles

[...]

Supplemental Punct: %⸀ %breaks

[...]

Lat Ext D: ꝏ %compiles

[...]

PUA: % %breaks
\pend
\endnumbering

\end{document}

stenskjaer commented 6 years ago

Okay. This is interesting. As it is now, it seems to work with everything that I would consider more or less exotic word characters. What it cannot handle are the non-word blocks.

I don't know whether there would be any obvious problems in simply including all character ranges and letting it handle them?

stenskjaer commented 6 years ago

Okay, this was fun. We should now have full Unicode support. I have pushed updates to "branch-24", so when you get the time, maybe you could have a look?

As it stands, all Unicode code points that fall outside the Python re module's definition of \w will not be processed at the same speed as the characters that \w matches. That is actually quite a lot. I compiled this list of blocks:

['Adlam',
 'Aegean Numbers',
 'Ahom',
 'Alchemical Symbols',
 'Anatolian Hieroglyphs',
 'Ancient Greek Musical Notation',
 'Ancient Greek Numbers',
 'Ancient Symbols',
 'Arabic Extended-A',
 'Arabic Mathematical Alphabetic Symbols',
 'Arabic Presentation Forms-A',
 'Arabic Presentation Forms-B',
 'Armenian',
 'Arrows',
 'Avestan',
 'Balinese',
 'Bamum',
 'Bamum Supplement',
 'Basic Latin',
 'Bassa Vah',
 'Batak',
 'Bengali',
 'Bhaiksuki',
 'Block Elements',
 'Bopomofo',
 'Bopomofo Extended',
 'Box Drawing',
 'Brahmi',
 'Braille Patterns',
 'Buginese',
 'Buhid',
 'Byzantine Musical Symbols',
 'CJK Compatibility',
 'CJK Compatibility Forms',
 'CJK Compatibility Ideographs',
 'CJK Radicals Supplement',
 'CJK Strokes',
 'CJK Symbols and Punctuation',
 'CJK Unified Ideographs',
 'CJK Unified Ideographs Extension A',
 'Carian',
 'Caucasian Albanian',
 'Chakma',
 'Cham',
 'Cherokee',
 'Combining Diacritical Marks',
 'Combining Diacritical Marks Extended',
 'Combining Diacritical Marks Supplement',
 'Combining Diacritical Marks for Symbols',
 'Combining Half Marks',
 'Common Indic Number Forms',
 'Control Pictures',
 'Coptic',
 'Coptic Epact Numbers',
 'Counting Rod Numerals',
 'Cuneiform',
 'Cuneiform Numbers and Punctuation',
 'Currency Symbols',
 'Cyrillic Extended-A',
 'Cyrillic Extended-B',
 'Cyrillic Extended-C',
 'Devanagari Extended',
 'Dingbats',
 'Domino Tiles',
 'Duployan',
 'Early Dynastic Cuneiform',
 'Egyptian Hieroglyphs',
 'Elbasan',
 'Emoticons',
 'Enclosed Alphanumeric Supplement',
 'Enclosed CJK Letters and Months',
 'Enclosed Ideographic Supplement',
 'Ethiopic',
 'Ethiopic Extended',
 'Ethiopic Extended-A',
 'Ethiopic Supplement',
 'General Punctuation',
 'Geometric Shapes',
 'Geometric Shapes Extended',
 'Georgian Supplement',
 'Glagolitic',
 'Glagolitic Supplement',
 'Gothic',
 'Grantha',
 'Greek Extended',
 'Gujarati',
 'Gurmukhi',
 'Halfwidth and Fullwidth Forms',
 'Hangul Compatibility Jamo',
 'Hangul Jamo Extended-A',
 'Hangul Jamo Extended-B',
 'Hangul Syllables',
 'Hanunoo',
 'Hebrew',
 'High Private Use Surrogates',
 'High Surrogates',
 'Ideographic Description Characters',
 'Ideographic Symbols and Punctuation',
 'Javanese',
 'Kaithi',
 'Kana Extended-A',
 'Kana Supplement',
 'Kanbun',
 'Kangxi Radicals',
 'Kannada',
 'Kayah Li',
 'Kharoshthi',
 'Khmer',
 'Khmer Symbols',
 'Khojki',
 'Khudawadi',
 'Lao',
 'Latin Extended-E',
 'Letterlike Symbols',
 'Linear A',
 'Linear B Ideograms',
 'Linear B Syllabary',
 'Lisu',
 'Low Surrogates',
 'Lycian',
 'Lydian',
 'Mahajani',
 'Mahjong Tiles',
 'Mandaic',
 'Manichaean',
 'Marchen',
 'Masaram Gondi',
 'Mathematical Operators',
 'Meetei Mayek',
 'Meetei Mayek Extensions',
 'Mende Kikakui',
 'Miscellaneous Mathematical Symbols-A',
 'Miscellaneous Mathematical Symbols-B',
 'Miscellaneous Symbols',
 'Miscellaneous Symbols and Arrows',
 'Miscellaneous Symbols and Pictographs',
 'Miscellaneous Technical',
 'Modi',
 'Mongolian',
 'Mongolian Supplement',
 'Mro',
 'Multani',
 'Musical Symbols',
 'Myanmar',
 'Myanmar Extended-B',
 'NKo',
 'New Tai Lue',
 'Newa',
 'Number Forms',
 'Nushu',
 'Ogham',
 'Ol Chiki',
 'Old Italic',
 'Old Permic',
 'Old Persian',
 'Old South Arabian',
 'Old Turkic',
 'Optical Character Recognition',
 'Oriya',
 'Ornamental Dingbats',
 'Osage',
 'Osmanya',
 'Pau Cin Hau',
 'Phags-pa',
 'Phaistos Disc',
 'Phoenician',
 'Playing Cards',
 'Private Use Area',
 'Rejang',
 'Rumi Numeral Symbols',
 'Runic',
 'Samaritan',
 'Saurashtra',
 'Sharada',
 'Shorthand Format Controls',
 'Siddham',
 'Sinhala',
 'Sinhala Archaic Numbers',
 'Small Form Variants',
 'Sora Sompeng',
 'Soyombo',
 'Spacing Modifier Letters',
 'Specials',
 'Sundanese Supplement',
 'Superscripts and Subscripts',
 'Supplemental Arrows-A',
 'Supplemental Arrows-B',
 'Supplemental Arrows-C',
 'Supplemental Mathematical Operators',
 'Supplemental Punctuation',
 'Sutton SignWriting',
 'Syloti Nagri',
 'Syriac Supplement',
 'Tagalog',
 'Tagbanwa',
 'Tai Le',
 'Tai Tham',
 'Tai Viet',
 'Tai Xuan Jing Symbols',
 'Takri',
 'Tamil',
 'Tangut',
 'Tangut Components',
 'Telugu',
 'Thaana',
 'Thai',
 'Tibetan',
 'Tifinagh',
 'Tirhuta',
 'Transport and Map Symbols',
 'Ugaritic',
 'Unified Canadian Aboriginal Syllabics Extended',
 'Vai',
 'Variation Selectors',
 'Vedic Extensions',
 'Vertical Forms',
 'Yi Radicals',
 'Yi Syllables',
 'Yijing Hexagram Symbols',
 'Zanabazar Square']

This may be a problem when someone wants to make an edition with text in one of these blocks, such as Devanagari, Runic, Old Persian or Linear B, to name just a few.

Maybe I should consider adding a configuration option to indicate the language(s). That way it would be easy to switch when necessary without overmatching 90% of the time.
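
The pattern in the test file above appears consistent with how Python's re module defines \w for str patterns: letters and digits match, while punctuation, symbols and combining marks do not. A minimal sketch for checking individual characters with the standard library only (the helper name `is_word_char` is made up for illustration and is not taken from samewords):

```python
import re
import unicodedata

# \w on str patterns is Unicode-aware by default in Python 3.
WORD_CHAR = re.compile(r"\w")


def is_word_char(char: str) -> bool:
    """Return True if a single character matches \\w in Python's re module."""
    return WORD_CHAR.fullmatch(char) is not None


if __name__ == "__main__":
    # A few characters from the test file, written as escapes for clarity.
    samples = [
        "o",       # Basic Latin
        "\u00f4",  # Latin-1 Supplement
        "\u16ae",  # Runic
        "\u02da",  # Spacing Modifier Letters (ring above)
        "\u0366",  # Combining Diacritical Marks (combining small o)
        "\u201c",  # General Punctuation (left double quotation mark)
        "\u20ac",  # Currency Symbols (euro sign)
        "\u2e00",  # Supplemental Punctuation
    ]
    for char in samples:
        code = f"U+{ord(char):04X}"
        print(code, unicodedata.category(char), is_word_char(char))
```

Printing the general category next to the result makes it easier to see why a given character is or is not treated as a word character.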

floriandk commented 6 years ago

The conversion no longer hangs, and the matching is correct here 👍

That leaves the problem of how to find out whether a character should be regarded as a word character or a word boundary without having to check all 109,242 Unicode characters manually:

\documentclass[a5paper]{scrartcl}

\usepackage{fontspec}
\setmainfont{Junicode}

\usepackage[series={A},noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
o “o ⸀o o. o 
\edtext{o}{%
    \Afootnote{test}}
\pend
\endnumbering

\end{document}

->

\documentclass[a5paper]{scrartcl}

\usepackage{fontspec}
\setmainfont{Junicode}

\usepackage[series={A},noledgroup,draft]{reledmac}

\begin{document}

\beginnumbering
\pstart
\sameword{o} “o ⸀o \sameword{o}. \sameword{o} 
\edtext{\sameword[1]{o}}{%
    \Afootnote{test}}
\pend
\endnumbering

\end{document}

The information should be available in the Unicode database, but I haven't yet seen a quick way to extract it. (It just occurred to me that there are even cases where the same character can be either a word character or a word boundary depending on context, like the suspension dot in manuscripts. But as this is notoriously difficult even when editing, it's probably best to let this rest for now and focus on the standard usage of characters…)

(That "Basic Latin" appears in the list above is probably just a mix-up and not something you use in the actual code, is it?)
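
One possible way to read this information from the Unicode database without checking characters by hand is the general category, which Python exposes through the standard unicodedata module. The sketch below is only an illustration: the function name `classify` and the mapping from categories to "word" and "boundary" are assumptions, not how samewords decides, and it does not address context-dependent cases such as the suspension dot.

```python
import unicodedata


def classify(char: str) -> str:
    """Roughly classify a character as 'word', 'boundary' or 'other'.

    Uses only the first letter of the Unicode general category
    (e.g. 'Ll' -> 'L', 'Po' -> 'P').
    """
    major = unicodedata.category(char)[0]
    if major in ("L", "M", "N"):
        return "word"       # letters, combining marks, numbers
    if major in ("P", "Z"):
        return "boundary"   # punctuation and separators
    return "other"          # symbols, controls, private use, unassigned


if __name__ == "__main__":
    for char in ["o", "\u201c", "\u2e00", "\u20ac", "\ue000"]:
        print(f"U+{ord(char):04X}", unicodedata.category(char), classify(char))
```

Which Unicode version the answers reflect depends on the Python build; `unicodedata.unidata_version` reports it.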

stenskjaer commented 6 years ago

In the example you give: as you see it, should the composed character be interpreted as identical to "o"? That doesn't seem to be the case to me. I'm almost afraid to ask, but can we think of cases?

Of course another problem is the curious case of two different code points with the same graphical representation (there are some examples, and I'm sure you can remind me of them). Those will not match although they look identical to the reader.

For those, I imagine a list of such cases could be compiled and taken into consideration.

About "Basic Latin", that confused me too. I don't really know what it's doing there, so I can't rule out that the way I checked for matches in the different blocks was wrong. But anyway.

EDIT: By the way, I think I'll close this for now. I have made a new issue for enabling configuration of the language.

floriandk commented 6 years ago

I should have been more detailed about my examples: ⸀ (2E00 - RIGHT ANGLE SUBSTITUTION MARKER) is a punctuation mark, as is “ (201C - LEFT DOUBLE QUOTATION MARK). So each should stand right beside the o, but in both cases I'd expect the o to be matched: “\sameword{o} ⸀\sameword{o}. These two examples were meant to show that there is a problem not only with some of the very frequent typographical quotation marks that reside in Unicode's "General Punctuation", but also with some more arcane marks that nonetheless appear in critical editions.

Unfortunately, not all possible punctuation characters are neatly collected in separate ranges. For the Western Medieval Latin tradition there has been an attempt to compile a comprehensive list, cf. MUFI, especially the Character Recommendation, chapter 5 (p. 138ff). But Biblical criticism already has other or additional marks, not to mention Eastern traditions. Would it be feasible to add another parameter "punctuation" to the json-file, where users could add everything that isn't part of the main Unicode punctuation ranges but should be treated as punctuation anyway?
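
A rough sketch of what such a user-supplied "punctuation" parameter could look like in practice. Everything here is hypothetical: the layout of the JSON text, the helper names, and the choice of "|" as an extra mark are assumptions made for illustration, not the actual samewords configuration.

```python
import json
import unicodedata


def load_extra_punctuation(config_text: str) -> set:
    """Read the hypothetical "punctuation" list from a JSON config string."""
    config = json.loads(config_text)
    return set(config.get("punctuation", []))


def is_punctuation(char: str, extra: set) -> bool:
    """True for Unicode punctuation (category P*) or user-listed characters."""
    return char in extra or unicodedata.category(char).startswith("P")


def strip_punctuation(token: str, extra: set) -> str:
    """Remove leading and trailing punctuation so the bare word can be compared."""
    start, end = 0, len(token)
    while start < end and is_punctuation(token[start], extra):
        start += 1
    while end > start and is_punctuation(token[end - 1], extra):
        end -= 1
    return token[start:end]


if __name__ == "__main__":
    # "|" is a math symbol in Unicode, so it only counts as punctuation
    # here because the (hypothetical) config lists it.
    extra = load_extra_punctuation('{"punctuation": ["|"]}')
    for token in ["\u201co", "\u2e00o", "o.", "|o"]:
        print(repr(token), "->", repr(strip_punctuation(token, extra)))
```

The point of the design is that the Unicode categories cover the common cases, while the user list only needs to hold the marks a particular edition uses in a non-standard way.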

I'm almost afraid to ask, but can we think of cases?

Are you thinking of the "canonically equivalent" characters, e.g. that 006F (o) followed by 0308 (̈) is to be considered the same as 00F6 (ö)? As far as I know, this is still not equally well implemented across different systems and software, so it may be very complex to support anything that Python itself doesn't support. The ö example and some others I tried work with your script as is. As long as these basic things work, I wouldn't worry more about it for now.

Of course another problem is the curious case of two different code points with the same graphical representation (there are some examples, and I'm sure you can remind me of them). Those will not match although they look identical to the reader.

I'd definitely suggest not going into this. For one, it is a mess: there are deprecated characters, whole character ranges that have been split up (most prominently Coptic, which used to be a subset of Greek but is completely on its own now) and, as you mention, characters that look the same but have different code points (you could start with the many Cyrillic letters that look precisely like Latin ones and go on to get completely lost in the mathematical ranges…). Mixing different code points to print something you consider to be the same letter is bad practice anyway, so I wouldn't go further than perhaps mentioning in the readme that the script expects character code points to be consistent.
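
For editors who want to check their own files for this kind of inconsistency, a rough heuristic is to look at the first word of each letter's Unicode name. This is only a sketch under that assumption: the function name `scripts_in` is invented here, the standard library has no proper script property, and the name prefix is not reliable for every block.

```python
import unicodedata


def scripts_in(word: str) -> set:
    """Collect the first word of each letter's Unicode name as a rough script label."""
    labels = set()
    for char in word:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                labels.add(name.split(" ")[0])
    return labels


if __name__ == "__main__":
    latin = "cot"
    mixed = "c\u043et"  # the middle letter is CYRILLIC SMALL LETTER O
    print(latin == mixed)      # the strings differ although they look alike
    print(scripts_in(latin))
    print(scripts_in(mixed))
```

A word that reports more than one label is a candidate for manual review; the script itself can then keep expecting consistent code points.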

stenskjaer commented 6 years ago

Okay, so this is very useful. I am improving the punctuation tokenization now to include more characters. I will describe that in #24 when I'm done.

About the combining characters (which I mistook the punctuation for): all combining sequences such as the suggested 006F (o) and 0308 (̈) = 00F6 (ö) are normalized to the single precomposed character (00F6 here) before processing, so that shouldn't be a problem.

I am adding a note on this in the readme, as it is a change to the file that can be very hard to notice.
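
For readers who want to see what this kind of normalization does, here is a small sketch using the standard unicodedata module. That the form in question is NFC is an assumption based on the description above, and the function name `compose` is invented for the example.

```python
import unicodedata


def compose(text: str) -> str:
    """Normalize text to NFC, combining base letters and marks where possible."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    decomposed = "o\u0308"  # o + COMBINING DIAERESIS, two code points
    composed = compose(decomposed)
    print(len(decomposed), len(composed))
    print(composed == "\u00f6")

    # A sequence without a precomposed equivalent stays as two code points.
    no_precomposed = "o\u0366"  # o + COMBINING LATIN SMALL LETTER O
    print(len(compose(no_precomposed)))
```

Both spellings render identically, which is why the change is hard to spot in the file: only the underlying code points differ.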