stenskjaer / samewords

Automatically annotate potentially ambiguous words in critical text editions made with LaTeX and reledmac.
MIT License

Config option to indicate languages not matching the `\w` character class #28

Status: Closed (stenskjaer closed 6 years ago)

stenskjaer commented 6 years ago

Currently the assumption is that the text consists of characters matching the `\w` character class.

If a user has an edition with text outside that class, processing will be significantly slower. But we can't just widen the fast matching (currently basically `\w+`) to cover all possible Unicode blocks, because that would make matching the few exceptional cases (\\{} and punctuation) much more demanding.

So it might be a good idea to make it configurable, so that a user can enable one or more of the other languages. The full list of Unicode blocks that are not covered is:

['Adlam',
 'Aegean Numbers',
 'Ahom',
 'Alchemical Symbols',
 'Anatolian Hieroglyphs',
 'Ancient Greek Musical Notation',
 'Ancient Greek Numbers',
 'Ancient Symbols',
 'Arabic Extended-A',
 'Arabic Mathematical Alphabetic Symbols',
 'Arabic Presentation Forms-A',
 'Arabic Presentation Forms-B',
 'Armenian',
 'Arrows',
 'Avestan',
 'Balinese',
 'Bamum',
 'Bamum Supplement',
 'Basic Latin',
 'Bassa Vah',
 'Batak',
 'Bengali',
 'Bhaiksuki',
 'Block Elements',
 'Bopomofo',
 'Bopomofo Extended',
 'Box Drawing',
 'Brahmi',
 'Braille Patterns',
 'Buginese',
 'Buhid',
 'Byzantine Musical Symbols',
 'CJK Compatibility',
 'CJK Compatibility Forms',
 'CJK Compatibility Ideographs',
 'CJK Radicals Supplement',
 'CJK Strokes',
 'CJK Symbols and Punctuation',
 'CJK Unified Ideographs',
 'CJK Unified Ideographs Extension A',
 'Carian',
 'Caucasian Albanian',
 'Chakma',
 'Cham',
 'Cherokee',
 'Combining Diacritical Marks',
 'Combining Diacritical Marks Extended',
 'Combining Diacritical Marks Supplement',
 'Combining Diacritical Marks for Symbols',
 'Combining Half Marks',
 'Common Indic Number Forms',
 'Control Pictures',
 'Coptic',
 'Coptic Epact Numbers',
 'Counting Rod Numerals',
 'Cuneiform',
 'Cuneiform Numbers and Punctuation',
 'Currency Symbols',
 'Cyrillic Extended-A',
 'Cyrillic Extended-B',
 'Cyrillic Extended-C',
 'Devanagari Extended',
 'Dingbats',
 'Domino Tiles',
 'Duployan',
 'Early Dynastic Cuneiform',
 'Egyptian Hieroglyphs',
 'Elbasan',
 'Emoticons',
 'Enclosed Alphanumeric Supplement',
 'Enclosed CJK Letters and Months',
 'Enclosed Ideographic Supplement',
 'Ethiopic',
 'Ethiopic Extended',
 'Ethiopic Extended-A',
 'Ethiopic Supplement',
 'General Punctuation',
 'Geometric Shapes',
 'Geometric Shapes Extended',
 'Georgian Supplement',
 'Glagolitic',
 'Glagolitic Supplement',
 'Gothic',
 'Grantha',
 'Greek Extended',
 'Gujarati',
 'Gurmukhi',
 'Halfwidth and Fullwidth Forms',
 'Hangul Compatibility Jamo',
 'Hangul Jamo Extended-A',
 'Hangul Jamo Extended-B',
 'Hangul Syllables',
 'Hanunoo',
 'Hebrew',
 'High Private Use Surrogates',
 'High Surrogates',
 'Ideographic Description Characters',
 'Ideographic Symbols and Punctuation',
 'Javanese',
 'Kaithi',
 'Kana Extended-A',
 'Kana Supplement',
 'Kanbun',
 'Kangxi Radicals',
 'Kannada',
 'Kayah Li',
 'Kharoshthi',
 'Khmer',
 'Khmer Symbols',
 'Khojki',
 'Khudawadi',
 'Lao',
 'Latin Extended-E',
 'Letterlike Symbols',
 'Linear A',
 'Linear B Ideograms',
 'Linear B Syllabary',
 'Lisu',
 'Low Surrogates',
 'Lycian',
 'Lydian',
 'Mahajani',
 'Mahjong Tiles',
 'Mandaic',
 'Manichaean',
 'Marchen',
 'Masaram Gondi',
 'Mathematical Operators',
 'Meetei Mayek',
 'Meetei Mayek Extensions',
 'Mende Kikakui',
 'Miscellaneous Mathematical Symbols-A',
 'Miscellaneous Mathematical Symbols-B',
 'Miscellaneous Symbols',
 'Miscellaneous Symbols and Arrows',
 'Miscellaneous Symbols and Pictographs',
 'Miscellaneous Technical',
 'Modi',
 'Mongolian',
 'Mongolian Supplement',
 'Mro',
 'Multani',
 'Musical Symbols',
 'Myanmar',
 'Myanmar Extended-B',
 'NKo',
 'New Tai Lue',
 'Newa',
 'Number Forms',
 'Nushu',
 'Ogham',
 'Ol Chiki',
 'Old Italic',
 'Old Permic',
 'Old Persian',
 'Old South Arabian',
 'Old Turkic',
 'Optical Character Recognition',
 'Oriya',
 'Ornamental Dingbats',
 'Osage',
 'Osmanya',
 'Pau Cin Hau',
 'Phags-pa',
 'Phaistos Disc',
 'Phoenician',
 'Playing Cards',
 'Private Use Area',
 'Rejang',
 'Rumi Numeral Symbols',
 'Runic',
 'Samaritan',
 'Saurashtra',
 'Sharada',
 'Shorthand Format Controls',
 'Siddham',
 'Sinhala',
 'Sinhala Archaic Numbers',
 'Small Form Variants',
 'Sora Sompeng',
 'Soyombo',
 'Spacing Modifier Letters',
 'Specials',
 'Sundanese Supplement',
 'Superscripts and Subscripts',
 'Supplemental Arrows-A',
 'Supplemental Arrows-B',
 'Supplemental Arrows-C',
 'Supplemental Mathematical Operators',
 'Supplemental Punctuation',
 'Sutton SignWriting',
 'Syloti Nagri',
 'Syriac Supplement',
 'Tagalog',
 'Tagbanwa',
 'Tai Le',
 'Tai Tham',
 'Tai Viet',
 'Tai Xuan Jing Symbols',
 'Takri',
 'Tamil',
 'Tangut',
 'Tangut Components',
 'Telugu',
 'Thaana',
 'Thai',
 'Tibetan',
 'Tifinagh',
 'Tirhuta',
 'Transport and Map Symbols',
 'Ugaritic',
 'Unified Canadian Aboriginal Syllabics Extended',
 'Vai',
 'Variation Selectors',
 'Vedic Extensions',
 'Vertical Forms',
 'Yi Radicals',
 'Yi Syllables',
 'Yijing Hexagram Symbols',
 'Zanabazar Square']
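
One way such a config option could work is sketched below. Everything in it is hypothetical: `BLOCK_RANGES`, `build_word_pattern` and the selection of blocks are for illustration only and are not part of samewords. The code point ranges are those of the corresponding Unicode blocks.

```python
import re

# Hypothetical lookup table: Unicode block name -> (first, last) code point.
# Only a few blocks from the list above are included in this sketch.
BLOCK_RANGES = {
    "Armenian": (0x0530, 0x058F),
    "Hebrew": (0x0590, 0x05FF),
    "Thai": (0x0E00, 0x0E7F),
    "Tibetan": (0x0F00, 0x0FFF),
}


def build_word_pattern(blocks):
    """Compile a pattern matching runs of `\\w` plus the given Unicode blocks."""
    parts = [r"\w"]
    for name in blocks:
        start, end = BLOCK_RANGES[name]
        # Add the whole block as a range inside the character class.
        parts.append(f"\\u{start:04X}-\\u{end:04X}")
    return re.compile("[" + "".join(parts) + "]+")


# "shalom" written with vowel points (combining marks from the Hebrew block).
word = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"

# Plain \w+ breaks the word at each combining mark.
print(re.findall(r"\w+", word))

# With the Hebrew block enabled, the word is matched as one unit.
print(build_word_pattern(["Hebrew"]).findall(word))
```

The fast path stays a single character class, so only users who opt in to extra blocks pay for the wider match.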

This idea came from #25.

stenskjaer commented 6 years ago

As noted in the closing commit message, the current `re` module has problems identifying word characters; see https://bugs.python.org/issue1693050 and https://bugs.python.org/issue12731.

Moving to the third-party `regex` module means compliance with the Unicode 10 definition of words, as per https://www.unicode.org/reports/tr29/#Word_Boundaries.

This removes the need for a language option, since the existing `\w` match group should now cover any language completely.
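
The `re` limitation referred to above can be reproduced with the standard library alone. The sketch below is not taken from samewords; the sample word and the helper `missed_by_re` are illustrative. It shows that `\w` in `re` rejects combining marks, which splits words in scripts that rely on them. The final part assumes the `regex` package is installed and is skipped otherwise.

```python
import re
import unicodedata

# The Hindi word for "Hindi" in Devanagari: three consonants plus
# two vowel signs and a virama, all of which are combining marks.
word = "\u0939\u093F\u0928\u094D\u0926\u0940"

# The standard-library module splits the word at every combining mark.
print(re.findall(r"\w+", word))


def missed_by_re(text):
    # Illustrative helper: list (name, general category) for every
    # character in `text` that the `re` word class does not match.
    return [
        (unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
        if not re.fullmatch(r"\w", ch)
    ]


for name, category in missed_by_re(word):
    print(category, name)

try:
    import regex  # third-party package, assumed installed via `pip install regex`
except ImportError:
    regex = None

if regex is not None:
    # With `regex`, the same pattern is expected to keep the word whole.
    print(regex.findall(r"\w+", word))
```

All rejected characters fall in the mark categories (`Mc`, `Mn`), which is the gap the two linked bug reports describe.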