graydon / sixbit

a crate for small packed strings

extended latin codepages idea #4

Open joseluis opened 1 year ago

joseluis commented 1 year ago

I'm not sure how well this fits the current design, but it might be useful to have two additional extended Latin code pages, one for uppercase and one for lowercase characters, each adding the common diacritics and special characters used in European languages. E.g. for the lowercase page:

pub(crate) const LOWER_LATIN_EXT : [char; 64] = [
    '\0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '_',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',

    // 20x diacritics:
    // acute accent: en, es, nl
    'á', 'é', 'í', 'ó', 'ú',
    // grave accent: en, fr, ca, it, pt
    'à', 'è', 'ì', 'ò', 'ù',
    // diaeresis: ca, de, es, en, eu, gl, nl, pt, sw
    'ä', 'ë', 'ï', 'ö', 'ü',
    // circumflex: fr, pt
    'â', 'ê', 'î', 'ô', 'û',

    // 6x special chars
    'ñ', // es, gl
    'ç', // ca, eu, fr, pt
    'ß', // de
    'ø', // da, no
    'æ', // da, en, no
    'å', // da, no, fi, sw

    // not enough space for:
    // ã, õ: pt
    // ð: is
    // ý, ǫ: fo
];
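
To make the idea concrete, here is a minimal sketch of how a 64-entry page like this could map characters to 6-bit codes and pack them into an integer. The helpers `code_of`, `pack` and `unpack` are hypothetical names invented for illustration, and the `u128` layout (21 characters, lowest bits first, code 0 reserved as padding) is an assumption, not sixbit's actual API or encoding. The table is the one above, with the comma after 'z' added so that it compiles.

```rust
/// The proposed lowercase page from the comment above.
const LOWER_LATIN_EXT: [char; 64] = [
    '\0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '_',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    'á', 'é', 'í', 'ó', 'ú',
    'à', 'è', 'ì', 'ò', 'ù',
    'ä', 'ë', 'ï', 'ö', 'ü',
    'â', 'ê', 'î', 'ô', 'û',
    'ñ', 'ç', 'ß', 'ø', 'æ', 'å',
];

/// Hypothetical helper: the 6-bit code of `c` in `page`, if present.
fn code_of(page: &[char; 64], c: char) -> Option<u8> {
    page.iter().position(|&p| p == c).map(|i| i as u8)
}

/// Hypothetical helper: pack up to 21 characters (6 bits each, 126 bits
/// total) into a u128. Returns None if the string is too long or contains
/// a character outside the page. Code 0 ('\0') is treated as padding and
/// is therefore rejected as input.
fn pack(page: &[char; 64], s: &str) -> Option<u128> {
    let mut out: u128 = 0;
    let mut n: u32 = 0;
    for c in s.chars() {
        if n == 21 {
            return None;
        }
        let code = code_of(page, c)?;
        if code == 0 {
            return None;
        }
        out |= (code as u128) << (6 * n);
        n += 1;
    }
    Some(out)
}

/// Hypothetical helper: inverse of `pack`; stops at the first zero code.
fn unpack(page: &[char; 64], mut bits: u128) -> String {
    let mut s = String::new();
    while bits != 0 {
        let code = (bits & 0x3f) as usize;
        if code == 0 {
            break;
        }
        s.push(page[code]);
        bits >>= 6;
    }
    s
}
```

Under these assumptions, `pack(&LOWER_LATIN_EXT, "año")` succeeds and round-trips through `unpack`, while a string containing 'ã' is rejected because that character did not fit in this page.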
joseluis commented 1 year ago

Another possible variation would gain more useful space by removing the digits. That would leave enough room to complete the Portuguese, Icelandic and Esperanto alphabets.

pub(crate) const LOWER_LATIN_EXT_B : [char; 64] = [
    // 28
    '\0', '_',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',

    // vowels with diacritics: 24
    //
    // acute accent: en, es, nl
    'á', 'é', 'í', 'ó', 'ú',
    // grave accent: en, fr, ca, it, pt
    'à', 'è', 'ì', 'ò', 'ù',
    // diaeresis: ca, de, es, en, eu, gl, nl, pt, sw
    'ä', 'ë', 'ï', 'ö', 'ü',
    // circumflex: fr, pt
    'â', 'ê', 'î', 'ô', 'û',
    // tilde: fr, pt
    'ã', 'õ',
    // circle: da, fi, no, sw
    'å',
    // breve: eo, be
    'ŭ',

    // consonants with diacritics: 6
    //
    // circumflex: eo
    'ĉ', 'ĝ', 'ĥ', 'ĵ', 'ŝ',
    // tilde: es, gl
    'ñ',

    // other: 5
    'ç', // ca, eu, fr, pt
    'ß', // de
    'ø', // da, no
    'æ', // da, en, no
    'ð', // is

    // there's space for 1 more
    ' '
];

EDIT: forgot about ŝ!
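
Since the per-group counts in the comments are easy to get wrong when a character is added later, a small check can help. The `[char; 64]` type already forces the length, so the only thing left to verify is that no character appears twice. This is a sketch only: `first_duplicate` is a hypothetical helper, not something that exists in sixbit, and the table is the second variant above with the comma after 'z' added.

```rust
/// The digit-free variant proposed above.
const LOWER_LATIN_EXT_B: [char; 64] = [
    '\0', '_',
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    'á', 'é', 'í', 'ó', 'ú',
    'à', 'è', 'ì', 'ò', 'ù',
    'ä', 'ë', 'ï', 'ö', 'ü',
    'â', 'ê', 'î', 'ô', 'û',
    'ã', 'õ',
    'å',
    'ŭ',
    'ĉ', 'ĝ', 'ĥ', 'ĵ', 'ŝ',
    'ñ',
    'ç', 'ß', 'ø', 'æ', 'ð',
    ' ',
];

/// Hypothetical helper: returns the first character that repeats an
/// earlier entry of the page, or None if all 64 entries are distinct.
/// A quadratic scan is fine for a table this small.
fn first_duplicate(page: &[char; 64]) -> Option<char> {
    for (i, &c) in page.iter().enumerate() {
        if page[..i].contains(&c) {
            return Some(c);
        }
    }
    None
}
```

A check like `assert_eq!(first_duplicate(&LOWER_LATIN_EXT_B), None)` could run in a unit test, so an accidental repeat shows up as a test failure instead of a code that silently never decodes.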