martin-gorner / tensorflow-rnn-shakespeare

Code from the "Tensorflow and deep learning - without a PhD, Part 2" session on Recurrent Neural Networks.
Apache License 2.0

UTF-8 and richer character sets? #22

Closed: pacohope closed this issue 6 years ago

pacohope commented 6 years ago

Any hints on what to change to accommodate more than US ASCII? I am working with cookbook text that freely mingles French words (like entrée and à la môde) as well as the "vulgar fraction" characters (like ⅔ and ¼). They are Unicode characters, UTF-8 encoded. It seems like there are two parallel functions (convert_from_alphabet() and convert_to_alphabet()) that need to be adjusted manually to match. I don't really feel like enumerating every single possible Unicode character I might encounter and putting it in the alphabet manually, though. Is there a simpler way?

fabienpesquerel commented 6 years ago

https://github.com/mapmeld/tensorflow-rnn-esperanto/blob/esperanto/my_txtutils.py

You can adapt the dictionary created for Esperanto however you like in order to add French characters such as 'é', 'è', 'à', and so on. For instance, you can replace the beginning of my_txtutils.py with the following:

# Specification of the supported alphabet (subset of ASCII-7 plus a few French letters)
# 10 line feed LF
# 32-64 numbers and punctuation
# 65-90 upper-case letters
# 91-96 more punctuation
# 97-122 lower-case letters
# 123-126 more punctuation
# 127 reserved: line feed is encoded as 127-30, so the French letters must start at 128
# 128-139 some French letters, upper-case and lower-case
FrenchLetters = {
  'é': 128,
  'è': 129,
  'à': 130,
  'ç': 131,
  'œ': 132,
  "'": 133,  # ASCII apostrophe: already covered by the 32-126 range, so never used when encoding
  'â': 134,
  'î': 135,
  'ï': 136,
  'ö': 137,
  'ô': 138,
  'É': 139
}
frenchOrdValues = { }

# allow lookup by ord() number
for letter in FrenchLetters:
    frenchOrdValues[ord(letter)] = FrenchLetters[letter]

# The largest encoded value is 139 - 30 = 109, so the alphabet size
# used by the model must be at least 110.

def convert_from_alphabet(a):
    """Encode a character
    :param a: ord(one character)
    :return: the encoded value for the model
    """
    if a == 9:
        return 1
    if a == 10:
        return 127 - 30  # LF
    elif 32 <= a <= 126:
        return a - 30
    elif a in frenchOrdValues:
        # French letters
        return frenchOrdValues[a] - 30
    else:
        return 0  # unknown

# encoded values:
# unknown = 0
# tab = 1
# space = 2
# all chars from 32 to 126 = c-30
# LF mapped to 127-30
# French letters = FrenchLetters[letter]-30 (98 to 109)
def convert_to_alphabet(c, avoid_tab_and_lf=False):
    """Decode a code point
    :param c: code point
    :param avoid_tab_and_lf: if True, tab is replaced by a space and line feed by a backslash
    :return: decoded character, as an ord() value
    """
    if c == 1:
        return 32 if avoid_tab_and_lf else 9  # space instead of TAB
    if c == 127 - 30:
        return 92 if avoid_tab_and_lf else 10  # backslash instead of LF
    if 32 <= c + 30 <= 126:
        return c + 30
    elif 128 <= c + 30 <= 139:
        # reverse lookup: find the character whose table value matches
        for ordValue in frenchOrdValues:
            if frenchOrdValues[ordValue] == c + 30:
                return ordValue
    return 0  # unknown
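
Since the two functions have to be kept in sync by hand, a quick round-trip check can catch mismatches, for example two characters that end up sharing the same code. The following is only a sketch: `find_roundtrip_failures` is a hypothetical helper that is not part of the repository, and the toy encoder and decoder exist purely to demonstrate a deliberate collision. In practice you would pass convert_from_alphabet and convert_to_alphabet instead.

```python
def find_roundtrip_failures(encode, decode, chars):
    """Return the characters that do not survive encode -> decode.

    encode and decode both work on integers, like the functions above:
    encode takes ord(character), decode returns ord(character).
    """
    failures = []
    for ch in chars:
        if decode(encode(ord(ch))) != ord(ch):
            failures.append(ch)
    return failures


# Toy codec for demonstration only: 'é' is deliberately given the same
# code as line feed, so it cannot be decoded back.
def toy_encode(a):
    if a == 10:
        return 97  # LF
    if 32 <= a <= 126:
        return a - 30
    if a == ord('é'):
        return 97  # collides with LF on purpose
    return 0  # unknown


def toy_decode(c):
    if c == 97:
        return 10  # LF wins, so 'é' is lost
    if 32 <= c + 30 <= 126:
        return c + 30
    return 0  # unknown


failures = find_roundtrip_failures(toy_encode, toy_decode, "a\né")
print(failures)
```

Running the same check over every character you expect in your corpus is a cheap way to confirm that each one has its own code before training.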

(I speak French, by the way, so you can ask me whatever you want in French if you prefer, but that could be unfair to other people who might want to access this information as well :) )

martin-gorner commented 6 years ago

Yes, if you want to change the alphabet to adapt it to your language, you have to hack the my_txtutils.py file. I am not planning on extending the alphabet for this code sample. It is an educational sample and it has to remain simple.
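
For anyone who would rather not enumerate characters by hand, one possible way to hack the file is to derive the alphabet from the training text itself. This is a minimal sketch under stated assumptions, not how the sample works: `build_alphabet`, `encode_text` and `decode_text` are hypothetical names, the text is assumed to have been read as UTF-8 into a Python string, and the rest of the sample would need to use the computed alphabet size in place of its fixed one.

```python
def build_alphabet(text):
    """Map every distinct character in text to a small integer code.

    Code 0 is reserved for characters that were not seen in text.
    """
    chars = sorted(set(text))
    char_to_code = {ch: i + 1 for i, ch in enumerate(chars)}
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return char_to_code, code_to_char


def encode_text(text, char_to_code):
    """Encode a string as a list of integer codes (0 for unknown)."""
    return [char_to_code.get(ch, 0) for ch in text]


def decode_text(codes, code_to_char, unknown="?"):
    """Decode a list of integer codes back into a string."""
    return "".join(code_to_char.get(code, unknown) for code in codes)


# Hypothetical sample standing in for the cookbook corpus.
sample = "entrée, ⅔ cup and ¼ tsp\n"
char_to_code, code_to_char = build_alphabet(sample)
codes = encode_text(sample, char_to_code)
alphabet_size = len(char_to_code) + 1  # +1 for the reserved unknown code

print(alphabet_size)
print(decode_text(codes, code_to_char) == sample)
```

The trade-off is that the codes depend on the corpus, so the two mappings have to be saved alongside a trained model in order to decode its output later.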