TextVectorization does not convert Cyrillic characters to lowercase

keras-team / keras

TextVectorization does not convert Cyrillic characters to lowercase #19668

Open Ybisalt opened 2 weeks ago

Ybisalt commented 2 weeks ago

keras.layers.TextVectorization does not convert Cyrillic characters to lowercase with standardize='lower_and_strip_punctuation'. The deprecated keras.preprocessing.text.Tokenizer does lowercase them.

#==========================================

from tensorflow.keras.layers import TextVectorization

tokenizer = TextVectorization(split='character', standardize='lower_and_strip_punctuation')

tokenizer.adapt(["Zz, Aa"])   # Latin
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'z', 'a', ' ']

tokenizer.adapt(["Яя, Аа"])   # Cyrillic
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'я', 'а', 'Я', 'А', ' ']

from tensorflow.keras.preprocessing.text import Tokenizer  # deprecated
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(["Яя, Аа"])   # Cyrillic
print(tokenizer.index_word)   # {1: 'я', 2: 'а', 3: ',', 4: ' '}

#==========================================

fchollet commented 1 week ago

The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead. A workaround you could use is to express the lowercasing via a regex and then use tf.strings.regex_replace inside your own standardize function passed to TextVectorization.

Ybisalt commented 1 week ago

> The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead.

It's not because of tf.strings.lower()! tf.strings.lower() works properly with encoding='utf-8'.

import tensorflow as tf

t = tf.constant("Ff Zz Бб Яя")
print(t)     # tf.Tensor(b'Ff Zz \xd0\x91\xd0\xb1 \xd0\xaf\xd1\x8f', shape=(), dtype=string)

tl_1 = tf.strings.lower(t)
tl_2 = tf.strings.lower(t, encoding='utf-8')

print(tl_1.numpy().decode('utf-8'))     # ff zz Бб Яя
print(tl_2.numpy().decode('utf-8'))     # ff zz бб яя
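
For comparison, here is a rough analogy in plain Python, with no TensorFlow involved. It rests on the assumption that tf.strings.lower() without an encoding argument treats the input as ASCII bytes, much like bytes.lower() does. The variable names are only illustrative.

```python
text = "Ff Zz Бб Яя"

# bytes.lower() only maps ASCII A-Z, so the multi-byte UTF-8 sequences
# that encode the Cyrillic letters pass through unchanged.
ascii_only = text.encode("utf-8").lower().decode("utf-8")

# str.lower() is Unicode-aware and lowercases the Cyrillic letters too.
unicode_aware = text.lower()

print(ascii_only)      # ff zz Бб Яя
print(unicode_aware)   # ff zz бб яя
```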

By default, tf.keras.layers.TextVectorization uses encoding='utf-8'. It looks like TextVectorization does not pass the encoding to tf.strings.lower().
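
Whether or not the layer forwards the encoding, a possible workaround is a custom standardize callable, as suggested earlier in the thread. The sketch below is not the regex-based lowercasing proposed there; it calls tf.strings.lower() with encoding='utf-8' instead and uses tf.strings.regex_replace only to strip punctuation. The function name lower_utf8_and_strip_punctuation is hypothetical, and the punctuation set (Python's string.punctuation) is an assumption that may differ from what the built-in 'lower_and_strip_punctuation' mode removes.

```python
import re
import string

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization


def lower_utf8_and_strip_punctuation(text):
    """Hypothetical standardize callable: UTF-8 aware lowercasing, then punctuation removal."""
    lowered = tf.strings.lower(text, encoding="utf-8")
    # Assumption: ASCII punctuation from string.punctuation is the set to remove.
    punctuation_pattern = "[" + re.escape(string.punctuation) + "]"
    return tf.strings.regex_replace(lowered, punctuation_pattern, "")


vectorizer = TextVectorization(
    split="character",
    standardize=lower_utf8_and_strip_punctuation,
)
vectorizer.adapt(["Яя, Аа"])   # Cyrillic
print(vectorizer.get_vocabulary())
```

With this callable the vocabulary should contain only the lowercase Cyrillic letters and the space, in addition to the padding and [UNK] entries.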

Ybisalt commented 2 hours ago

Can someone check whether TextVectorization passes the encoding argument to tf.strings.lower()?