Open Ybisalt opened 2 weeks ago
The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead. A workaround is to express the lowercasing via a regex, using tf.strings.regex_replace inside your own standardize function passed to TextVectorization.
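A minimal sketch of that regex-based workaround, assuming only the basic Russian alphabet needs handling. The names UPPER, LOWER and standardize are illustrative, and the [[:punct:]] class is an assumption about which punctuation should be stripped (ASCII only):

```python
import tensorflow as tf

# Assumption: only the 33 letters of the basic Russian alphabet matter here.
# Extend UPPER for other alphabets as needed.
UPPER = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
LOWER = UPPER.lower()  # Python's str.lower is Unicode-aware


def standardize(text):
    """Lowercase ASCII and Russian letters, then strip ASCII punctuation."""
    # Without an encoding argument, tf.strings.lower only touches ASCII letters.
    text = tf.strings.lower(text)
    # One regex_replace per letter pair: replace each uppercase letter
    # with its lowercase counterpart.
    for upper, lower in zip(UPPER, LOWER):
        text = tf.strings.regex_replace(text, upper, lower)
    # Assumption: ASCII punctuation is what should be removed.
    return tf.strings.regex_replace(text, r"[[:punct:]]", "")


print(standardize(tf.constant("Ff Zz Бб Яя!")).numpy().decode("utf-8"))
```

The function can then be passed as TextVectorization(standardize=standardize). It adds one op per letter pair, so it is only practical for small alphabets.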
The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead.
It's not because of tf.strings.lower()! tf.strings.lower() works properly with encoding='utf-8'.
import tensorflow as tf
t = tf.constant("Ff Zz Бб Яя")
print(t) # tf.Tensor(b'Ff Zz \xd0\x91\xd0\xb1 \xd0\xaf\xd1\x8f', shape=(), dtype=string)
tl_1 = tf.strings.lower(t)
tl_2 = tf.strings.lower(t, encoding='utf-8')
print(tl_1.numpy().decode('utf-8')) # ff zz Бб Яя
print(tl_2.numpy().decode('utf-8')) # ff zz бб яя
By default: tf.keras.layers.TextVectorization(encoding='utf-8'). It looks like TextVectorization does not pass the encoding to tf.strings.lower().
Can someone check whether TextVectorization passes the encoding argument to tf.strings.lower()?
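One way to check this, and to work around it in the meantime, is to compare the default layer against one that uses a custom standardize callable calling tf.strings.lower with encoding='utf-8'. This is only a sketch: the function name lower_utf8_and_strip_punctuation is made up for illustration, and the [[:punct:]] pattern is an assumption that approximates the default punctuation stripping.

```python
import tensorflow as tf


def lower_utf8_and_strip_punctuation(text):
    """Unicode-aware lowercasing followed by ASCII punctuation removal."""
    lowered = tf.strings.lower(text, encoding="utf-8")
    # Assumption: stripping ASCII punctuation is close enough to the default.
    return tf.strings.regex_replace(lowered, r"[[:punct:]]", "")


data = tf.data.Dataset.from_tensor_slices(["Ff Zz Бб Яя"]).batch(1)

# Default standardization ('lower_and_strip_punctuation').
default_layer = tf.keras.layers.TextVectorization()
default_layer.adapt(data)
print(default_layer.get_vocabulary())

# Custom standardization with an explicit UTF-8 encoding.
custom_layer = tf.keras.layers.TextVectorization(
    standardize=lower_utf8_and_strip_punctuation
)
custom_layer.adapt(data)
print(custom_layer.get_vocabulary())
```

If the first vocabulary still contains 'Бб' and 'Яя' while the second contains 'бб' and 'яя', the default path is lowercasing without the UTF-8 encoding.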
keras.layers.TextVectorization does not convert Cyrillic characters to lowercase with standardize='lower_and_strip_punctuation'. The deprecated keras.preprocessing.text.Tokenizer does.