briango28 opened 10 months ago
Follow-up

Having converted `begin` & `end` indices from BYTE to UTF8 with

```python
import tensorflow as tf

# offsets holds the byte index at which each decoded character starts (tf.int64)
offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF8')[1]
begin = tf.map_fn(lambda indices: tf.where(tf.expand_dims(indices, 1) == offsets)[:, 1], begin)
```

where `tf.strings.unicode_decode_with_offsets()` returns offsets with type `tf.int64`, I'm not so sure about no. 1 below anymore :/
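For reference, a minimal self-contained sketch of the same byte-to-character conversion on a single string (`txt`, `byte_begin`, and the values here are made up for illustration, not taken from the library):

```python
import tensorflow as tf

txt = tf.constant("héllo wörld")                  # 'é' and 'ö' take 2 bytes each
byte_begin = tf.constant([0, 7], dtype=tf.int64)  # byte starts of "héllo", "wörld"

# offsets[i] is the byte index where character i starts.
_, offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF-8')

# A byte offset's position within `offsets` is its UTF-8 character index.
char_begin = tf.where(tf.expand_dims(byte_begin, 1) == offsets)[:, 1]
print(char_begin)  # tf.Tensor([0 6], shape=(2,), dtype=int64) -- still tf.int64
```

Note the converted indices come back as `tf.int64` as well, which is why no. 1 is now in doubt.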
`text.regex_split_with_offsets()` currently returns `begin` and `end` as `tf.int64` tensors that count indices in bytes. `tf.strings.length()`, on the other hand, returns a `tf.int32` tensor which counts lengths in either bytes or UTF8 characters according to the value of the parameter `unit`.
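To illustrate the `unit` behavior of `tf.strings.length()` (the sample string is made up):

```python
import tensorflow as tf

s = tf.constant("héllo")  # 5 characters, 6 bytes ('é' is 2 bytes in UTF-8)
print(tf.strings.length(s))                    # tf.Tensor(6, shape=(), dtype=int32)
print(tf.strings.length(s, unit="UTF8_CHAR"))  # tf.Tensor(5, shape=(), dtype=int32)
```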
So this would actually be two separate requests:

1. Change the `begin` and `end` offsets returned by `text.regex_split_with_offsets()` from `tf.int64` to `tf.int32`, removing the need for a cast when comparing with `tf.strings.length()` (see the sketch after this list). I doubt there will be a use case for strings longer than `INT32_MAX` in the foreseeable future.
2. Add a parameter `unit: Literal["BYTE", "UTF8_CHAR"] = "BYTE"`, matching the behavior of `tf.strings.length()` and `tf.strings.substr()`. Seeing that the regular expressions are already being interpreted as UTF-8, I think it would make sense to add a layer of abstraction to facilitate slicing by UTF-8 character index.
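To make the friction concrete, here is a hedged sketch of today's behavior (the input string and pattern are made up; it assumes the current `tf.int64` byte offsets described above):

```python
import tensorflow as tf
import tensorflow_text as text  # pip package: tensorflow-text

txt = tf.constant(["héllo wörld"])
tokens, begin, end = text.regex_split_with_offsets(txt, r"\s")

print(begin.dtype)                   # tf.int64 -- offsets counted in bytes
print(tf.strings.length(txt).dtype)  # tf.int32

# Request no. 1 in practice: the dtype mismatch forces a cast today.
last_end = tf.reduce_max(end, axis=1)                         # tf.int64
print(tf.cast(last_end, tf.int32) == tf.strings.length(txt))  # [ True]
```

Under request no. 2, the same offsets could instead be produced in UTF-8 characters and fed straight into `tf.strings.substr(..., unit="UTF8_CHAR")`, without the decode-and-map round trip shown in the follow-up above.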