tensorflow / text

Making text a first-class citizen in TensorFlow.
https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro
Apache License 2.0
1.23k stars, 344 forks

Feature request: Modify `text.regex_split_with_offsets()` behavior to be in line with `tf.strings.length()` #1245

Open briango28 opened 10 months ago

briango28 commented 10 months ago

`text.regex_split_with_offsets()` currently returns `begin` and `end` as `tf.int64` tensors whose values are byte offsets.

`tf.strings.length()`, on the other hand, returns a `tf.int32` tensor that counts length in either bytes or UTF-8 characters, depending on the value of its `unit` parameter.

So this would actually be two separate requests:

  1. Change the return types of `text.regex_split_with_offsets()` to `tf.int32`, removing the need for a cast when comparing with `tf.strings.length()`. I doubt there will be a use case for strings longer than INT32_MAX in the foreseeable future.
  2. Add a parameter `unit: Literal["BYTE", "UTF8_CHAR"] = "BYTE"`, matching the behavior of `tf.strings.length()` and `tf.strings.substr()`. Since the regular expressions are already interpreted as UTF-8, I think it would make sense to add a layer of abstraction to facilitate slicing by UTF-8 character index.
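To illustrate what the two units would mean for the returned offsets, here is a minimal pure-Python sketch. It does not use TensorFlow; `token_offsets` is a hypothetical helper written only for this illustration, and splitting on whitespace is an assumption standing in for an arbitrary delimiter regex.

```python
import re


def token_offsets(text, unit="BYTE"):
    """Hypothetical helper: whitespace-split `text` and report (token, begin, end).

    `unit` mirrors the values accepted by tf.strings.length():
    "BYTE" counts UTF-8 bytes, "UTF8_CHAR" counts Unicode characters.
    """
    if unit not in ("BYTE", "UTF8_CHAR"):
        raise ValueError(f"unsupported unit: {unit!r}")
    spans = []
    for match in re.finditer(r"\S+", text):
        # Python's str regex reports character indices.
        begin, end = match.start(), match.end()
        if unit == "BYTE":
            # Byte offset = number of UTF-8 bytes in the prefix before the index.
            begin = len(text[:begin].encode("utf-8"))
            end = len(text[:end].encode("utf-8"))
        spans.append((match.group(), begin, end))
    return spans


text = "h\u00e9llo w\u00f6rld"  # "héllo wörld": two 2-byte characters
byte_spans = token_offsets(text, unit="BYTE")
char_spans = token_offsets(text, unit="UTF8_CHAR")
print(byte_spans)
print(char_spans)
```

The byte offsets are only valid for slicing the encoded bytes, while the character offsets are only valid for slicing the decoded string, which is why mixing them with a character-based length requires a conversion step.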

briango28 commented 10 months ago

Follow-up

I ended up converting the `begin` and `end` indices from byte offsets to UTF-8 character offsets with:

offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF8')[1]
begin = tf.map_fn(lambda indices: tf.where(tf.expand_dims(indices, 1) == offsets)[:, 1], begin)

Since `tf.strings.unicode_decode_with_offsets()` returns its offsets as `tf.int64`, I'm not so sure about no. 1 anymore :/
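For a single (scalar) string, the same byte-to-character conversion can also be sketched with `tf.searchsorted` instead of an equality match. This is only an illustration under stated assumptions: the `begin_bytes` and `end_bytes` values are hard-coded here to what a whitespace split of the sample string would produce, rather than taken from an actual `text.regex_split_with_offsets()` call, and batched or ragged inputs are not handled.

```python
import tensorflow as tf

# Sample input "héllo wörld"; é and ö each take two bytes in UTF-8.
txt = tf.constant("h\u00e9llo w\u00f6rld")

# Byte offset at which each character starts; returned as tf.int64.
_, char_starts = tf.strings.unicode_decode_with_offsets(txt, "UTF-8")

# Assumed byte offsets of the two tokens (hard-coded for illustration).
begin_bytes = tf.constant([0, 7], dtype=tf.int64)
end_bytes = tf.constant([6, 13], dtype=tf.int64)

# The character index of a byte offset is the number of characters that
# start before it, which is a left-sided binary search in char_starts.
begin_chars = tf.searchsorted(char_starts, begin_bytes, side="left")
end_chars = tf.searchsorted(char_starts, end_bytes, side="left")

print(char_starts.dtype, begin_chars.dtype)
print(begin_chars.numpy(), end_chars.numpy())
```

Note that the inputs to the search must share a dtype, so the `tf.int64` offsets from `tf.strings.unicode_decode_with_offsets()` pair naturally with `tf.int64` byte offsets, while the result defaults to `tf.int32` via the `out_type` argument of `tf.searchsorted`.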