tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Clarification on Consuming Text as Categorical Sets #46

Closed ageofneil closed 3 years ago

ageofneil commented 3 years ago

Hello,

The intermediate_colab ("Combine With Other Models") tutorial does a good job at showing how to preprocess a string to a categorical set. This is the example function provided:

import tensorflow as tf

def prepare_dataset(example):
  # Rescale the label to {0, 1}.
  label = (example["label"] + 1) // 2
  # Tokenize the sentence into a ragged tensor of words.
  return {"sentence": tf.strings.split(example["sentence"])}, label

train_ds = all_ds["train"].batch(64).map(prepare_dataset)
test_ds = all_ds["validation"].batch(64).map(prepare_dataset)

From my understanding, tf.strings.split isn't the best way of doing this because it won't drop duplicates. For example, a text feature "The TV is the best" would be represented as ["The", "TV", "is", "the", "best"] when using tf.strings.split. According to this article, it should instead be transformed into the categorical set {"best", "is", "the", "TV"}.
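For illustration, here is a minimal plain-Python sketch of the distinction I mean (not the TF-DF API; I've lowercased the sentence so the duplicate is visible, since splitting is case-sensitive):

```python
# Illustrative sketch: a raw token split keeps duplicates, while a
# categorical set drops them. The real pipeline would use
# tf.strings.split on tensors instead of str.split.
sentence = "the TV is the best"

tokens = sentence.split()        # keeps duplicates, preserves order
categorical_set = set(tokens)    # drops duplicates, unordered

print(tokens)           # ['the', 'TV', 'is', 'the', 'best']
print(len(tokens), len(categorical_set))  # 5 tokens, 4 unique items
```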

Is dropping duplicates necessary?

arvnds commented 3 years ago

Hi ageofneil,

You are correct that the algorithm uses sets, i.e. {"best", "is", "the", "TV"}. Currently the deduplication happens as part of the TensorFlow training op, so it doesn't matter if the input tensor (e.g. one produced by tf.strings.split) contains duplicates. If anything, deduplicating in a separate op would add unnecessary computation time.
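To illustrate why duplicates are harmless, here is a hedged plain-Python sketch (the function and names are illustrative, not the actual TF-DF internals): a categorical-set split condition only tests membership, so a token appearing once or many times produces the same routing decision.

```python
# Illustrative sketch (not the actual TF-DF internals): a categorical-set
# split condition asks whether the example contains any item from the
# condition's "positive" set, so duplicate tokens cannot change the result.
def categorical_set_condition(tokens, positive_items):
    # Membership test: True if any token belongs to the positive set.
    return any(t in positive_items for t in tokens)

positive = {"best", "great"}
with_duplicates = ["the", "TV", "is", "the", "best", "best"]
deduplicated = list(set(with_duplicates))

# Both inputs evaluate identically, so deduplicating up front only adds work.
assert categorical_set_condition(with_duplicates, positive) == \
       categorical_set_condition(deduplicated, positive)
```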

Hope that helps! I'm closing this issue, feel free to reopen if you would like more clarification.