tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Clarification on Consuming Text as Categorical Sets #46

Closed ageofneil closed 3 years ago

ageofneil commented 3 years ago

Hello,

The intermediate_colab ("Combine With Other Models") tutorial does a good job at showing how to preprocess a string to a categorical set. This is the example function provided:

import tensorflow as tf

def prepare_dataset(example):
  # Rescale the label to {0, 1}.
  label = (example["label"] + 1) // 2
  # Tokenize the sentence into a ragged tensor of words.
  return {"sentence": tf.strings.split(example["sentence"])}, label

train_ds = all_ds["train"].batch(64).map(prepare_dataset)
test_ds = all_ds["validation"].batch(64).map(prepare_dataset)

From my understanding, tf.strings.split isn't the best way of doing this because it won't drop duplicates. For example, a text feature "The TV is the best" would be represented as ["The", "TV", "is", "the", "best"] when using tf.strings.split. According to this article, it should instead be transformed into the categorical set {"best", "is", "the", "TV"}.
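For illustration, here is a minimal plain-Python sketch of the distinction I mean (not the TF-DF API; I've lowercased the sentence so the duplicate is visible, since splitting is case-sensitive):

```python
# Illustrative sketch: a raw token split keeps duplicates, while a
# categorical set drops them. The real pipeline would use
# tf.strings.split on tensors instead of str.split.
sentence = "the TV is the best"

tokens = sentence.split()        # keeps duplicates, preserves order
categorical_set = set(tokens)    # drops duplicates, unordered

print(tokens)           # ['the', 'TV', 'is', 'the', 'best']
print(len(tokens), len(categorical_set))  # 5 tokens, 4 unique items
```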

Is dropping duplicates necessary?

arvnds commented 3 years ago

Hi ageofneil,

You are correct that the algorithm uses sets, i.e. {"best", "is", "the", "TV"}. Currently the deduplication happens as part of the TensorFlow training op, so it doesn't matter if the input tensor (e.g. one produced by tf.strings.split) contains duplicates. If anything, deduplicating in a separate op would add unnecessary computation time.
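To illustrate why duplicates are harmless, here is a hedged plain-Python sketch (the function and names are illustrative, not the actual TF-DF internals): a categorical-set split condition only tests membership, so a token appearing once or many times produces the same routing decision.

```python
# Illustrative sketch (not the actual TF-DF internals): a categorical-set
# split condition asks whether the example contains any item from the
# condition's "positive" set, so duplicate tokens cannot change the result.
def categorical_set_condition(tokens, positive_items):
    # Membership test: True if any token belongs to the positive set.
    return any(t in positive_items for t in tokens)

positive = {"best", "great"}
with_duplicates = ["the", "TV", "is", "the", "best", "best"]
deduplicated = list(set(with_duplicates))

# Both inputs evaluate identically, so deduplicating up front only adds work.
assert categorical_set_condition(with_duplicates, positive) == \
       categorical_set_condition(deduplicated, positive)
```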

Hope that helps! I'm closing this issue, feel free to reopen if you would like more clarification.