Closed ageofneil closed 3 years ago
Hi ageofneil,
You are correct that the algorithm uses sets, e.g. {"best", "is", "the", "TV"}. Currently the deduplication happens inside the TensorFlow training op, so it doesn't matter if the input tensor (e.g. one produced by tf.strings.split) contains duplicates. If anything, deduplicating in a separate op would only add unnecessary computation time.
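To illustrate why duplicates in the input are harmless, here is a minimal pure-Python sketch of set semantics. It is not the library's training op: `to_categorical_set` is a hypothetical helper, and plain `str.split()` stands in for tf.strings.split, assuming whitespace tokenization.

```python
def to_categorical_set(tokens):
    """Collapse a list of tokens into the set of distinct tokens.

    Hypothetical helper for illustration only; it mimics the effect of
    deduplication, not how the training op implements it.
    """
    return frozenset(tokens)


# str.split() is a stand-in for tf.strings.split (whitespace tokenization).
with_duplicates = "the TV is the best".split()   # "the" appears twice
already_unique = ["the", "TV", "is", "best"]     # deduplicated by hand

# Both inputs collapse to the same set, so deduplicating beforehand
# changes nothing about what a set-based learner would see.
same_set = to_categorical_set(with_duplicates) == to_categorical_set(already_unique)

# Whitespace splitting does not change case, so "The" and "the" remain
# distinct tokens unless the text is lowercased first.
mixed_case = to_categorical_set("The TV is the best".split())
```

Note that in this sketch only exact duplicates are merged. Case folding, if wanted, would be a separate preprocessing step.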
Hope that helps! I'm closing this issue; feel free to reopen it if you would like more clarification.
Hello,
The intermediate_colab ("Combine With Other Models") tutorial does a good job of showing how to preprocess a string into a categorical set. This is the example function provided:
From my understanding, tf.strings.split isn't the best way of doing this because it won't drop duplicates. For example, a text feature "The TV is the best" would be represented by {"The", "TV", "is", "the", "best"} when using tf.strings.split. According to this article, it should instead be transformed to the following categorical set: {"best", "is", "the", "TV"}. Is dropping duplicates necessary?