Closed: mheilman closed this issue 7 years ago
You are right. I'll update the unknown token to 1 (I would prefer the unknown token to be stable across different datasets, etc., in case it needs special handling).
Note: the code is currently being moved to tensorflow.contrib.skflow, so I'll make the update there.
Hi, it looks like this is still an issue in tensorflow.contrib.learn. Shall I file an issue (or PR) there?
I think this could be addressed by changing this line to `self._mapping = {None: 0, unknown_token: 1}` (note: just having `{unknown_token: 1}` wouldn't work, because the indices for new terms are chosen during fitting by `len(self._mapping)`, so the first new term would get index 1 and collide with the unknown token).
Example using tensorflow 0.8.0:

```python
In [115]: vp = VocabularyProcessor(10)

In [116]: vp.fit(["a dog ran in the park"])
Out[116]: <tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor at 0x118a3ac18>

In [117]: list(vp.transform(["a dog ran in the park"]))
Out[117]: [array([1, 2, 3, 4, 5, 6, 0, 0, 0, 0])]

In [118]: list(vp.transform(["a cat ran in the park"]))
Out[118]: [array([1, 0, 3, 4, 5, 6, 0, 0, 0, 0])]
```

Note how the out-of-vocabulary word "cat" gets the same id (0) as the trailing padding.
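The proposed fix can be sketched in plain Python. This is a toy stand-in, not the actual `tensorflow.contrib.learn` implementation; the class name `MiniVocabularyProcessor` and the `<UNK>` token are illustrative. It reserves 0 for padding and 1 for unknowns, with new terms indexed by `len(self._mapping)` during fitting, as in the original code:

```python
class MiniVocabularyProcessor:
    """Toy vocabulary processor: 0 = padding, 1 = unknown token."""

    def __init__(self, max_document_length, unknown_token="<UNK>"):
        self.max_document_length = max_document_length
        # Reserve 0 for padding and 1 for unknowns, mirroring the
        # proposed initial mapping {None: 0, unknown_token: 1}.
        self._mapping = {None: 0, unknown_token: 1}

    def fit(self, documents):
        for doc in documents:
            for token in doc.split():
                if token not in self._mapping:
                    # New terms get the current mapping size as their id,
                    # so real vocabulary ids start at 2.
                    self._mapping[token] = len(self._mapping)
        return self

    def transform(self, documents):
        for doc in documents:
            # Unseen tokens fall back to id 1, never to the padding id 0.
            ids = [self._mapping.get(tok, 1) for tok in doc.split()]
            ids = ids[: self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            yield ids


vp = MiniVocabularyProcessor(10)
vp.fit(["a dog ran in the park"])
print(list(vp.transform(["a cat ran in the park"])))
# -> [[2, 1, 4, 5, 6, 7, 0, 0, 0, 0]]
```

With this scheme, "cat" maps to 1 (unknown), which is distinct from the 0s used for padding.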
Yes, this is still an issue. Will someone change this code in the future, or are the maintainers happy with the current implementation of treating unknown tokens and padding the same?
Sorry, this code was removed from TensorFlow in favor of preprocessing with either tools from scikit-learn or future tools in TensorFlow.
Currently, in the VocabularyProcessor for text input, zero is used both as the padding symbol and as the unknown-word symbol.
This seems problematic since the padding symbol would be useful for inferring the sequence length (cf. #141).
It seems like it would be better to have the unknown token map to 1 (or maybe `vocab_size`) instead.
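To illustrate why the distinction matters for inferring sequence length: a minimal sketch, assuming right-padding with id 0 and a separate id for unknown words (the function name is hypothetical, not part of any TensorFlow API):

```python
def sequence_length(ids, pad_id=0):
    """Count tokens before the padding begins (assumes right-padding)."""
    length = 0
    for i in ids:
        if i == pad_id:
            break
        length += 1
    return length


# With unknown mapped to 1 and padding to 0, the length is recovered correctly:
print(sequence_length([2, 1, 4, 5, 6, 7, 0, 0, 0, 0]))  # -> 6

# With the current behavior (unknown also 0), an unknown word mid-sentence
# would be indistinguishable from padding, and the inferred length is wrong:
print(sequence_length([1, 0, 3, 4, 5, 6, 0, 0, 0, 0]))  # -> 1
```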