Closed: mheilman closed this issue 7 years ago
You are right. I'll update the unknown token to 1 (I would prefer the unknown token to be stable across different datasets, etc., in case it needs special handling).
Note: the code is currently being moved to tensorflow.contrib.skflow, so I'll make the update there.
Hi, it looks like this is still an issue in tensorflow.contrib.learn. Shall I file an issue (or PR) there?
I think this could be addressed by changing this line to `self._mapping = {None: 0, unknown_token: 1}` (note: just having `{unknown_token: 1}` wouldn't work, because the indices for new terms are chosen during fitting by `len(self._mapping)`, so the first new term would get index 1 and collide with the unknown token).
Example using tensorflow 0.8.0:

```python
In [115]: vp = VocabularyProcessor(10)

In [116]: vp.fit(["a dog ran in the park"])
Out[116]: <tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor at 0x118a3ac18>

In [117]: list(vp.transform(["a dog ran in the park"]))
Out[117]: [array([1, 2, 3, 4, 5, 6, 0, 0, 0, 0])]

In [118]: list(vp.transform(["a cat ran in the park"]))
Out[118]: [array([1, 0, 3, 4, 5, 6, 0, 0, 0, 0])]
```

Note how the out-of-vocabulary word "cat" gets the same id (0) as the trailing padding.
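The proposed fix can be sketched in plain Python. This is a toy stand-in, not the actual `tensorflow.contrib.learn` implementation; the class name `MiniVocabularyProcessor` and the `<UNK>` token are illustrative. It reserves 0 for padding and 1 for unknowns, with new terms indexed by `len(self._mapping)` during fitting, as in the original code:

```python
class MiniVocabularyProcessor:
    """Toy vocabulary processor: 0 = padding, 1 = unknown token."""

    def __init__(self, max_document_length, unknown_token="<UNK>"):
        self.max_document_length = max_document_length
        # Reserve 0 for padding and 1 for unknowns, mirroring the
        # proposed initial mapping {None: 0, unknown_token: 1}.
        self._mapping = {None: 0, unknown_token: 1}

    def fit(self, documents):
        for doc in documents:
            for token in doc.split():
                if token not in self._mapping:
                    # New terms get the current mapping size as their id,
                    # so real vocabulary ids start at 2.
                    self._mapping[token] = len(self._mapping)
        return self

    def transform(self, documents):
        for doc in documents:
            # Unseen tokens fall back to id 1, never to the padding id 0.
            ids = [self._mapping.get(tok, 1) for tok in doc.split()]
            ids = ids[: self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            yield ids


vp = MiniVocabularyProcessor(10)
vp.fit(["a dog ran in the park"])
print(list(vp.transform(["a cat ran in the park"])))
# -> [[2, 1, 4, 5, 6, 7, 0, 0, 0, 0]]
```

With this scheme, "cat" maps to 1 (unknown), which is distinct from the 0s used for padding.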
Yes, this is still an issue. Will someone change this code in the future, or are the maintainers happy with the current implementation of treating unknown tokens and padding the same?
Sorry, this code was removed from TensorFlow in favor of preprocessing with either tools from scikit-learn or future tools in TensorFlow.
Currently, in the VocabularyProcessor for text input, zero is used both as the padding symbol and as the unknown-word symbol.
This seems problematic since the padding symbol would be useful for inferring the sequence length (cf. #141).
It seems like it would be better to have the unknown token map to 1 (or maybe `vocab_size`) instead.
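To illustrate why the distinction matters for inferring sequence length: a minimal sketch, assuming right-padding with id 0 and a separate id for unknown words (the function name is hypothetical, not part of any TensorFlow API):

```python
def sequence_length(ids, pad_id=0):
    """Count tokens before the padding begins (assumes right-padding)."""
    length = 0
    for i in ids:
        if i == pad_id:
            break
        length += 1
    return length


# With unknown mapped to 1 and padding to 0, the length is recovered correctly:
print(sequence_length([2, 1, 4, 5, 6, 7, 0, 0, 0, 0]))  # -> 6

# With the current behavior (unknown also 0), an unknown word mid-sentence
# would be indistinguishable from padding, and the inferred length is wrong:
print(sequence_length([1, 0, 3, 4, 5, 6, 0, 0, 0, 0]))  # -> 1
```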