keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

Tokenizer.fit_on_texts splits a single string into chars when char_level=False #27

Open tRosenflanz opened 5 years ago

tRosenflanz commented 5 years ago

From: https://github.com/keras-team/keras/issues/10768 by @hadaev8

If a single string is passed to fit_on_texts or texts_to_sequences, the Tokenizer fits or transforms it character by character, regardless of the char_level setting. This happens because both methods expect a list of strings and iterate over their argument, so a bare string is consumed one character at a time. The relevant line for fitting: https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L205

and this one for transforming: https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L293
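To illustrate the mechanism without Keras, here is a minimal sketch of a fit loop that iterates over its input. The function `count_tokens` is a hypothetical stand-in written for this example, not the keras_preprocessing source; it only assumes that the real method loops over `texts` in a similar way.

```python
# Simplified stand-in for a fit loop; NOT the keras_preprocessing code.
def count_tokens(texts):
    """Count whitespace-separated tokens in an iterable of strings."""
    counts = {}
    for text in texts:  # a bare str yields one character per iteration
        for token in text.split():  # a lone space splits to nothing
            counts[token] = counts.get(token, 0) + 1
    return counts

# Passing a bare string: each "text" is a single character.
single = count_tokens('check check fail')

# Passing a list with one string: each "text" is the whole sentence.
wrapped = count_tokens(['check check fail'])
```

With the bare string, `single` holds per-character counts; with the list, `wrapped` holds per-word counts. That is the same difference the issue reports.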

Reproducible code illustrating the problem with fit_on_texts:

from keras.preprocessing.text import Tokenizer
text = 'check check fail'
tokenizer = Tokenizer()  # char_level defaults to False
tokenizer.fit_on_texts(text)  # a bare string, not a list of strings
tokenizer.word_index

Output:

{'c': 1, 'h': 2, 'e': 3, 'k': 4, 'f': 5, 'a': 6, 'i': 7, 'l': 8}

Wrapping the text in a list solves the issue:

tokenizer.fit_on_texts([text])
tokenizer.word_index

{'check': 1, 'fail': 2}

I recommend checking that the input is a list of strings. If it is not, either emit a warning and wrap it in a list, or raise an error.
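A minimal sketch of the warn-and-wrap option could look like the following. The helper name `ensure_text_list` and the warning text are assumptions made for this example; nothing like it is confirmed to exist in keras_preprocessing. It only checks for a bare `str`, so lists, generators, and other iterables pass through unchanged.

```python
import warnings

def ensure_text_list(texts):
    """Hypothetical guard: wrap a bare string in a list, with a warning."""
    if isinstance(texts, str):
        warnings.warn(
            "Expected a list of strings but got a single string; "
            "wrapping it in a list.",
            UserWarning,
            stacklevel=2,
        )
        return [texts]
    # Any other input (list, generator, ...) is returned as-is.
    return texts
```

A caller could then write `tokenizer.fit_on_texts(ensure_text_list(text))`. To error out instead, the `warnings.warn` call could be replaced with `raise TypeError(...)`.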

Somaya-Alshare commented 4 years ago

Thanks for the tip