Open tRosenflanz opened 6 years ago
From: https://github.com/keras-team/keras/issues/10768 by @hadaev8
Tokenizer will fit/transform a string character by character if a single string is passed to the fit_on_texts/texts_to_sequences methods, regardless of the char_level setting. This happens because both methods expect a list of strings and iterate over their argument, so a bare string is split into characters. The line responsible when fitting: https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L205
and the one when transforming: https://github.com/keras-team/keras-preprocessing/blob/e002ebd40e888965686e8946acefe02f5a910576/keras_preprocessing/text.py#L293
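The mechanism can be sketched without Keras. This is a simplified stand-in, not the library's code: `collect_tokens` is a hypothetical name, and `str.split()` stands in for the real word-splitting step.

```python
def collect_tokens(texts):
    """Hypothetical, simplified stand-in for the loop over `texts`."""
    tokens = []
    for text in texts:  # iterating over a bare string yields single characters
        tokens.extend(text.lower().split())  # each "text" is split into words
    return tokens


# A bare string is consumed one character at a time; spaces split to nothing.
print(collect_tokens("check check fail"))
# A list holding the same string is consumed one document at a time.
print(collect_tokens(["check check fail"]))  # ['check', 'check', 'fail']
```

The first call yields only single-character tokens, which matches the character-level word_index reported in this issue.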
Reproducible code illustrating the problem with fit_on_texts:
from keras.preprocessing.text import Tokenizer
text = 'check check fail'
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
tokenizer.word_index
Output:
{'c': 1, 'h': 2, 'e': 3, 'k': 4, 'f': 5, 'a': 6, 'i': 7, 'l': 8}
Wrapping the text in a list solves the issue:
tokenizer.fit_on_texts([text])
tokenizer.word_index
{'check': 1, 'fail': 2}
I recommend checking that the argument is a list of strings and, if it is not, either emitting a warning and wrapping it in a list or raising an error.
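A minimal sketch of the "warn and wrap" variant of that check, written as a standalone helper rather than a patch to the library. The name `ensure_text_list` is an assumption made for illustration and is not part of Keras.

```python
import warnings


def ensure_text_list(texts):
    """Hypothetical helper: wrap a bare string in a list, with a warning."""
    if isinstance(texts, str):
        warnings.warn(
            "Expected a list of strings but got a single string; "
            "wrapping it in a list.",
            UserWarning,
            stacklevel=2,
        )
        return [texts]
    # Lists, tuples and generators of strings pass through unchanged.
    return texts


# Intended use (requires Keras, so shown as a comment only):
#   tokenizer.fit_on_texts(ensure_text_list(text))
print(ensure_text_list(["check check fail"]))  # ['check check fail']
```

The stricter alternative would raise a `TypeError` in the `isinstance` branch instead of warning. Which of the two is preferable depends on how much existing code already passes bare strings.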
Thanks for the tip