keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

Tokenizer respects `filters` when `char_level` is `True` #302

Open paw-lu opened 4 years ago

paw-lu commented 4 years ago

Summary

As outlined in #301, this PR makes `keras.preprocessing.text.Tokenizer` drop the characters listed in the `filters` argument when `char_level=True`. Previously, `filters` had no effect in character-level mode.

Closes #301.

Behavior before

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts(["ae"])  # fit_on_texts expects a list of texts
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized even though it is listed in filters

Behavior after

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts(["ae"])  # fit_on_texts expects a list of texts
❯ tokenizer.word_index
{'a': 1}  # "e" is filtered out and not tokenized
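For readers who want to see the intended behavior without installing Keras, here is a minimal standalone sketch of character-level indexing with filtering. It is not the code in this PR: `build_char_index` is a hypothetical helper, and the `lower` default and the frequency-based ordering are assumptions modeled on how `Tokenizer` is documented to behave.

```python
from collections import Counter


def build_char_index(texts, filters="", lower=True):
    """Hypothetical helper: map each unfiltered character to a 1-based index."""
    # The third argument of str.maketrans lists characters to delete,
    # so translate() strips every character that appears in `filters`.
    drop_table = str.maketrans("", "", filters)
    counts = Counter()
    for text in texts:
        if lower:
            text = text.lower()
        # Count the characters that survive filtering.
        counts.update(text.translate(drop_table))
    # Most frequent characters get the lowest indices; ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {char: rank for rank, (char, _) in enumerate(ranked, start=1)}


print(build_char_index(["ae"], filters="e"))  # {'a': 1}
print(build_char_index(["ae"]))               # {'a': 1, 'e': 2}
```

The first call mirrors the "after" behavior above, and the second shows what happens when nothing is filtered.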


Related Issues

PR Overview