keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

Tokenizer respects `filters` when `char_level` is `True` #302

Open paw-lu opened 4 years ago

paw-lu commented 4 years ago

Summary

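As outlined in #301, this PR makes `keras.preprocessing.text.Tokenizer` remove the characters listed in the `filters` argument when `char_level=True`. Previously, `filters` was ignored in character-level mode, so filtered characters still ended up in the vocabulary.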

Closes #301.

Behavior before

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts(["ae"])
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized even though it is listed in filters

Behavior after

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts(["ae"])
❯ tokenizer.word_index
{'a': 1}  # "e" is filtered out and not tokenized
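
For reviewers who want the intended behavior without reading the diff, here is a minimal standalone sketch of character-level indexing that honors a filter string. It is not the code in this PR: `char_level_word_index` is a hypothetical helper written for illustration, it does not import Keras, and it skips the lowercasing and OOV handling that `Tokenizer` performs.

```python
from collections import Counter


def char_level_word_index(texts, filters=""):
    """Map each character to an index, skipping characters in `filters`.

    Hypothetical helper for illustration only; not part of Keras.
    """
    # With a third argument, str.maketrans maps each listed character to
    # None, so str.translate deletes those characters from the text.
    delete_table = str.maketrans("", "", filters)

    counts = Counter()
    for text in texts:
        counts.update(text.translate(delete_table))

    # More frequent characters get lower indices. Indices start at 1, and
    # ties keep first-seen order because sorted() is stable.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {char: index for index, (char, _) in enumerate(ranked, start=1)}


print(char_level_word_index(["ae"], filters="e"))
print(char_level_word_index(["ae"]))
```

The first call corresponds to the "after" behavior above (the filtered character never reaches the vocabulary), and the second call, with no filters, corresponds to the vocabulary produced before this change.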

Related Issues

PR Overview

[x] This PR requires new unit tests [y/n] (make sure tests are included)
[ ] This PR requires updating the documentation [y/n] (make sure the docs are up-to-date)
[x] This PR is backwards compatible [y/n]
[ ] This PR changes the current API [y/n] (all API changes need to be approved by fchollet)