Tokenizer's `num_words` filtering is based on word's index

keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.


Tokenizer's `num_words` filtering is based on word's index #311

Open pierreelliott opened 4 years ago

pierreelliott commented 4 years ago

In the method `texts_to_sequences_generator` (of the `Tokenizer`), the `num_words` check is based on the word's index. I understand that this check is fast, but wouldn't it be a problem if the ordering changes (i.e., if it is no longer based on frequency)?

https://github.com/keras-team/keras-preprocessing/blob/5949df1c059a53d98a6004d5bfc93708e5ec6c4a/keras_preprocessing/text.py#L333-L340
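To make the concern concrete, here is a minimal standalone sketch of an index-based cutoff. It is a simplified illustration, not the library's code: the function `texts_to_ids` and both example mappings are hypothetical, and out-of-vocabulary handling is left out.

```python
def texts_to_ids(words, word_index, num_words=None):
    """Keep a word only if its index is below num_words (illustrative sketch)."""
    ids = []
    for w in words:
        i = word_index.get(w)
        if i is None:
            continue  # unknown word; this sketch has no OOV token
        if num_words and i >= num_words:
            continue  # dropped purely because of its index value
        ids.append(i)
    return ids


words = ["the", "cat", "the", "mat", "the", "cat"]

# Frequency-ordered index: the most frequent word gets index 1.
by_frequency = {"the": 1, "cat": 2, "mat": 3}
# External mapping whose order is unrelated to frequency (assumed for illustration).
external = {"mat": 1, "cat": 2, "the": 3}

freq_ids = texts_to_ids(words, by_frequency, num_words=3)
ext_ids = texts_to_ids(words, external, num_words=3)

print(freq_ids)  # the rare word "mat" is dropped
print(ext_ids)   # the most frequent word "the" is dropped
```

With the frequency-ordered index the cutoff removes the rarest word, but with the external mapping the same cutoff removes the most frequent one, because only the index value is inspected.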

Dref360 commented 4 years ago

Hello. Note that I'm far from an expert in NLP.

Do you have an example where you wouldn't use frequency?

As long as `word_index` is sorted in order of importance, it should work, I think.

pierreelliott commented 4 years ago

Hi,

In my current project, we defined an external index/word mapping, because our dataset often changes but our vocabulary does not. So the tokens won't always be sorted in order of importance.

For the record, I don't need this particular method (yet, I think...), but I found the check's assumption about the data a little bit "hard".
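One possible way to decouple the cutoff from index order is to filter on observed counts while keeping the external mapping fixed. The sketch below is only an illustration under assumptions: `encode_most_frequent` and `max_vocab` are hypothetical names, not part of `keras-preprocessing`, and the texts are assumed to be already split into words.

```python
from collections import Counter


def encode_most_frequent(texts, word_index, max_vocab):
    """Encode with a fixed mapping, keeping only the max_vocab most frequent words."""
    # Count only words that exist in the external mapping.
    counts = Counter(w for text in texts for w in text if w in word_index)
    # Select the words to keep by frequency, independent of their index values.
    keep = {w for w, _ in counts.most_common(max_vocab)}
    return [[word_index[w] for w in text if w in keep] for text in texts]


texts = [["the", "cat", "the", "mat"], ["the", "cat"]]
# External mapping whose order is unrelated to frequency (assumed for illustration).
external = {"mat": 1, "cat": 2, "the": 3}

encoded = encode_most_frequent(texts, external, max_vocab=2)
print(encoded)
```

This trades the fast integer comparison for a set lookup per word, and the kept set has to be recomputed whenever the dataset changes.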