lmoroney / dlaicourse

Notebooks for learning deep learning

Tokenizing new sentences changes the word_index because of word frequency #35

Open authwork opened 5 years ago

authwork commented 5 years ago

I ran into a new case. I first tokenize one set of sentences and then tokenize another set. The word_index changes because of word frequency. For instance:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I, love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
================================
{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
================================

sen = [
    'there is a big car',
    'there is a big cat',
    'there is a big dog',
]
tokenizer.fit_on_texts(sen)
word_index = tokenizer.word_index
print(word_index)
================================
{'<OOV>': 1, 'my': 2, 'dog': 3, 'is': 4, 'love': 5, 'there': 6, 'a': 7, 'big': 8, 'i': 9, 'cat': 10, 'you': 11, 'do': 12, 'think': 13, 'amazing': 14, 'car': 15}
================================
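If I understand the Keras Tokenizer correctly, fit_on_texts accumulates frequencies in tokenizer.word_counts and then rebuilds word_index by sorting all known words by count, which is why the second call reshuffles the old indexes. Printing word_counts after both calls seems consistent with that (a minimal check, using the same tokenizer as above):

# word_counts holds the merged frequencies after both fit_on_texts calls;
# the index order above matches these counts sorted in descending order.
print(dict(tokenizer.word_counts))
# Roughly: {'i': 2, 'love': 3, 'my': 4, 'dog': 4, 'cat': 2, 'you': 2,
#           'do': 1, 'think': 1, 'is': 4, 'amazing': 1,
#           'there': 3, 'a': 3, 'big': 3, 'car': 1}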

I have a concern. When I want to train a model on updated news, I need to tokenize the new sentences. Because the word_index changes, I would have to re-train on all the old news with the new word_index. Is there an easier way to keep the old word_index and only add new indexes for the new words?
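To make the question concrete, the behaviour I am hoping for looks roughly like the hypothetical helper below: it leaves the existing word_index untouched and only appends indexes for unseen words. This is just a sketch applied to the word_index from the first fit_on_texts call, not something the Keras Tokenizer provides as far as I know.

def extend_word_index(word_index, new_sentences):
    """Assign indexes to unseen words without touching existing entries.
    (Hypothetical helper, only to illustrate the desired behaviour.)"""
    next_index = max(word_index.values()) + 1
    for sentence in new_sentences:
        for word in sentence.lower().split():
            word = word.strip('!?.,')  # rough cleanup, similar in spirit to the Keras filters
            if word and word not in word_index:
                word_index[word] = next_index
                next_index += 1
    return word_index

# Using the word_index from the first fit (before the second fit_on_texts):
# old entries keep their indexes, and 'there', 'a', 'big', 'car' get 12, 13, 14, 15.
extend_word_index(word_index, sen)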