songyouwei / ABSA-PyTorch

Aspect Based Sentiment Analysis, PyTorch Implementations.
MIT License

Why do we build the tokenizer using both the train and test sets? #190

Open minhdang241 opened 3 years ago

minhdang241 commented 3 years ago

I don't think we should build the tokenizer on both the train and test sets for the non-BERT models. In production there will likely be words that are not in our vocabulary, so if we also use the test set to build the vocabulary, the measured performance will be biased.
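A minimal sketch of the alternative being suggested (hypothetical code, not the repo's actual `Tokenizer` class): fit the vocabulary on the training set only, and map any word unseen at test or production time to a shared `<unk>` index instead of giving it its own entry.

```python
class TrainOnlyTokenizer:
    """Builds its vocabulary from the training split only."""

    def __init__(self):
        # Reserve index 0 for padding and index 1 for unknown words.
        self.word2idx = {"<pad>": 0, "<unk>": 1}

    def fit(self, texts):
        # texts: iterable of training sentences (strings)
        for text in texts:
            for word in text.lower().split():
                if word not in self.word2idx:
                    self.word2idx[word] = len(self.word2idx)

    def encode(self, text):
        # Words never seen during training fall back to <unk>.
        unk = self.word2idx["<unk>"]
        return [self.word2idx.get(w, unk) for w in text.lower().split()]


tok = TrainOnlyTokenizer()
tok.fit(["the food was great"])
print(tok.encode("the service was great"))  # "service" -> <unk> index 1
```

With this setup, evaluation on the test set exercises the same unknown-word path the model would face in production.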

GeneZC commented 3 years ago

Although we build the tokenizer over the test set as well, the embeddings for words that do not exist in the pre-trained word embeddings are randomly initialized. That is, if a word is not available in the vocabulary of the pre-trained embedding, its vector is randomly initialized rather than looked up.
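The behavior described above can be sketched as follows (assumed names and shapes, not the repo's exact `build_embedding_matrix` implementation): vocabulary words found in the pre-trained vectors are copied over, and the rest keep a random initialization.

```python
import numpy as np

def build_embedding_matrix(word2idx, pretrained, embed_dim=4, seed=0):
    """word2idx: vocab mapping; pretrained: dict word -> vector (assumed format)."""
    rng = np.random.default_rng(seed)
    # Start every row from a small random initialization.
    matrix = rng.uniform(-0.25, 0.25, (len(word2idx), embed_dim))
    matrix[0] = 0.0  # padding index stays all-zero
    for word, idx in word2idx.items():
        vec = pretrained.get(word)
        if vec is not None:
            matrix[idx] = vec  # known word: copy the pre-trained vector
        # unknown word: keep its random row
    return matrix


word2idx = {"<pad>": 0, "good": 1, "zzzz": 2}
pretrained = {"good": np.ones(4)}
emb = build_embedding_matrix(word2idx, pretrained)
print(np.allclose(emb[1], 1.0))  # "good" copied from pre-trained vectors
```

So including test-set words in the vocabulary does not leak pre-trained information for words absent from the embedding file; those rows carry no signal beyond their random init.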