richliao / textClassifier

Text classifier for Hierarchical Attention Networks for Document Classification
Apache License 2.0
1.07k stars 379 forks

Tokenization performed with validation data (HATT) #26

Closed mukherjee-d closed 6 years ago

mukherjee-d commented 6 years ago

Thank you for your implementation! I have a question: in your code for the hierarchical attention model, you fit the tokenizer on both the training and validation data. Won't this bias your model?

richliao commented 6 years ago

I'm not sure I understand your question, but the embeddings, and the sequences built from them, are what the model is trying to learn. If a word is out of vocabulary for the training data (for example, it appears only in the validation data), its embedding will never be trained anyway. You can even freeze the embeddings during supervised training, as some SOTA approaches that rely on pretrained embeddings do.
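
To make the point about untrained embeddings concrete, here is a minimal sketch in plain Python. It is not the repository's code: `build_vocab`, `encode`, and the sample sentences are hypothetical stand-ins for a word-index tokenizer, with index 0 assumed to be padding and index 1 assumed to be the out-of-vocabulary (OOV) slot.

```python
from collections import Counter


def build_vocab(texts, max_words=None):
    """Map each word to an integer index, most frequent first.

    Index 0 is assumed reserved for padding and 1 for OOV words,
    so real words start at index 2 (an assumption of this sketch).
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, oov_index=1):
    """Convert a text to a list of word indices, using oov_index for unknown words."""
    return [vocab.get(word, oov_index) for word in text.lower().split()]


# Hypothetical toy data.
train_texts = ["the movie was great", "the plot was dull"]
val_texts = ["the acting was superb"]

# Vocabulary built from training data only vs. training plus validation data.
vocab_train_only = build_vocab(train_texts)
vocab_both = build_vocab(train_texts + val_texts)

# Words that get their own index only because validation text was included.
val_only_words = set(vocab_both) - set(vocab_train_only)

# With the training-only vocabulary, those words collapse to the OOV index.
val_encoded_train_vocab = encode(val_texts[0], vocab_train_only)

print(sorted(val_only_words))      # ['acting', 'superb']
print(val_encoded_train_vocab)     # [2, 1, 3, 1]
```

In this sketch, the words in `val_only_words` receive indices when the vocabulary is built from both splits, but they never occur in a training input, so the corresponding embedding rows would get no gradient updates and would keep whatever values they were initialized with.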