mraduldubey opened this issue 5 years ago
Hi @mraduldubey, you are right, the character embeddings are indeed initialized randomly. However, at training time the loss is backpropagated all the way back to the character embeddings, which are therefore updated (so this is supervised learning).
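To make that concrete, here is a rough sketch in PyTorch (the blog's actual code is in TensorFlow; the sizes and names below are made up for illustration): the character embedding matrix is just another trainable parameter, initialized randomly, whose outputs feed a character-level BiLSTM that produces one vector per word.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
NUM_CHARS, CHAR_DIM, CHAR_HIDDEN = 100, 25, 50

class CharWordEncoder(nn.Module):
    """Builds a word vector from its characters (final states of a char BiLSTM)."""
    def __init__(self):
        super().__init__()
        # Randomly initialized lookup table; its weights are ordinary
        # trainable parameters of the network.
        self.char_emb = nn.Embedding(NUM_CHARS, CHAR_DIM)
        nn.init.xavier_uniform_(self.char_emb.weight)   # same spirit as the blog's xavier init
        self.char_lstm = nn.LSTM(CHAR_DIM, CHAR_HIDDEN,
                                 bidirectional=True, batch_first=True)

    def forward(self, char_ids):            # char_ids: (num_words, max_word_len)
        embedded = self.char_emb(char_ids)  # (num_words, max_word_len, CHAR_DIM)
        _, (h_n, _) = self.char_lstm(embedded)
        # Concatenate the final forward and backward hidden states
        # -> one vector per word, built purely from its characters.
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (num_words, 2 * CHAR_HIDDEN)
```

Nothing here "knows" anything at initialization; the weights only become meaningful because the tag loss pushes gradients into them during training.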
Thanks @guillaumegenthial for the reply. In that case, wouldn't the ground truth have to be a vector representing the whole word? So what exactly is the ground truth here?
You train the network to predict the tags. It turns out that some parameters of the network correspond to the character embeddings, so these are trained to help the network predict the tags. The ground truth is the tag, and the learned embeddings help predict this tag.
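Concretely, here is a tiny illustrative sketch (again PyTorch rather than the blog's TensorFlow, with invented shapes): the loss is computed against the gold tags only, yet after `backward()` the character embedding matrix has a gradient and is moved by the optimizer like any other weight.

```python
import torch
import torch.nn as nn

NUM_CHARS, CHAR_DIM, NUM_TAGS = 100, 25, 9       # hypothetical sizes

# Tiny stand-in model: char embeddings -> mean over characters -> tag scores.
char_emb = nn.Embedding(NUM_CHARS, CHAR_DIM)
decoder = nn.Linear(CHAR_DIM, NUM_TAGS)
optimizer = torch.optim.Adam(list(char_emb.parameters()) + list(decoder.parameters()))

char_ids = torch.randint(0, NUM_CHARS, (4, 7))   # 4 words, 7 chars each (dummy data)
gold_tags = torch.randint(0, NUM_TAGS, (4,))     # the ground truth: one tag per word

logits = decoder(char_emb(char_ids).mean(dim=1)) # (4, NUM_TAGS)
loss = nn.functional.cross_entropy(logits, gold_tags)

loss.backward()
print(char_emb.weight.grad is not None)          # True: the tag loss reaches the embeddings
optimizer.step()                                 # ...and updates them
```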
So you mean that the word representation network, the contextual word representation network, and the decoder, though described separately in the blog, are trained simultaneously, with the tags as the ground truth, and backpropagation runs from the final layer all the way back to the word representation network?
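As a rough end-to-end sketch of that setup (PyTorch for brevity, a plain softmax decoder standing in for the blog's CRF, all sizes invented), a single optimizer over `model.parameters()` updates the decoder, both LSTMs, and the word and character embeddings together:

```python
import torch
import torch.nn as nn

# Hypothetical sizes
NUM_CHARS, NUM_WORDS, NUM_TAGS = 100, 5000, 9
CHAR_DIM, CHAR_HIDDEN, WORD_DIM, WORD_HIDDEN = 25, 50, 100, 150

class Tagger(nn.Module):
    def __init__(self):
        super().__init__()
        # 1) word representation: word embedding + char-level BiLSTM output
        self.char_emb = nn.Embedding(NUM_CHARS, CHAR_DIM)
        self.char_lstm = nn.LSTM(CHAR_DIM, CHAR_HIDDEN,
                                 bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(NUM_WORDS, WORD_DIM)
        # 2) contextual word representation: BiLSTM over the sentence
        self.word_lstm = nn.LSTM(WORD_DIM + 2 * CHAR_HIDDEN, WORD_HIDDEN,
                                 bidirectional=True, batch_first=True)
        # 3) decoder: a simple linear + softmax here (the blog uses a CRF)
        self.decoder = nn.Linear(2 * WORD_HIDDEN, NUM_TAGS)

    def forward(self, word_ids, char_ids):
        # word_ids: (sent_len,)   char_ids: (sent_len, max_word_len)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)        # (sent_len, 2*CHAR_HIDDEN)
        words = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
        context, _ = self.word_lstm(words.unsqueeze(0))        # (1, sent_len, 2*WORD_HIDDEN)
        return self.decoder(context.squeeze(0))                # (sent_len, NUM_TAGS)

model = Tagger()
# One optimizer over *all* parameters: decoder, both LSTMs, word and char embeddings.
optimizer = torch.optim.Adam(model.parameters())

word_ids = torch.randint(0, NUM_WORDS, (6,))       # a dummy 6-word sentence
char_ids = torch.randint(0, NUM_CHARS, (6, 8))     # 8 chars per word (dummy)
gold_tags = torch.randint(0, NUM_TAGS, (6,))       # ground truth tags

loss = nn.functional.cross_entropy(model(word_ids, char_ids), gold_tags)
loss.backward()     # gradients flow from the tag loss back through every component
optimizer.step()
```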
I have a conceptual doubt about the part where we obtain word-level representations from characters using the final outputs of the BiLSTM network. We initialize the character embeddings with xavier_initialization, which only ensures that the cells do not saturate. So how do these random embeddings capture any meaningful information? And how is this network trained, or is it unsupervised?