Closed youngsheen closed 2 years ago
Actually, I am not sure whether language_embedding was initialized with the weights of T5's word embedding (although that would be only a small modification to the repo). On page 3, just below Eq. (1), the paper only says that the lookup table is learnable; it does not mention that the weights were initialized from T5's embedding layer.
Did you find anywhere in the paper a mention that the language embedding layer's weights were initialized with T5's embedding layer?
The paper doesn't mention these settings. However, in the supplemental material I noticed that the authors used a vocabulary of 32,000 wordpieces, which differs from the pre-trained vocab size of 32,128.
BTW, in the fine-tuning model you initialize a new question embedding layer:
self.question_emb = nn.Embedding(config['vocab_size'], config['hidden_state'])
but the paper mentions that the question tokens use the same embedding layer as the OCR tokens:
We embed the OCR tokens and questions using Eq. (1) to obtain encoded OCR tokens E and encoded question features E^q
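For what it's worth, sharing one lookup table between the two streams is a small change. Here is a minimal PyTorch sketch; the class and variable names are illustrative, not from the repo:

```python
import torch
import torch.nn as nn

class SharedTokenEmbedder(nn.Module):
    """Illustrative sketch: one embedding table for OCR and question tokens."""

    def __init__(self, vocab_size, hidden_state):
        super().__init__()
        # Single learnable lookup table used for both token streams,
        # as in the quoted sentence about Eq. (1).
        self.language_emb = nn.Embedding(vocab_size, hidden_state)

    def forward(self, ocr_ids, question_ids):
        e_ocr = self.language_emb(ocr_ids)      # encoded OCR tokens E
        e_q = self.language_emb(question_ids)   # encoded question features E^q
        return e_ocr, e_q

embedder = SharedTokenEmbedder(vocab_size=32128, hidden_state=768)
e_ocr, e_q = embedder(torch.tensor([[1, 2, 3]]), torch.tensor([[4, 5]]))
```

With this, dropping `self.question_emb` and embedding the question ids through `self.language_emb` would match the quoted text.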
Actually, the reason I used 32,128 is that T5Tokenizer uses 32,128 tokens (mentioned in the Hugging Face docs of T5); maybe the authors rounded the number and did not report the exact figure.
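One hedged guess about where 32,128 comes from (an assumption on my part, not something the paper or repo states): T5's SentencePiece vocabulary has 32,000 pieces, the released checkpoints add 100 sentinel tokens (`<extra_id_0>` … `<extra_id_99>`), and the embedding matrix is then padded up to a multiple of 128:

```python
def pad_to_multiple(n, multiple=128):
    """Round n up to the nearest multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

sentencepieces = 32_000   # vocab size reported in the supplemental material
sentinel_tokens = 100     # T5's <extra_id_*> sentinels (assumed included)
print(pad_to_multiple(sentencepieces + sentinel_tokens))  # 32128
```

That would make 32,000 (the paper's number) and 32,128 (the checkpoint's embedding rows) consistent rather than contradictory.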
As for your second point, I think I misunderstood it; I will fix it shortly. Thanks for pointing it out!
Regards, Akarsh
Thanks for your implementation. I noticed that you didn't use the pre-trained word embedding layer from T5, but a randomly initialized embedding layer instead:
self.language_emb = nn.Embedding(config['vocab_size'], config['hidden_state'])
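If one did want to start from T5's weights, the swap is small. A sketch, where the pre-trained tensor is a random placeholder standing in for the real T5 matrix and the config keys mirror the snippet above:

```python
import torch
import torch.nn as nn

config = {'vocab_size': 32128, 'hidden_state': 768}

# Placeholder for the real pre-trained matrix, e.g.
# T5ForConditionalGeneration.from_pretrained('t5-base').shared.weight
pretrained_weight = torch.randn(config['vocab_size'], config['hidden_state'])

language_emb = nn.Embedding(config['vocab_size'], config['hidden_state'])
with torch.no_grad():
    language_emb.weight.copy_(pretrained_weight)

# The table remains learnable, so fine-tuning can still update it.
assert language_emb.weight.requires_grad
```

This keeps the layer a learnable lookup table, as the paper describes, while only changing its initialization.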