uakarsh / latr

Implementation of LaTr: Layout-Aware Transformer for Scene-Text VQA, a novel multimodal architecture for Scene Text Visual Question Answering (STVQA)
https://uakarsh.github.io/latr/
MIT License

word embedding layer #1

Closed by youngsheen 2 years ago

youngsheen commented 2 years ago

Thanks for your implementation. I notice that you didn't use the pre-trained word embedding layer from T5, but instead a randomly initialized embedding layer. Does this give better results?

self.language_emb = nn.Embedding(config['vocab_size'], config['hidden_state'])
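
For context, a minimal sketch of what initializing that layer from T5's pre-trained word embeddings could look like, assuming the Hugging Face transformers library (variable names here are illustrative, not the repo's code):

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

# Hypothetical sketch: keep the lookup table learnable, but start it from
# T5's pre-trained word embeddings instead of a random init.
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

language_emb = nn.Embedding(t5.config.vocab_size, t5.config.d_model)
with torch.no_grad():
    language_emb.weight.copy_(t5.get_input_embeddings().weight)
# language_emb remains trainable; only its initial weights change.
```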

uakarsh commented 2 years ago

Actually, I am not sure whether the language embedding should be initialized with the weights of T5's word embedding (that would be just a small modification in the repo). If you go to page 3 of the paper, just below Equation 1, the authors only mention that the lookup table is learnable, without saying that its weights were initialized from T5's embedding layer.

Did you find anywhere in the paper a mention that the language embedding layer's weights were initialized with T5's embedding layer?

youngsheen commented 2 years ago

The paper doesn't mention this setting. However, in the supplementary material I noticed that the authors use a vocabulary of 32,000 wordpieces, which differs from the pre-trained vocab size of 32,128.

BTW, in the fine-tuning model you initialize a new question embedding layer, self.question_emb = nn.Embedding(config['vocab_size'], config['hidden_state']), but the paper mentions that the question tokens use the same embedding layer as the OCR tokens:

We embed the OCR tokens and questions using Eq. (1) to obtain encoded OCR tokens E and encoded question features E^q
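
A minimal sketch of that shared-embedding reading of Eq. (1), with one lookup table embedding both token types (class and variable names are illustrative, not from the repo):

```python
import torch.nn as nn

# Single shared table embeds both OCR tokens and question tokens.
class SharedTokenEmbedding(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)  # one shared lookup table

    def forward(self, ocr_token_ids, question_token_ids):
        ocr_feats = self.token_emb(ocr_token_ids)             # encoded OCR tokens E
        question_feats = self.token_emb(question_token_ids)   # encoded question features E^q
        return ocr_feats, question_feats
```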

uakarsh commented 2 years ago

Actually, the reason I used 32,128 is that T5Tokenizer uses 32,128 tokens (as mentioned in the Hugging Face docs for T5). Maybe the authors rounded off the number and did not mention the exact count.
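
For reference, a quick way to compare the two numbers being discussed, assuming the Hugging Face transformers library (exact values can vary by version and checkpoint):

```python
from transformers import T5Config, T5Tokenizer

# Compare the tokenizer's vocabulary with the model's embedding matrix size.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
config = T5Config.from_pretrained("t5-base")

print(len(tokenizer))      # tokens the tokenizer actually produces (SentencePiece pieces + extra ids)
print(config.vocab_size)   # rows in the model's embedding matrix (32,128)
```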

As for your second point, I think I misunderstood it; I will modify it shortly. Thanks for that!

Regards, Akarsh