kdexd / virtex

[CVPR 2021] VirTex: Learning Visual Representations from Textual Annotations
http://kdexd.xyz/virtex
MIT License

Question about SentencePiece [SOS] and [EOS] ID. #12

Closed: nooralahzadeh closed this issue 4 years ago

nooralahzadeh commented 4 years ago

Hi, I saw that in the SentencePieceTrainer call you set the EOS, BOS, MASK and PAD tokens to zero, as below: " --bos_id=-1 --eos_id=-1" " --control_symbols=[SOS],[EOS],[MASK]". However, during captioning you define sos_index: int = 1, eos_index: int = 2. I am wondering whether these settings have any effect?

kdexd commented 4 years ago

By default, SentencePieceTrainer assigns ID 1 to <s> and ID 2 to </s>. Check here

I prefer [SOS] and [EOS] in text instead of <s> and </s>, so I passed my custom symbols as --control_symbols. Internally, SentencePieceTrainer reserves ID 0 for <unk> (which cannot be changed as far as I know), and other control symbols are assigned from ID 3 (in presence of default <s> and </s>) or ID 1 (in absence of <s> and </s>).

So I turned off default <s> and </s>, and instead provided [SOS] and [EOS] so they get ID 1 and 2 respectively.
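The assignment rule described above can be illustrated with a small sketch. This is a toy model in plain Python, not SentencePiece itself: the function name assign_special_ids is hypothetical, and it assumes that built-in pieces take their requested IDs first and that each control symbol then takes the lowest ID still free.

```python
def assign_special_ids(control_symbols, unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    """Toy model of the special-token ID layout (hypothetical helper)."""
    ids = {}
    # Built-in pieces take their requested IDs; a negative ID disables the piece.
    builtin = (("<unk>", unk_id), ("<s>", bos_id), ("</s>", eos_id), ("<pad>", pad_id))
    for piece, idx in builtin:
        if idx >= 0:
            ids[idx] = piece
    # Each control symbol takes the lowest ID that is still free.
    next_id = 0
    for symbol in control_symbols:
        while next_id in ids:
            next_id += 1
        ids[next_id] = symbol
    return {piece: idx for idx, piece in sorted(ids.items())}


symbols = ["[SOS]", "[EOS]", "[MASK]"]

# Defaults kept: <s> and </s> occupy IDs 1 and 2, so control symbols start at 3.
with_defaults = assign_special_ids(symbols)

# Defaults disabled (as with --bos_id=-1 --eos_id=-1): control symbols start at 1.
without_defaults = assign_special_ids(symbols, bos_id=-1, eos_id=-1)

print(with_defaults)
print(without_defaults)
```

Under these assumptions, the second layout gives [SOS] ID 1 and [EOS] ID 2, which matches sos_index = 1 and eos_index = 2 in the captioning code.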

kdexd commented 4 years ago

Edited title for others to search easily. :-)

nooralahzadeh commented 4 years ago

Thanks. How about [PAD]'s id, is it zero by default?

kdexd commented 4 years ago

[PAD] and <unk> are the same: we use the same token to right-pad captions and to represent out-of-vocabulary tokens, similar to recent image captioning models. It is ID 0 by default. As far as I remember, I always refer to this token as <unk>, and the corresponding variable name in code is padding_idx.
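As a rough illustration of right-padding with the shared ID, here is a minimal sketch in plain Python. It is not the repository's code: the helper pad_captions and the token IDs 17, 42, 8 and 5 are made up for the example, and only the values 0, 1 and 2 come from the discussion above.

```python
PADDING_IDX = 0  # shared by <unk> and padding, as described above
SOS_INDEX = 1    # [SOS]
EOS_INDEX = 2    # [EOS]


def pad_captions(captions, padding_idx=PADDING_IDX):
    """Wrap each caption in [SOS]/[EOS] and right-pad to the longest one.

    Hypothetical helper for illustration only.
    """
    wrapped = [[SOS_INDEX, *tokens, EOS_INDEX] for tokens in captions]
    lengths = [len(caption) for caption in wrapped]
    max_len = max(lengths)
    padded = [c + [padding_idx] * (max_len - len(c)) for c in wrapped]
    return padded, lengths


# Two captions with made-up token IDs.
batch, lengths = pad_captions([[17, 42, 8], [5]])
print(batch)
print(lengths)
```

Because ID 0 stands for both padding and out-of-vocabulary tokens, a 0 inside a caption cannot be told apart from padding by ID alone, which is why the sketch also returns the unpadded lengths.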